Creating an MLP in C++

The neural net presented here is implemented so that its structure is easy to understand. This comes at the cost of runtime performance.

The goal of this model is to accurately classify the samples of the Iris dataset.

Most of the C++ code shown here does not follow the Rule of Three (or Five). The code is simplified and should not be used in production. Find the full code here: GitHub.

Classifying the Iris dataset is not a very complicated task, so the required MLP does not have to be complex. It consists of a single hidden layer of size 30 with a ReLU activation function. The output layer represents the three possible classes of the dataset as a one-hot encoding, so it has 3 outputs with SoftMax as the activation function. Finally, the loss is calculated using the cross-entropy loss.
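
To make the architecture concrete, the following sketch tallies the learnable parameters of the two dense layers and one-hot encodes a class index. It assumes the usual Iris layout of 4 input features and 3 classes; the helper names `dense_param_count` and `one_hot` are illustrative and are not taken from the linked C++ code.

```python
def dense_param_count(n_in, n_out):
    """Weights (n_in x n_out) plus one bias value per output."""
    return n_in * n_out + n_out


def one_hot(class_index, num_classes):
    """Return a list with a single 1.0 at position class_index."""
    return [1.0 if k == class_index else 0.0 for k in range(num_classes)]


N_FEATURES = 4   # assumption: the four Iris measurements per sample
N_HIDDEN = 30    # hidden layer size from the text
N_CLASSES = 3    # three Iris species

hidden_params = dense_param_count(N_FEATURES, N_HIDDEN)   # 4 * 30 + 30
output_params = dense_param_count(N_HIDDEN, N_CLASSES)    # 30 * 3 + 3
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
print(one_hot(1, N_CLASSES))
```

Under these assumptions the whole network has only a few hundred parameters, which fits the remark that the task does not need a complex model.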

Structure of a Basic MLP

Classes we need to implement

We need to implement the following classes:

Dense Layer
CrossEntropy Loss
SGD Optimizer
ReLU Activation
Softmax Activation

As the data structure that holds the allocated memory, we will use the templated Tensor class.

Fully connected / Dense layer

Given:

Input_size N
Output_size M

Learnable parameters:

Weights $w \in ℝ^{N \times M}$
Bias $b \in ℝ^{M}$

Data to save:

previous input $v_{i} \in ℝ^{N}$

Forward

Given input $v_{i} \in ℝ^{N}$ we get output $v_{o} \in ℝ^{M}$ using $f (v_{i})$ .

f (v_{i}) = b + v_{i} w = v_{o}

python

def forward(self, input_tensor):
    self.previous_input=input_tensor.copy()
    output = self.bias + np.matmul(input_tensor,self.weights)
    return output

This can be rewritten by appending $b$ to $w$ and creating $w_{b} \in ℝ^{(N + 1) \times M}$ .

f (v_{i}) = [\begin{array}{cc} v_{i} & 1 \end{array}] [\begin{array}{c} w \\ b \end{array}] = {\hat{v}}_{i} w_{b} = v_{o}

python

def forward(self, input_tensor):
    #add ones to the input in order to add the bias 
    input_tensor=np.c_[input_tensor,np.ones(input_tensor.shape[0])]
    self.previous_input=input_tensor.copy()
    #calculate the output. Here weights already include the bias
    output = np.matmul(input_tensor,self.weights)
    return output
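
As a quick check of the rewrite above, the sketch below computes the output once as $b + v_{i} w$ and once as $\hat{v}_{i} w_{b}$ and compares the results. It uses plain Python lists instead of NumPy so that it runs on its own; the helper `matmul` and the toy values are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy sizes: batch of 2 samples, N = 3 inputs, M = 2 outputs.
v = [[1.0, 2.0, 3.0],
     [0.5, -1.0, 4.0]]
w = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]
b = [1.0, -1.0]

# Variant 1: f(v) = b + v w
out_separate = [[s + bias for s, bias in zip(row, b)] for row in matmul(v, w)]

# Variant 2: append a 1 to every input row and stack b below w.
v_hat = [row + [1.0] for row in v]   # shape: batch x (N + 1)
w_b = w + [b]                        # shape: (N + 1) x M
out_combined = matmul(v_hat, w_b)

same = all(abs(p - q) < 1e-12
           for row_1, row_2 in zip(out_separate, out_combined)
           for p, q in zip(row_1, row_2))
print(same)
```

Both variants produce the same output; the combined form only moves the bias into the matrix product.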

Backward

Given error $r \in ℝ^{M}$ :

b (r) = r w^{T}

Then given an optimizer $O$ :

\begin{aligned} Δ w & = v_{i}^{T} r \\ Δ b & = \sum_{k} r_{k} \\ w_{b} & = O (w, b, Δ w, Δ b) \end{aligned}

Here $r_{k}$ is the error of the $k$-th sample in the batch, so $Δ b$ sums the error over all samples.

python

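def backward(self, error_tensor):
    # error passed on to the previous layer: r w^T
    output = np.matmul(error_tensor, self.weights.T)
    if self.optimizer is not None:
        # gradients w.r.t. weights and bias (bias: error summed over the batch)
        self._gradient_weights = np.matmul(self.previous_input.T, error_tensor)
        self._gradient_bias = error_tensor.sum(axis=0)
        # the same update rule is applied to both parameter sets
        self.weights = self.optimizer.update(self.weights, self._gradient_weights)
        self.bias = self.optimizer.update(self.bias, self._gradient_bias)
    return output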

Or using $w_{b}$ :

python

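def backward(self, error_tensor):
    # drop the last row of the weights (the bias) so the returned error has N columns again
    output = np.matmul(error_tensor, self.weights[:-1].T)
    if self.optimizer is not None:
        # previous_input already contains the appended ones, so this gradient covers the bias too
        self._gradient_weights = np.matmul(self.previous_input.T, error_tensor)
        self.weights = self.optimizer.update(self.weights, self._gradient_weights)
    return output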

See C++ code for the dense layer
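
The backward formulas of the dense layer can be checked numerically. The sketch below is a plain-Python illustration with assumed toy values: it defines a scalar test loss whose gradient with respect to the layer output is exactly $r$, computes $r w^{T}$, $v_{i}^{T} r$ and the batch sum of $r$, and compares two entries against central finite differences. The helpers `matmul`, `transpose` and `loss` are hypothetical and not part of the linked code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def loss(v, w, b, r):
    """Scalar test loss sum(r * (v w + b)); its gradient w.r.t. the output is r."""
    out = [[s + bias for s, bias in zip(row, b)] for row in matmul(v, w)]
    return sum(ri * oi for r_row, o_row in zip(r, out) for ri, oi in zip(r_row, o_row))


v = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]    # batch of 2 samples, N = 3
w = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # N x M with M = 2
b = [1.0, -1.0]
r = [[1.0, -2.0], [0.5, 3.0]]              # incoming error, one row per sample

# Analytic gradients, mirroring the backward pass.
grad_v = matmul(r, transpose(w))           # r w^T, passed to the previous layer
grad_w = matmul(transpose(v), r)           # v^T r
grad_b = [sum(col) for col in zip(*r)]     # error summed over the batch

eps = 1e-6

# Central finite difference for the single weight w[2][1].
w_plus = [row[:] for row in w]
w_minus = [row[:] for row in w]
w_plus[2][1] += eps
w_minus[2][1] -= eps
numeric_w = (loss(v, w_plus, b, r) - loss(v, w_minus, b, r)) / (2 * eps)

# Central finite difference for the single input v[0][0].
v_plus = [row[:] for row in v]
v_minus = [row[:] for row in v]
v_plus[0][0] += eps
v_minus[0][0] -= eps
numeric_v = (loss(v_plus, w, b, r) - loss(v_minus, w, b, r)) / (2 * eps)

print(abs(numeric_w - grad_w[2][1]) < 1e-6, abs(numeric_v - grad_v[0][0]) < 1e-6)
```

A check like this is a cheap way to catch transposition or shape mistakes before porting the layer to C++.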

CrossEntropy Loss

Data to save:

previous input $x$

Forward

$l$ is the label tensor consisting of one-hot encoded labels

f (x, l) = \sum - \log (x_{l = 1} + ϵ)

python

def forward(self, prediction_tensor, label_tensor):
	self.previous_input=prediction_tensor.copy()
	# sum of each vector in prediction_tensor == 1
	# take the log of every prediction with label == 1 and sum them. eps for log(0) prevention
	loss = np.sum( - np.log( prediction_tensor[label_tensor==1] + np.finfo(float).eps) )
	return loss

Backward

$l$ is the label tensor consisting of one-hot encoded labels

b (l) = - \frac{l}{x + ϵ}

python

def backward(self, label_tensor):
	return -label_tensor / (self.previous_input + np.finfo(float).eps)

See C++ code for the CrossEntropyLoss
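
A small worked example may help to see what the loss and its gradient look like. The sketch below re-implements both formulas in plain Python for two made-up predictions; the function names and the values are assumptions for illustration only.

```python
import math
import sys

EPS = sys.float_info.epsilon  # machine epsilon, the same value as np.finfo(float).eps


def cross_entropy_forward(predictions, labels):
    """Sum of -log(p + eps) over the entries whose one-hot label is 1."""
    return sum(-math.log(p + EPS)
               for p_row, l_row in zip(predictions, labels)
               for p, l in zip(p_row, l_row) if l == 1)


def cross_entropy_backward(predictions, labels):
    """Element-wise -l / (p + eps); zero wherever the label is 0."""
    return [[-l / (p + EPS) for p, l in zip(p_row, l_row)]
            for p_row, l_row in zip(predictions, labels)]


# Two samples, three classes; every prediction row sums to 1.
predictions = [[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1]]
labels = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]

loss = cross_entropy_forward(predictions, labels)
gradient = cross_entropy_backward(predictions, labels)

print(round(loss, 4))
print(gradient)
```

Only the predicted probability of the true class contributes: the loss here is $-\log(0.7) - \log(0.8)$, roughly 0.58, and the gradient is non-zero only at the labelled positions.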

SGD Optimizer

Given:

learning_rate $μ$

Learnable parameters: None

Data to save: None

Update

w_{i + 1} = w_{i} - μ \cdot Δ w_{i}

python

def update(self, weight_tensor, gradient_tensor):
	updated_weights = weight_tensor - self.learning_rate * gradient_tensor
	return updated_weights

The same function can be used for the bias

b_{i + 1} = b_{i} - μ \cdot Δ b_{i}

or both combined

w_{b, i + 1} = w_{b, i} - μ \cdot Δ w_{b, i}

See C++ code for the SGD Optimizer
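
To see the update rule at work, the following sketch applies it repeatedly to a simple quadratic, $(w - t)^{2}$ per element, whose gradient is $2 (w - t)$. The function `sgd_update`, the target values and the learning rate are assumptions chosen for illustration.

```python
def sgd_update(weights, gradients, learning_rate):
    """One SGD step: w_new = w - learning_rate * gradient, element by element."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


weights = [0.0, 10.0]
target = [3.0, -1.0]   # minimum of the toy objective sum((w - t)^2)

for _ in range(100):
    gradients = [2.0 * (w - t) for w, t in zip(weights, target)]
    weights = sgd_update(weights, gradients, learning_rate=0.1)

print(weights)
```

With a learning rate of 0.1 every step shrinks the distance to the minimum by a factor of 0.8, so after 100 steps the weights are practically at the target.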

ReLU Activation

Given:

Input tensor $x$

Learnable parameters: None

Data to save:

Previous input $x$

Forward

f (x) = \begin{cases} x & x \geq 0 \\ 0 & \text{otherwise} \end{cases}

python

def forward(self, input_tensor):
	self.previous_input=input_tensor.copy()
	#set every negative value to 0
	input_tensor[input_tensor<0]=0
	return input_tensor

Backward

Given error $y$ and previous input $x$ :

b (y) = \begin{cases} y & x \geq 0 \\ 0 & \text{otherwise} \end{cases}

python

def backward(self,error_tensor):
	#set every value where the input was negative to 0
	error_tensor[self.previous_input<0]=0
	return error_tensor

See C++ code for the ReLU activation function
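
The following plain-Python sketch runs both ReLU passes on a small made-up tensor. Unlike the NumPy snippets it returns new lists instead of modifying its arguments in place; the function names and values are illustrative assumptions.

```python
def relu_forward(x):
    """Keep values >= 0 and replace negative values with 0."""
    return [[value if value >= 0 else 0.0 for value in row] for row in x]


def relu_backward(error, previous_input):
    """Pass the error through where the input was >= 0, block it elsewhere."""
    return [[e if value >= 0 else 0.0 for e, value in zip(e_row, x_row)]
            for e_row, x_row in zip(error, previous_input)]


x = [[-1.5, 0.0, 2.0],
     [3.0, -0.1, 0.5]]
error = [[10.0, 20.0, 30.0],
         [40.0, 50.0, 60.0]]

activated = relu_forward(x)
passed_error = relu_backward(error, x)

print(activated)
print(passed_error)
```

Note that the error is masked by the sign of the saved input, not by the sign of the error itself.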

SoftMax Activation

Given:

Input tensor $x$

Learnable parameters: None

Data to save:

Previous output $y$

Forward

f (x) = \frac{\exp (x)}{\sum (\exp (x))}

python

def forward(self, input_tensor):
    # shift each row by its maximum to increase numerical stability: x_new = x - max(x)
    shifted = input_tensor - np.max(input_tensor, axis=1, keepdims=True)
    # compute exp(x) once and reuse it for numerator and denominator
    exponentials = np.exp(shifted)
    # axis=1 -> one sum per input row; keepdims=True lets broadcasting divide each row by its own sum
    output = exponentials / np.sum(exponentials, axis=1, keepdims=True)
    self.previous_output = output.copy()
    return output

Backward

Given error $e$ and the output $y$ of the forward pass:

b (e) = y \odot (e - \sum_{j} e_{j} y_{j})

python

def backward(self, error_tensor):
	# calc inner sum
	row_sum = np.sum(error_tensor * self.previous_output, axis=1, keepdims=True)
	error_tensor -= row_sum
	error_tensor *= self.previous_output
	return error_tensor

See C++ code for the SoftMax activation function
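
A useful sanity check is to chain the CrossEntropy backward pass into the SoftMax backward pass: for a one-hot label $l$ the combined gradient with respect to the SoftMax input should reduce to $y - l$. The sketch below verifies this for one made-up sample in plain Python; the function names are assumptions, and the $ϵ$ term of the loss is left out because every SoftMax output here is strictly positive.

```python
import math


def softmax(row):
    """Numerically stable softmax of a single input row."""
    shifted = [value - max(row) for value in row]
    exponentials = [math.exp(value) for value in shifted]
    total = sum(exponentials)
    return [e / total for e in exponentials]


def softmax_backward(error, output):
    """y * (e - sum_j e_j y_j), the same computation as the NumPy snippet."""
    inner = sum(e * y for e, y in zip(error, output))
    return [y * (e - inner) for e, y in zip(error, output)]


x = [2.0, 1.0, 0.1]
label = [0.0, 1.0, 0.0]

y = softmax(x)
ce_error = [-l / p for l, p in zip(label, y)]   # CrossEntropy backward without eps
grad_x = softmax_backward(ce_error, y)
expected = [p - l for p, l in zip(y, label)]

print(all(abs(g - e) < 1e-12 for g, e in zip(grad_x, expected)))
```

If the two backward passes are implemented consistently, this comparison holds for any input and any one-hot label.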

Creating the Neural Net

We want a simple neural net consisting of 2 dense layers.

See C++ code for the Neural net
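
As a rough sketch of how the pieces fit together, the forward pass of the described net (4 inputs, a hidden layer of 30 with ReLU, 3 outputs with SoftMax) can be written in a few lines of plain Python. The weight initialization, the helper names and the two sample rows are assumptions for illustration; the linked C++ code may differ in all of them.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dense(v, w, b):
    return [[s + bias for s, bias in zip(row, b)] for row in matmul(v, w)]


def relu(x):
    return [[value if value >= 0 else 0.0 for value in row] for row in x]


def softmax(x):
    result = []
    for row in x:
        exponentials = [math.exp(value - max(row)) for value in row]
        total = sum(exponentials)
        result.append([e / total for e in exponentials])
    return result


def init_weights(n_in, n_out, rng):
    # assumption: small uniform random values, not necessarily what the linked code does
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


rng = random.Random(0)
w1, b1 = init_weights(4, 30, rng), [0.0] * 30   # first dense layer: 4 -> 30
w2, b2 = init_weights(30, 3, rng), [0.0] * 3    # second dense layer: 30 -> 3

# Two illustrative samples in the Iris feature layout (four measurements each).
batch = [[5.1, 3.5, 1.4, 0.2],
         [6.7, 3.0, 5.2, 2.3]]

hidden = relu(dense(batch, w1, b1))
probabilities = softmax(dense(hidden, w2, b2))

for row in probabilities:
    print([round(p, 3) for p in row])
```

With untrained weights the three class probabilities are close to uniform; training with the loss and optimizer described above is what moves them towards the one-hot labels.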

Defining a DataLoader makes it easier to handle our training data.

See C++ code for the DataLoader
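
What such a loader does can be sketched as follows: it shuffles the sample order at the start of every pass and hands out mini-batches of samples together with their labels. This is a hypothetical minimal version in Python; the class layout, the parameter names and the `seed` argument are assumptions and do not describe the interface of the linked C++ DataLoader.

```python
import random


class DataLoader:
    """Hypothetical minimal loader: shuffles per pass and yields mini-batches."""

    def __init__(self, samples, labels, batch_size, shuffle=True, seed=None):
        if len(samples) != len(labels):
            raise ValueError("samples and labels must have the same length")
        self.samples = samples
        self.labels = labels
        self.batch_size = batch_size
        self.shuffle = shuffle
        self._rng = random.Random(seed)

    def __len__(self):
        # number of batches per pass, the last one may be smaller
        return (len(self.samples) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        order = list(range(len(self.samples)))
        if self.shuffle:
            self._rng.shuffle(order)
        for start in range(0, len(order), self.batch_size):
            indices = order[start:start + self.batch_size]
            yield ([self.samples[i] for i in indices],
                   [self.labels[i] for i in indices])


# Ten toy samples whose single feature equals their label.
samples = [[float(i)] for i in range(10)]
labels = list(range(10))

loader = DataLoader(samples, labels, batch_size=4, seed=0)
batches = list(loader)
print([len(batch_samples) for batch_samples, _ in batches])
```

Keeping samples and labels behind one index list guarantees that shuffling never separates a sample from its label.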

See C++ code for reading the Iris dataset

And finally the main function:

See C++ code for the main function

Training this model results in a training loss of 0.0077 and a validation loss of 0.0063 after 1000 iterations. All 45 validation samples are classified correctly.
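
Counting correct predictions amounts to comparing the index of the largest SoftMax output with the index of the 1 in the one-hot label. The sketch below shows one way to do this in plain Python; the helper names are assumptions, and the numbers are made up for illustration and are not outputs of the trained model.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(predictions, labels):
    """Fraction of samples whose predicted class matches the one-hot label."""
    correct = sum(argmax(p) == argmax(l) for p, l in zip(predictions, labels))
    return correct / len(labels)


# Made-up SoftMax outputs and labels for three samples.
predictions = [[0.90, 0.05, 0.05],
               [0.20, 0.70, 0.10],
               [0.40, 0.50, 0.10]]
labels = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]

score = accuracy(predictions, labels)
print(score)
```

In this toy example the third sample is misclassified, so two of three predictions count as correct.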