Step by Step Back Propagation:

A very detailed step by step Back Propagation Example.


Backpropagation is the most common method for training neural networks. You can find various papers here and there regarding backpropagation. However, most undergraduate and graduate students like me struggle to understand the equations for backpropagation, especially when they involve tons of notation and you have to scroll back every time to check which symbol means what. In this post, I'll try my best to explain how it works with a simple example and pseudo-code that can be applied to any number of layers. For a better understanding, y'all should also perform the calculations yourselves in order to get a grip on what is going on.

Do check my post on multiclass perceptron classification in case you are interested. Click here to visit the post.

I'll begin with a 3 layer network. The first layer is the input layer obviously. The second layer is the hidden layer and the third layer is the output layer.

I am using the minimal notation for the weights, input layer, and output layer. It's the minimum notation that is required to solve backpropagation.

Now there are no computations going inside the input layer. Computations only go on in the hidden and output layer. 

Activation Function:

We will go ahead with the Sigmoid function as our activation function. You can freely choose yours.
The sigmoidal equation is as follows
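
For reference, the standard logistic sigmoid is sigmoid(x) = 1 / (1 + e^(-x)), and its derivative can be written in terms of its own output as out * (1 - out). Here is a minimal NumPy sketch of both; the function names are my own choice for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(out):
    """Derivative of the sigmoid, written in terms of its output out = sigmoid(x)."""
    return out * (1.0 - out)

x = np.array([-2.0, 0.0, 2.0])
out = sigmoid(x)
print(out)                      # values squashed into the range (0, 1)
print(sigmoid_derivative(out))  # largest at x = 0
```

Writing the derivative in terms of the output is handy later, because the forward pass has already computed that output.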

I have named the weights at each node in a way that is easy to remember. Let's work out the output of the output layer first.

Error Function:
The error function determines the error associated with the output we received from our network with respect to the actual output.

The error associated with the output and actual output will be calculated using mean square error. The equation is 
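
As a rough sketch, one common form of this error is half the sum of squared differences between target and output. The factor of 1/2 is an assumption on my part (it is a popular convention because it cancels when differentiating), and the function names are hypothetical.

```python
import numpy as np

def squared_error(target, output):
    """Half the sum of squared differences; the 1/2 cancels the 2 on differentiation."""
    target = np.asarray(target, dtype=float)
    output = np.asarray(output, dtype=float)
    return 0.5 * np.sum((target - output) ** 2)

def squared_error_gradient(target, output):
    """Derivative of the error above with respect to each output."""
    return np.asarray(output, dtype=float) - np.asarray(target, dtype=float)

print(squared_error([1.0, 0.0], [0.8, 0.3]))
print(squared_error_gradient([1.0, 0.0], [0.8, 0.3]))
```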

Forward Pass:

It involves calculating the output of each neuron after the multiplication of input by weights and through the activation function.

The sigmoid equation is
The output from the hidden layer is
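
To make the forward pass concrete, here is a small sketch. It assumes a network with 2 inputs, 2 hidden neurons and 2 outputs, no bias terms, weights w1 to w4 going from input to hidden and w5 to w8 going from hidden to output. The layout and all numeric values are made up for illustration and may differ from the network in the diagram.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical inputs and weights, chosen only for illustration.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input -> hidden
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55   # hidden -> output

# Hidden layer: weighted sum of the inputs, then the activation function.
h1 = sigmoid(w1 * i1 + w2 * i2)
h2 = sigmoid(w3 * i1 + w4 * i2)

# Output layer: the hidden outputs act as its inputs.
o1 = sigmoid(w5 * h1 + w6 * h2)
o2 = sigmoid(w7 * h1 + w8 * h2)

print(h1, h2, o1, o2)
```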

Now we have to calculate the gradient in order to go ahead with backpropagation. The gradient is the partial derivative of the error with respect to each individual weight; gradient descent then uses it to update the weights. Let's go ahead with the weight W7.

Backward Pass:

It involves calculating the gradient and updating each weight so they can move a bit closer to the actual output we need.

Always remember the above equations, as they make differentiation a lot easier. The partial derivative cannot be computed directly, because the error does not depend on the weight w7 directly: w7 changes the weighted input of an output neuron, which changes that neuron's output, which in turn changes the error. So to get that change, we have to use the chain rule. As per the chain rule, we will get this:

Isn't it simple?
Now let us calculate each part of the above equation. The first equation will be

The second equation will be as follows:

The third equation will be as follows:

Combining all the equations, we will get our required partial differentiation output as:

The new weight will be calculated using the equation below:

In the above equation, alpha is the learning rate.
We can go with the same procedure to calculate the error gradient for other weights.
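
The three chain-rule factors and the weight update can be sketched in a few lines. I am assuming here that w7 connects hidden neuron h1 to output o2, that the error is the half squared error, and that the update is new weight = old weight - alpha * gradient. All numbers are made up, and the finite-difference check at the end is only there to confirm the chain-rule result.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_for(w7, h1=0.6, h2=0.7, w8=0.55, t2=0.99):
    """Error of output o2 as a function of w7, with everything else held fixed."""
    o2 = sigmoid(w7 * h1 + w8 * h2)
    return 0.5 * (t2 - o2) ** 2

h1, h2, w7, w8, t2 = 0.6, 0.7, 0.50, 0.55, 0.99
alpha = 0.5  # learning rate

o2 = sigmoid(w7 * h1 + w8 * h2)

dE_do2 = o2 - t2             # first factor: error w.r.t. the output
do2_dnet = o2 * (1.0 - o2)   # second factor: output w.r.t. the weighted sum
dnet_dw7 = h1                # third factor: weighted sum w.r.t. w7
grad_w7 = dE_do2 * do2_dnet * dnet_dw7

# Finite-difference estimate of the same derivative, as a sanity check.
eps = 1e-6
numeric = (error_for(w7 + eps) - error_for(w7 - eps)) / (2 * eps)

w7_new = w7 - alpha * grad_w7
print(grad_w7, numeric, w7_new)
```

Since the output is below the target here, the gradient is negative and the update pushes w7 up, which is what we want.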

Now obviously, we shouldn't use the above equations inside a for loop while programming, since updating one weight at a time can make training around ten times slower. The solution? Vector mathematics!!

For the network at the start of the post, the weight matrix is as follows:

The hidden layer output, which acts as the input here for calculating the output of the output layer, is as follows:
The dot product i.e. output will be as follows:

Now we have calculated y. Next we need the output we get after the above matrix passes through the Sigmoid activation function. To do that in a vectorized way, we can use a NumPy function.

Sigmoid Matrix:
This will work if you are using numpy in your python code. The output will be:
Final Weight Calculation:
Now the error gradient for the entire output layer and the final updated weight matrix will be:

The O matrix is the output matrix so don't get confused!
The above equation will update the entire hidden to output matrix in a single go. Isn't that better than updating each weight individually which is gonna cost you time? Also, it is important to know that all weights are updated at once. You can't update the output weight matrix first and then the input weight matrix.  All weight updates are performed with original weights and not with updated weights.
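
Here is a minimal NumPy sketch of that single-shot update for the hidden-to-output weights. It assumes a 2x2 weight matrix whose row k holds the weights going into output k, sigmoid activations and the half squared error; the variable names and values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.array([0.6, 0.7])              # hidden-layer outputs (input to this layer)
W_out = np.array([[0.40, 0.45],       # row k holds the weights into output k
                  [0.50, 0.55]])
t = np.array([0.01, 0.99])            # target values
alpha = 0.5                           # learning rate

O = sigmoid(W_out @ h)                # all outputs at once
delta_out = (O - t) * O * (1.0 - O)   # one delta per output neuron
grad_W_out = np.outer(delta_out, h)   # gradient, same shape as W_out
W_out_new = W_out - alpha * grad_W_out

print(grad_W_out)
print(W_out_new)
```

Note that W_out itself is left untouched, so the original weights are still available when the input-to-hidden gradients are computed.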

Now have a look at the following equation again:

Transforming the above equation:

We will use the above equations later to make things easy.

Hidden Layer Weights.

To get the gradient of Error w.r.t the weights from input to the hidden layer, we have to follow the chain rule again.

Have a look at the following diagram.

So, getting down to the equation to find the gradient with respect to weight w1.

Now coming to the main part: making the equations easier using memoization.

Thus it's clear from the above equations how the memoization process finds all the gradients. Suppose we have a 4 layer network, i.e. with 2 hidden layers; then the equations will be as follows:

At last after calculating the new weights, we will update the weights all at once.
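
Putting it together, here is a rough sketch of backpropagation for any number of layers. It assumes sigmoid activations everywhere, no bias terms and the half squared error; the function names are my own. The delta of each layer is stored and reused for the layer before it (the memoization idea), and all gradients are computed from the original weights before any weight is changed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, x):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))
    return activations

def backprop(weights, x, t):
    """Gradients of 0.5 * sum((t - output)^2) for every weight matrix."""
    activations = forward(weights, x)
    grads = [None] * len(weights)
    out = activations[-1]
    delta = (out - t) * out * (1.0 - out)          # delta of the output layer
    for layer in reversed(range(len(weights))):
        grads[layer] = np.outer(delta, activations[layer])
        if layer > 0:
            # Reuse the stored delta instead of re-deriving the whole chain.
            a = activations[layer]
            delta = (weights[layer].T @ delta) * a * (1.0 - a)
    return grads

def train_step(weights, x, t, alpha=0.5):
    """Compute all gradients with the original weights, then update together."""
    grads = backprop(weights, x, t)
    return [W - alpha * g for W, g in zip(weights, grads)]

def loss(weights, x, t):
    return 0.5 * np.sum((t - forward(weights, x)[-1]) ** 2)

# A 4 layer network (2 hidden layers) with made-up sizes and values.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(3, 2)),   # input (2) -> hidden (3)
           rng.normal(scale=0.5, size=(3, 3)),   # hidden (3) -> hidden (3)
           rng.normal(scale=0.5, size=(2, 3))]   # hidden (3) -> output (2)
x = np.array([0.05, 0.10])
t = np.array([0.01, 0.99])

before = loss(weights, x, t)
for _ in range(200):
    weights = train_step(weights, x, t)
print(before, loss(weights, x, t))
```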

A Request: Do let me know, if you find any mistakes.

Multiclass Perceptron Implementation

Implement a Multiclass Perceptron

In this post, I will explain the working of a multiclass perceptron. We all know that perceptrons have a unit step function as an activation function. So the output will obviously be either 0 or 1. If the computed output is greater than 0 we set the outcome as 1 else 0. This is useful only when we have to classify between two labels. But how do we classify more than 2 labels using a perceptron? Things are easy only when we have exactly two labels.

So in this post, I'll deal with the Iris flower data set. It has 3 flowers, namely Iris-Setosa, Iris-Versicolor, and Iris-Virginica. The tricky part starts now. We have 3 classes and we have to train a "Perceptron" to classify among the three flowers. So how do we go ahead?

In order to classify between multiple classes, we train a perceptron on two classes at a time and then repeat the same procedure for the other pairs of classes. Let us take the Iris flower data set, with its 3 flowers Iris-Setosa, Iris-Versicolor, and Iris-Virginica, as the example. Let's begin.

The pseudo-code will be:

1. Train a Perceptron to classify Iris-Setosa and Iris-Versicolor
2. Train a Perceptron to classify Iris-Versicolor and Iris-Virginica
3. Train a Perceptron to classify Iris-Virginica and Iris-Setosa
4. Choose the maximum from each output and boom that should be your prediction among the 3 classes.

We will need two layers for a simple classification. The input layer will have 4 nodes, one each for Petal length, Petal width, Sepal length, and Sepal width. The output will obviously be one flower.

The input that we have here is in form as stated below:
0.1, 0.2, 0.3, 0.4, Iris-Setosa

Let us begin with the code part:

I am using NumPy on Python 3.7. You can use Pandas too. No restrictions here. Initially, we import the numpy module.

If you have it installed, you are good to go. Otherwise, launch Command Prompt and type:

cd "Your Path where Python is installed without quotes" 

Then type
pip install numpy

I hope everything goes successfully. After importing is successful, we will create a perceptron class which will have functions like train to train the network and test to test the changes on the test data. The constructor will declare the variables required for the code execution. I am coding on the basis of the Pocket Algorithm with the ratchet. I am assuming that you all know about the algorithm or else I will create a new post regarding Perceptron.

Part A: The Perceptron Class

What is __init__?
__init__ is not a reserved keyword but a special method name. It acts as the constructor (more precisely, the initializer) and is called whenever a class object is instantiated.

What is self in Python?
self is not a reserved keyword either; it is the conventional name of the first parameter of a method. It represents the instance of the class and is used to access its attributes and variables from the methods inside the class.

We have 6 self variables here:

1. self.vector will store the input features as a numpy array.
2. self.weights will store the input weights for our input features.
3. self.label will store the labels for our flower data set.
4. self.pocket_weight will store the best possible weight for our network.
5. self.learning will store the learning rate.
6. self.bias will store the initial bias value and will also store the updated value of the bias.
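
A constructor holding these six variables could look roughly like the sketch below. The attribute names come from the list above, but the constructor signature, the default values and the predict helper are my assumptions for illustration.

```python
import numpy as np

class Perceptron:
    def __init__(self, vector, label, learning=0.01, bias=0.0):
        # Input features, one row per sample.
        self.vector = np.asarray(vector, dtype=np.float32)
        # 0/1 labels, one per sample.
        self.label = np.asarray(label, dtype=np.float32)
        # One weight per input feature, initialized to 0.
        self.weights = np.zeros(self.vector.shape[1], dtype=np.float32)
        # Best weights seen so far (the "pocket").
        self.pocket_weight = self.weights.copy()
        self.learning = learning
        self.bias = bias

    def predict(self, x):
        """Unit step applied to w.x + bias (hypothetical helper)."""
        return 1 if np.dot(self.weights, x) + self.bias >= 0 else 0

p = Perceptron([[0.1, 0.2, 0.3, 0.4]], [0])
print(p.weights, p.predict(p.vector[0]))
```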

Now we will come down to the train method. In this method, we will train our network. Initially, I went with 250 epochs which gave me 93.33 % accuracy on a small dataset. We will be going ahead with the Pocket Algorithm for this classification with a ratchet. We will also test each updated weight on the training data and store the accuracy. If the new accuracy returned by the test method is greater than the "pocket" accuracy, we will store the updated weight vector in our "pocket". If the accuracy is low, we will not store that weight in the pocket and move on with training and weight update.

The pseudo-code for our Pocket Algorithm Training will be like:

Initialize Weight Vector as 0
Initialize CurrentAccuracy as 0
Loop in Epochs:
     Test the Training data with the current weight.
     If accuracy with current weight vector is greater than CurrentAccuracy:
           Store the weight vector
           Update CurrentAccuracy
     Update the Weight Vector:
           Output = w.x + bias
           If Output >= 0
                 Output = 1
           Else
                 Output = 0
           Weight = Weight + Learning_rate*(target-output)*x
           Bias = Bias + Learning_rate*(target-output)

The bias is also trained by the network except it is not multiplied with the input vector.
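
Here is how the pocket training loop might look as a standalone function. It is a sketch under a few assumptions: labels are 0 and 1, the function and variable names are hypothetical, the bias is pocketed together with the weights, and one extra check is done after the last epoch so the final weights also get a chance to enter the pocket.

```python
import numpy as np

def accuracy(weights, bias, X, y):
    """Fraction of samples classified correctly by step(w.x + bias)."""
    predictions = (X @ weights + bias >= 0).astype(int)
    return float(np.mean(predictions == y))

def train_pocket(X, y, epochs=250, learning_rate=0.01):
    weights = np.zeros(X.shape[1])
    bias = 0.0
    pocket_weights, pocket_bias, pocket_accuracy = weights.copy(), bias, 0.0
    for epoch in range(epochs + 1):
        # Ratchet: keep the current weights only if they beat the pocket.
        current = accuracy(weights, bias, X, y)
        if current > pocket_accuracy:
            pocket_weights, pocket_bias, pocket_accuracy = weights.copy(), bias, current
        if epoch == epochs:
            break  # final check only, no further update
        # One pass of perceptron updates over the training data.
        for x, target in zip(X, y):
            output = 1 if np.dot(weights, x) + bias >= 0 else 0
            weights = weights + learning_rate * (target - output) * x
            bias = bias + learning_rate * (target - output)
    return pocket_weights, pocket_bias, pocket_accuracy

# Tiny made-up, linearly separable data set.
X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w, b, acc = train_pocket(X, y, epochs=10)
print(w, b, acc)
```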

The pocket_weight method will return the best weight, the one which gave us the highest accuracy. The bias method will return the final trained bias. Since the input file stores the input data of all flowers, we will have to separate the flowers as explained at the start of the post. We will create 3 numpy arrays. Array x1 will store the input data for Iris-Setosa and Iris-Versicolor. Array x2 will store the input data for Iris-Versicolor and Iris-Virginica. Similarly, array x3 will store the input data for Iris-Virginica and Iris-Setosa. The following code in the snapshot will perform the above operation of separating our datasets for individual training.

Fig. Separation on Input Data Set for individual training

We will also separate labels likewise. Label_1 will be for training set x1. Similarly, label_2 will be for training set x2 and so on.  I have initialized all the weight vectors for each set as 0. You can initialize the weight vector with a random number as well. It's entirely up to your choice but always keep a low value else oscillations will take place. The datatype for the numpy arrays is float32. 
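
One way to do this separation is with boolean masks. The sketch below uses a tiny made-up sample instead of the real file, and the helper name pair_subset is hypothetical; the 0/1 labelling follows the pairing used in this post (the first flower of each pair is 0, the second is 1).

```python
import numpy as np

# Tiny made-up sample in the same "4 features, label" layout.
features = np.array([[0.1, 0.2, 0.3, 0.4],
                     [0.5, 0.6, 0.7, 0.8],
                     [0.9, 1.0, 1.1, 1.2],
                     [0.2, 0.3, 0.4, 0.5]], dtype=np.float32)
names = np.array(["Iris-Setosa", "Iris-Versicolor", "Iris-Virginica", "Iris-Setosa"])

def pair_subset(features, names, class_zero, class_one):
    """Rows of the two classes, labelled 0 for class_zero and 1 for class_one."""
    mask = (names == class_zero) | (names == class_one)
    labels = (names[mask] == class_one).astype(np.float32)
    return features[mask], labels

x1, label_1 = pair_subset(features, names, "Iris-Setosa", "Iris-Versicolor")
x2, label_2 = pair_subset(features, names, "Iris-Versicolor", "Iris-Virginica")
x3, label_3 = pair_subset(features, names, "Iris-Virginica", "Iris-Setosa")

print(x1.shape, label_1)
print(x2.shape, label_2)
print(x3.shape, label_3)
```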

I have also implemented the code to implement a confusion matrix for the test data as well as the training data. The variables are as follows:

TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative

Here we have 3 Perceptron objects for the 3 classifications, namely percep_1, percep_2, and percep_3. percep_1.pocket_weight returns the stored pocket weight which showed the best accuracy on the training data of set 1. Similarly, we do this for the other sets as well. As per the above screenshot, each of the 3 sets has its own pocket weights, so for every test sample we get one value per set from the dot product of the test data and that set's pocket weight. We then select the maximum among the three values, here f1, f2, and f3. Remember, each test sample is validated against every set. We won't categorize the test data like we did for the training data.

The argmax function will return the index of the maximum value among the numpy array. Now the question is how to predict which index is for whom.

F1 is the trained network with Iris-Setosa and Iris-Versicolor with 0 as Iris-Setosa and 1 for Versicolor. 
F2 is the trained network with Iris-Versicolor and Iris-Virginica with 0 as Iris-Versicolor and 1 for Virginica. 
F3 is the trained network with Iris-Virginica and Iris-Setosa with 0 as Iris-Virginica and 1 for Setosa

Now F1 has 1 which is maximum for Versicolor.
Similarly, F2 has 1 which is maximum for Virginica.
Similarly, F3 has 1 which is maximum for Setosa.

Now, the numpy array is arranged in the fashion: F1, F2, F3. So argmax will choose maximum among these, thus 0th index will return Versicolor, 1 will return Virginica and 2 will return Setosa. Similarly, argmin will return 0 for Setosa, 1 for Versicolor, 2 for Virginica.
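
The mapping from argmax index to flower can be sketched like this. The pocket weights below are invented purely to show the mechanics, and predict_flower is a hypothetical helper name.

```python
import numpy as np

# Index 0 -> F1's class 1, index 1 -> F2's class 1, index 2 -> F3's class 1.
CLASS_FOR_ARGMAX = ["Iris-Versicolor", "Iris-Virginica", "Iris-Setosa"]

def predict_flower(x, pockets):
    """pockets is a list of (weights, bias) pairs for F1, F2, F3, in that order."""
    scores = np.array([np.dot(w, x) + b for w, b in pockets])
    return CLASS_FOR_ARGMAX[int(np.argmax(scores))]

# Hypothetical pocket weights, not trained values.
pockets = [(np.array([1.0, 0.0, 0.0, 0.0]), 0.0),
           (np.array([0.0, 1.0, 0.0, 0.0]), 0.0),
           (np.array([0.0, 0.0, 1.0, 0.0]), 0.0)]

print(predict_flower(np.array([0.2, 0.9, 0.1, 0.0]), pockets))
```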

Coming to the confusion matrix: if the returned value and the labeled value are the same, we increment TP and TN by 1; otherwise we increment FP and FN by 1. False Positives lie on the vertical axis of the confusion matrix and False Negatives lie on the horizontal axis.
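
A compact way to build the matrix and read precision and recall from it is sketched below. It assumes rows are the actual classes and columns are the predicted classes, so for a given class the false positives sit in its column and the false negatives in its row. The class indices and sample predictions are made up.

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """matrix[i, j] counts samples whose actual class is i and predicted class is j."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[a, p] += 1
    return matrix

def precision_recall(matrix, cls):
    tp = matrix[cls, cls]
    fp = matrix[:, cls].sum() - tp   # predicted as cls, but actually another class
    fn = matrix[cls, :].sum() - tp   # actually cls, but predicted as another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
m = confusion_matrix(actual, predicted, 3)
print(m)
print(precision_recall(m, 1))
```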

Confused? Have a look at the image below for better understanding.

To get the code for the above implementation, click the below Download Text.

Download Code
Here is the confusion matrix that I got for my test data.
Precision and Recall values are also present.

The output from the above code is:

You can get the training and test data from below gist.
Feel free to ask if anyone wants some help with the code.
