## A Very Detailed Step-by-Step Backpropagation Example

#### Background

Backpropagation is the most common method for training neural networks, and you can find plenty of papers about it. However, most undergraduate and graduate students like me struggle to understand the backpropagation equations, especially when they involve tons of notation and you have to scroll back every time to check which symbol means what. In this post, I'll try my best to explain how it works with a simple example and pseudocode that can be applied to any number of layers. For a better understanding, y'all should also perform the calculations yourselves in order to get a grip on what is going on.

Do check my post on multiclass perceptron classification in case you are interested.

I'll begin with a 3-layer network. The first layer is the input layer, the second is the hidden layer, and the third is the output layer.

I am using minimal notation for the weights, input layer, and output layer: just enough to work through backpropagation.

No computation happens inside the input layer. Computation only happens in the hidden and output layers.

#### Activation Function:

We will go ahead with the sigmoid function as our activation function. You can freely choose yours.
The sigmoid equation is as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
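As a quick illustration, the sigmoid and its derivative fit in a few lines of Python. The function names are my own and this is only a sketch. The derivative can be written in terms of the sigmoid's own output, which is what keeps the backward pass cheap later on.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
```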

I have named the weights at each node in a way that is easy to remember. Let's work out the output of the output layer first.

#### Error Function:

The error function measures how far the output we received from our network is from the actual (target) output.

The error between the network output and the actual output will be calculated using the mean squared error.
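A minimal sketch of such an error function is shown below. It assumes the common form E = 1/2 · Σ (target − output)², where the factor 1/2 is only there to cancel the 2 that appears during differentiation. The names `targets` and `outputs`, and the numbers, are illustrative.

```python
def squared_error(targets, outputs):
    """Half the sum of squared differences between targets and outputs."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

# Hypothetical numbers: two output nodes with targets 0 and 1.
print(squared_error([0.0, 1.0], [0.5, 0.5]))  # 0.25
```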

#### Forward Pass:

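The forward pass calculates the output of each neuron: the inputs are multiplied by the weights, summed, and passed through the activation function.

Each node therefore applies the sigmoid to the weighted sum of its inputs. The hidden layer is computed first, and its outputs then act as the inputs to the output layer.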
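The forward pass can be sketched node by node as follows. Everything specific here is an assumption for illustration: a network with 2 inputs, 2 hidden nodes and 2 output nodes, no bias terms, weights w1 to w4 feeding the hidden layer and w5 to w8 feeding the output layer, and made-up numbers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical inputs and weights (not taken from the post's diagram).
i1, i2 = 0.1, 0.5
w1, w2, w3, w4 = 0.1, 0.3, 0.2, 0.4   # input  -> hidden
w5, w6, w7, w8 = 0.5, 0.6, 0.7, 0.8   # hidden -> output

# Hidden layer: weighted sum of the inputs, then sigmoid.
out_h1 = sigmoid(w1 * i1 + w2 * i2)
out_h2 = sigmoid(w3 * i1 + w4 * i2)

# Output layer: weighted sum of the hidden outputs, then sigmoid.
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2)

print(out_h1, out_h2, out_o1, out_o2)
```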

Now we have to calculate the gradient in order to go ahead with backpropagation. The gradient is the partial derivative of the error with respect to each individual weight; gradient descent then uses it to update the weights. Let's go ahead with weight w7.

#### Backward Pass:

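It involves calculating the gradient and updating each weight so that the network's output moves a bit closer to the actual output we need.

Keep the above equations in mind, as they make the differentiation a lot easier. The partial derivative cannot be taken directly, because the error does not depend on w7 directly: w7 changes the weighted input of an output node, which changes that node's output, which in turn changes the error. So to get that change, we have to use the chain rule, which splits the derivative into three factors: how the error changes with the node's output, how the output changes with the node's weighted input, and how the weighted input changes with w7.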

Isn't it simple?
Now let us calculate each part of the above equation. The first equation will be

The second equation will be as follows:

The third equation will be as follows:

Combining all the equations, we will get our required partial differentiation output as:
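Written out with explicit symbols, the chain rule for w7 might look as follows. This is a sketch under assumptions, and the symbol names are mine: the error is $E = \frac{1}{2}\sum_k (t_k - \text{out}_{o_k})^2$, w7 connects a hidden node with output $\text{out}_h$ to an output node with weighted input $\text{net}_o$, output $\text{out}_o = \sigma(\text{net}_o)$ and target $t$, and there are no bias terms.

$$
\begin{aligned}
\frac{\partial E}{\partial w_7}
  &= \frac{\partial E}{\partial \text{out}_o}\cdot
     \frac{\partial \text{out}_o}{\partial \text{net}_o}\cdot
     \frac{\partial \text{net}_o}{\partial w_7} \\
\frac{\partial E}{\partial \text{out}_o} &= -(t - \text{out}_o) \\
\frac{\partial \text{out}_o}{\partial \text{net}_o} &= \text{out}_o\,(1 - \text{out}_o) \\
\frac{\partial \text{net}_o}{\partial w_7} &= \text{out}_h \\
\frac{\partial E}{\partial w_7} &= -(t - \text{out}_o)\;\text{out}_o\,(1 - \text{out}_o)\;\text{out}_h
\end{aligned}
$$

Only the output node that w7 feeds into contributes to the first factor, because the other terms of the sum in $E$ do not depend on w7.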

The weight is then updated using the equation below:

$$w_7^{new} = w_7 - \alpha\,\frac{\partial E}{\partial w_7}$$

In the above equation, alpha is the learning rate.
We can go with the same procedure to calculate the error gradient for other weights.
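To make the procedure concrete, here is a small sketch that computes the gradient for a single hidden-to-output weight and compares it with a finite-difference estimate. It assumes a squared error with a factor of 1/2, a sigmoid output node without bias, and made-up values for the hidden outputs, weights, target and learning rate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values: hidden outputs, two weights into one output node, target.
out_h1, out_h2 = 0.54, 0.55
w7, w8 = 0.7, 0.8
t = 1.0
alpha = 0.5  # learning rate, chosen arbitrarily

def error_for(w):
    """Squared error of the output node as a function of w7 alone."""
    out = sigmoid(w * out_h1 + w8 * out_h2)
    return 0.5 * (t - out) ** 2

out_o = sigmoid(w7 * out_h1 + w8 * out_h2)

# Chain rule: dE/dout_o * dout_o/dnet_o * dnet_o/dw7
grad_w7 = -(t - out_o) * out_o * (1.0 - out_o) * out_h1

# Sanity check against a central finite difference.
eps = 1e-6
numeric = (error_for(w7 + eps) - error_for(w7 - eps)) / (2 * eps)

w7_new = w7 - alpha * grad_w7
print(grad_w7, numeric, w7_new)
```

The two gradient values should agree to several decimal places. This kind of check is a cheap way to catch sign or indexing mistakes in a hand-derived gradient.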

Now obviously, we shouldn't evaluate the above equations one weight at a time inside a for loop while programming, as that can make training around ten times slower. The solution? Vector mathematics!

For the network at the start of the post, the weight matrix is as follows:

The hidden layer outputs, which act as the input here to calculate the output of the output layer, are as follows:
The dot product, i.e. the output, will be as follows:

Now we have calculated y. Next we need to calculate the output that we get after the above matrix passes through the sigmoid activation function. To do that in a vectorized way, we can use NumPy functions.
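A minimal NumPy sketch of this step is shown below. The weight matrix `W` and hidden outputs `H` hold made-up values, and the layout (one row of `W` per output node) is an assumption of this example.

```python
import numpy as np

def sigmoid(y):
    """Element-wise sigmoid; works on scalars and NumPy arrays alike."""
    return 1.0 / (1.0 + np.exp(-y))

# Hypothetical hidden-to-output weights: one row per output node.
W = np.array([[0.5, 0.6],
              [0.7, 0.8]])
H = np.array([0.54, 0.55])  # hidden layer outputs (made-up values)

y = W @ H        # weighted sums for all output nodes at once
O = sigmoid(y)   # activations for all output nodes at once
print(y, O)
```

Because `np.exp` operates element-wise, the same `sigmoid` function handles a whole layer without an explicit loop.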

Sigmoid Matrix:
This will work if you are using NumPy in your Python code. The output will be:

Final Weight Calculation:
Now the error gradient for the entire output layer and the final updated weight matrix will be:

The O matrix is the output matrix so don't get confused!
The above equation will update the entire hidden-to-output matrix in a single go. Isn't that better than updating each weight individually, which is gonna cost you time? Also, it is important to know that all weights are updated at once. You can't update the output weight matrix first and then use it to compute the update for the input weight matrix. All gradients are computed with the original weights and not with the updated weights.
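The whole-matrix update can be sketched in NumPy as follows. The sketch assumes a squared error with a factor of 1/2, sigmoid outputs and no bias. The target vector `T`, the learning rate and all numbers are hypothetical; `O` is the output matrix as in the text.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

W = np.array([[0.5, 0.6],
              [0.7, 0.8]])   # hidden -> output weights (hypothetical)
H = np.array([0.54, 0.55])   # hidden layer outputs (hypothetical)
T = np.array([0.0, 1.0])     # target outputs (hypothetical)
alpha = 0.5                  # learning rate

O = sigmoid(W @ H)

# Error signal of every output node: dE/dO * dO/dy
delta = (O - T) * O * (1.0 - O)

# Gradient for the whole matrix: entry [k, j] is delta[k] * H[j]
grad_W = np.outer(delta, H)

W_new = W - alpha * grad_W   # W itself is left untouched
print(W_new)
```

Keeping `W` unchanged and writing the result to `W_new` matters for the next step: the gradients for the input-to-hidden weights still need the original `W`.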

Now have a look at the following equation again:

Transforming the above equation:

We will use the above equations later to make things easier.

#### Hidden Layer Weights.

To get the gradient of the error with respect to the weights from the input to the hidden layer, we have to follow the chain rule again.

Have a look at the following diagram.

So, getting down to the equation for the gradient with respect to weight w1:
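In symbols, and under assumptions of my own (w1 connects input $i_1$ to hidden node $h_1$, $h_1$ feeds every output node $o_k$, all nodes use the sigmoid, and there are no bias terms), the chain rule for w1 can be sketched as:

$$
\frac{\partial E}{\partial w_1}
= \left(\sum_k
    \frac{\partial E}{\partial \text{out}_{o_k}}\cdot
    \frac{\partial \text{out}_{o_k}}{\partial \text{net}_{o_k}}\cdot
    \frac{\partial \text{net}_{o_k}}{\partial \text{out}_{h_1}}
  \right)\cdot
  \frac{\partial \text{out}_{h_1}}{\partial \text{net}_{h_1}}\cdot
  \frac{\partial \text{net}_{h_1}}{\partial w_1}
$$

Here $\partial \text{net}_{o_k} / \partial \text{out}_{h_1}$ is simply the weight connecting $h_1$ to $o_k$, $\partial \text{out}_{h_1} / \partial \text{net}_{h_1} = \text{out}_{h_1}(1 - \text{out}_{h_1})$, and $\partial \text{net}_{h_1} / \partial w_1 = i_1$. The sum is needed because $h_1$ influences the error through every output node it feeds.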

Now we come to the main part: making the equations easier using memoization, i.e. reusing the partial derivatives already computed for the layer above instead of deriving them again.

Thus the memoization process for finding all the gradients is clear from the above equations. Suppose we have a 4-layer network, i.e. with 2 hidden layers; then the equations will be as follows:

Finally, after calculating all the gradients, we update all the weights at once.
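Putting the pieces together, the procedure for any number of layers can be sketched in NumPy. This is an illustration under assumptions rather than a reference implementation: sigmoid activations everywhere, no bias terms, a squared error with a factor of 1/2, and a single training example. The function name `backprop_step` and the network sizes are hypothetical.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(weights, x, target, alpha=0.5):
    """One gradient-descent step for a bias-free sigmoid network.

    weights: list of matrices, weights[l] has shape (n_out, n_in).
    Returns a new list of updated matrices; the input list is not modified.
    """
    # Forward pass: remember the output of every layer.
    outputs = [x]
    for W in weights:
        outputs.append(sigmoid(W @ outputs[-1]))

    # Backward pass: delta holds dE/dnet of the current layer (the memo).
    O = outputs[-1]
    delta = (O - target) * O * (1.0 - O)
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(delta, outputs[l])
        if l > 0:
            h = outputs[l]
            # Reuse delta with the ORIGINAL weights to get the layer below.
            delta = (weights[l].T @ delta) * h * (1.0 - h)

    # Update every matrix at once, only after all gradients are known.
    return [W - alpha * g for W, g in zip(weights, grads)]

# Hypothetical 4-layer network: 2 inputs, two hidden layers of 3, 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)),
           rng.normal(size=(3, 3)),
           rng.normal(size=(2, 3))]
x, target = np.array([0.1, 0.5]), np.array([0.0, 1.0])

for _ in range(1000):
    weights = backprop_step(weights, x, target)
```

The variable `delta` is where the memoization happens: each layer's error signal is computed once from the layer above and reused for both that layer's gradient and the next `delta`.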

A Request: Do let me know if you find any mistakes.