Part 2 – Adding a Hidden Layer

Following on from the previous article, we now add a hidden layer of neurons between the input layer and the output layer.

The layer number is shown in parentheses as a superscript, and the neuron number within each layer is shown as a subscript.


We can now write the activation of each neuron as follows,

    \begin{align*} a^{(2)}_1 = \sigma(z^{(2)}_1) \\a^{(1)}_1 = \sigma(z^{(1)}_1) \end{align*}


    \begin{align*} z^{(2)}_1 = a^{(1)}_1 w^{(1)}_{11}+b^{(1)} \\z^{(1)}_1 = a^{(0)}_1 w^{(0)}_{11}+b^{(0)} \end{align*}
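As a quick sanity check of the four equations above, the forward pass can be sketched in a few lines of Python. This is only an illustrative sketch, not a full implementation: it assumes the activation function is the logistic sigmoid, and the names `forward`, `w0`, `b0`, `w1` and `b1` are hypothetical stand-ins for the weights and biases of layers 0 and 1.

```python
import math

def sigmoid(z):
    """Logistic function, assumed here to be the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(a0, w0, b0, w1, b1):
    """Forward pass through the 1-1-1 network; returns (z1, a1, z2, a2)."""
    z1 = a0 * w0 + b0   # weighted input of the hidden neuron
    a1 = sigmoid(z1)    # activation of the hidden neuron
    z2 = a1 * w1 + b1   # weighted input of the output neuron
    a2 = sigmoid(z2)    # activation of the output neuron
    return z1, a1, z2, a2

# With all weights and biases at zero, both activations equal sigmoid(0) = 0.5.
z1, a1, z2, a2 = forward(a0=0.5, w0=0.0, b0=0.0, w1=0.0, b1=0.0)
```

Note how the output of the hidden neuron, `a1`, plays the role that the input played in the single-layer case.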


The MSE cost function remains the same

(1)   \begin{equation*}J = \frac{1}{m}\sum_{i=1}^{m}(a^{(2)}_{1,i}-y_i)^2\end{equation*}
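For illustration, equation (1) could be written as the following small Python function. The name `mse_cost` and the sample values are hypothetical; `outputs` holds the output activation for each of the m samples and `targets` holds the corresponding labels.

```python
def mse_cost(outputs, targets):
    """Mean squared error between network outputs and targets (equation 1)."""
    m = len(outputs)
    return sum((a - y) ** 2 for a, y in zip(outputs, targets)) / m

# Three made-up samples: squared errors of roughly 0.01, 0.04 and 0.16.
J = mse_cost([0.9, 0.2, 0.6], [1.0, 0.0, 1.0])  # approximately 0.07
```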

Let’s apply Gradient Descent for the weight and bias of the last layer,

    \begin{align*} w^{(1)}_{11} := w^{(1)}_{11} - \alpha \frac{dJ}{dw^{(1)}_{11}} \\b^{(1)} := b^{(1)} - \alpha \frac{dJ}{db^{(1)}} \end{align*}

The chain rule now becomes,

(2)   \begin{equation*} \frac{dJ}{dw^{(1)}_{11}} = \frac{dJ}{da^{(2)}_1} \frac{da^{(2)}_1}{dz^{(2)}_1} \frac{dz^{(2)}_1}{dw^{(1)}_{11}} \end{equation*}

Isolating each term, we have

    \begin{align*} \frac{dJ}{da^{(2)}_1} = \frac{2}{m}(a^{(2)}_1-y)  \\\frac{da^{(2)}_1}{dz^{(2)}_1} = \sigma'(z^{(2)}_1) \\\frac{dz^{(2)}_1}{dw^{(1)}_{11}} = a^{(1)}_1 \end{align*}

Similarly for the bias,

(3)   \begin{equation*} \frac{dJ}{d b^{(1)}} = \frac{dJ}{da^{(2)}_1} \frac{da^{(2)}_1}{dz^{(2)}_1} \frac{dz^{(2)}_1}{d b^{(1)}} \end{equation*}


(4)   \begin{equation*} \frac{dz^{(2)}_1}{d b^{(1)}} = 1\end{equation*}
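One way to convince ourselves that equations (2) to (4) are right is to compare them with a finite-difference estimate of the gradient. The sketch below does this for a single sample (so m = 1), assuming a sigmoid activation, for which the derivative is sigmoid(z) * (1 - sigmoid(z)). All function names and numeric values are hypothetical and chosen only for the check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(a1, w1, b1, y):
    """Single-sample cost as a function of the last layer's weight and bias."""
    return (sigmoid(a1 * w1 + b1) - y) ** 2

def last_layer_grads(a1, w1, b1, y, m=1):
    """Return (dJ/dw1, dJ/db1) using the chain rule of equations (2)-(4)."""
    z2 = a1 * w1 + b1
    a2 = sigmoid(z2)
    dJ_da2 = (2.0 / m) * (a2 - y)
    da2_dz2 = sigmoid_prime(z2)
    dJ_dw1 = dJ_da2 * da2_dz2 * a1    # dz2/dw1 = a1
    dJ_db1 = dJ_da2 * da2_dz2 * 1.0   # dz2/db1 = 1
    return dJ_dw1, dJ_db1

# Arbitrary test point: hidden activation, weight, bias and target.
a1, w1, b1, y = 0.6, 0.3, -0.1, 1.0
dw, db = last_layer_grads(a1, w1, b1, y)

# Central finite differences as an independent estimate of the same gradients.
eps = 1e-6
dw_num = (cost(a1, w1 + eps, b1, y) - cost(a1, w1 - eps, b1, y)) / (2 * eps)
db_num = (cost(a1, w1, b1 + eps, y) - cost(a1, w1, b1 - eps, y)) / (2 * eps)
```

The analytic and numerical values should agree to many decimal places, and the weight gradient is simply the bias gradient multiplied by `a1`.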

The novelty is that we also need to calculate the sensitivity of the cost function with respect to the activation of the previous layer, $a^{(1)}_1$. This sensitivity is passed backwards from the output layer to the layer before it, which is why the method is called backpropagation.

(5)   \begin{equation*} \frac{dJ}{da^{(1)}_1} = \frac{dJ}{da^{(2)}_1} \frac{da^{(2)}_1}{dz^{(2)}_1} \frac{dz^{(2)}_1}{da^{(1)}_1} \end{equation*}

The only extra term we need to calculate is

(6)   \begin{equation*} \frac{dz^{(2)}_1}{da^{(1)}_1} = w^{(1)}_{11}\end{equation*}
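Equations (5) and (6) can be checked numerically in the same spirit. The sketch below treats the single-sample cost (m = 1) as a function of the hidden activation and compares the chain-rule result with a finite-difference estimate. It assumes a sigmoid activation, and `cost_from_a1` and `dJ_da1` are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_from_a1(a1, w1, b1, y):
    """Single-sample cost J viewed as a function of the hidden activation a1."""
    return (sigmoid(a1 * w1 + b1) - y) ** 2

def dJ_da1(a1, w1, b1, y):
    """Equation (5): dJ/da2 * da2/dz2 * dz2/da1, with m = 1."""
    a2 = sigmoid(a1 * w1 + b1)
    dJ_da2 = 2.0 * (a2 - y)
    da2_dz2 = a2 * (1.0 - a2)   # sigmoid derivative expressed through a2
    dz2_da1 = w1                # equation (6)
    return dJ_da2 * da2_dz2 * dz2_da1

a1, w1, b1, y = 0.6, 0.3, -0.1, 1.0
analytic = dJ_da1(a1, w1, b1, y)

# Central finite difference with respect to a1.
eps = 1e-6
numeric = (cost_from_a1(a1 + eps, w1, b1, y)
           - cost_from_a1(a1 - eps, w1, b1, y)) / (2 * eps)
```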

We can now look at adjusting the weight and bias of the first layer by using the same idea.

Gradient descent for the weight in the first layer

(7)   \begin{equation*} w^{(0)}_{11} := w^{(0)}_{11} - \alpha \frac{dJ}{dw^{(0)}_{11}}\end{equation*}

Chain rule for the weight in the first layer

(8)   \begin{equation*} \frac{dJ}{dw^{(0)}_{11}} = \frac{dJ}{da^{(1)}_1} \frac{da^{(1)}_1}{dz^{(1)}_1} \frac{dz^{(1)}_1}{dw^{(0)}_{11}} \end{equation*}


    \begin{align*} \frac{dJ}{da^{(1)}_1} = \textrm{already calculated} \\\frac{da^{(1)}_1}{dz^{(1)}_1} = \sigma'(z^{(1)}_1) \\\frac{dz^{(1)}_1}{dw^{(0)}_{11}} = a^{(0)}_1 \end{align*}

Gradient descent for the bias in the first layer

(9)   \begin{equation*} b^{(0)} := b^{(0)} - \alpha \frac{dJ}{db^{(0)}} \end{equation*}

Chain rule for the bias in the first layer

(10)   \begin{equation*} \frac{dJ}{db^{(0)}} = \frac{dJ}{da^{(1)}_1} \frac{da^{(1)}_1}{dz^{(1)}_1} \frac{dz^{(1)}_1}{db^{(0)}} \end{equation*}


(11)   \begin{equation*} \frac{dz^{(1)}_1}{db^{(0)}} = 1\end{equation*}
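Putting equations (2) to (11) together, all four gradients can be computed in one backward sweep. The following is a minimal sketch for a single sample (m = 1) with a sigmoid activation, checked against finite differences. It is meant as a check of the algebra rather than a training implementation; `delta2` and `delta1` are hypothetical shorthand for the sensitivities of the cost with respect to the weighted inputs of the output and hidden neurons.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(a0, y, w0, b0, w1, b1):
    """Single-sample cost of the 1-1-1 network."""
    a1 = sigmoid(a0 * w0 + b0)
    a2 = sigmoid(a1 * w1 + b1)
    return (a2 - y) ** 2

def gradients(a0, y, w0, b0, w1, b1):
    """Return (dJ/dw0, dJ/db0, dJ/dw1, dJ/db1) for one sample (m = 1)."""
    # Forward pass
    a1 = sigmoid(a0 * w0 + b0)
    a2 = sigmoid(a1 * w1 + b1)
    # Last layer: equations (2)-(4)
    delta2 = 2.0 * (a2 - y) * a2 * (1.0 - a2)   # dJ/da2 * da2/dz2
    dJ_dw1 = delta2 * a1
    dJ_db1 = delta2
    # Propagate back to the hidden activation: equations (5)-(6)
    dJ_da1 = delta2 * w1
    # First layer: equations (8)-(11)
    delta1 = dJ_da1 * a1 * (1.0 - a1)           # dJ/da1 * da1/dz1
    dJ_dw0 = delta1 * a0
    dJ_db0 = delta1
    return dJ_dw0, dJ_db0, dJ_dw1, dJ_db1

params = [0.4, 0.1, -0.7, 0.2]   # w0, b0, w1, b1 (arbitrary values)
a0, y = 0.8, 1.0
analytic = gradients(a0, y, *params)

# Central finite differences, perturbing one parameter at a time.
eps = 1e-6
numeric = []
for k in range(len(params)):
    plus = list(params)
    minus = list(params)
    plus[k] += eps
    minus[k] -= eps
    numeric.append((cost(a0, y, *plus) - cost(a0, y, *minus)) / (2 * eps))
```

Each entry of `analytic` should match the corresponding entry of `numeric` closely. A gradient descent step would then subtract the learning rate times each gradient from the matching parameter, as in equations (7) and (9).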

We now have all the pieces needed to update the weights and biases of every layer. We won’t look at a full Python implementation just yet, because this example is of little practical use: there is only one input neuron, i.e. one feature, so the best classification we can get in this case is input = output.

In the next article, we will look at implementing a single-layer neural network with 3 input neurons (also known as a perceptron).