Part 1 – The Simplest ANN

A neural network is a supervised machine learning technique that can be used for classification or regression problems. We will focus here on classification problems only.

We will describe here a very basic artificial neural network (ANN), composed of only an input neuron and an output neuron. This network will not be able to predict anything useful, but it’s a good way to understand the general concepts behind ANNs.

A very basic ANN

This ANN maps an input x (there is only one feature here) to a predicted output \hat{y}. We define a weight w and a bias b between the 2 layers. Training the ANN means adjusting the weight and the bias so that \hat{y} gets as close as possible to the true label y.

\hat{y} is the output of an activation function such as the Sigmoid function. This function squashes any real number into the (0, 1) interval. If the real number is strongly negative, the output is close to 0; if it is strongly positive, the output is close to 1; and an input of 0 gives exactly 0.5.

The Sigmoid function – source: Wikipedia

(1)   \begin{equation*}\sigma(z) = \frac{1}{1+e^{-z}}\end{equation*}
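Equation (1) can be tried out directly. The snippet below is a minimal sketch using only the Python standard library; the function name `sigmoid` and the sample inputs are illustrative choices, not something the article prescribes. It branches on the sign of z, which is one common way to avoid overflow in the exponential for large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid from Equation (1)."""
    # Branch on the sign of z so exp() never receives a large positive
    # argument, which would raise OverflowError.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative inputs: strongly negative, zero, strongly positive.
for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

Both branches compute the same mathematical function; the second is simply Equation (1) with numerator and denominator multiplied by e^{z}.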

We can write the predicted output as follows,

(2)   \begin{equation*}\hat{y} = \sigma(z)\end{equation*}


(3)   \begin{equation*}z = xw+b\end{equation*}
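Equations (2) and (3) together form the forward pass of the network. As a rough sketch (the names `predict`, `x`, `w` and `b` mirror the symbols above, and the numeric values are hypothetical), it could look like this:

```python
import math


def sigmoid(z: float) -> float:
    # Plain form of Equation (1); fine for moderate z
    # (math.exp overflows for z below roughly -709).
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: float, w: float, b: float) -> float:
    """Forward pass: Equation (3) followed by Equation (2)."""
    z = x * w + b      # Equation (3)
    return sigmoid(z)  # Equation (2)


# Hypothetical values chosen so that z = 2.0 * 0.5 - 1.0 = 0.
print(predict(2.0, 0.5, -1.0))  # 0.5
```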

In order to adjust the weight and bias, we need to define a cost function J. In this case, we will use the Mean Square Error (MSE), although other cost functions could be used.

(4)   \begin{equation*}J = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^2\end{equation*}

Where m is the number of observations, y_i is the label of observation i and \hat{y}_i is the label predicted by our network for that observation.
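To make Equation (4) concrete, here is a small sketch of the cost computation. The function name `mse` and the three sample labels and predictions are made up for illustration.

```python
def mse(y_hat: list[float], y: list[float]) -> float:
    """Mean Square Error from Equation (4)."""
    m = len(y)  # number of observations
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / m


# Hypothetical labels and predictions for m = 3 observations.
y = [0.0, 1.0, 1.0]
y_hat = [0.5, 0.5, 1.0]

# Squared errors are 0.25, 0.25 and 0, so J = 0.5 / 3.
print(mse(y_hat, y))
```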

In order to minimise the mean square error (i.e. the cost function J), we will use gradient descent, applied first to the weight.

(5)   \begin{equation*}w := w - \alpha \frac{dJ}{dw}\end{equation*}

Where \alpha is the learning rate. We will use the chain rule of differentiation to calculate the gradient.

(6)   \begin{equation*}\frac{dJ}{dw} = \frac{dJ}{d\hat{y}} \frac{d\hat{y}}{dz} \frac{dz}{dw}\end{equation*}

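Using the previous equations, we can calculate each of the terms. They are written here for a single observation.

    \begin{align*} \frac{dJ}{d\hat{y}} &= \frac{2}{m}(\hat{y}-y) \\ \frac{d\hat{y}}{dz} &= \sigma'(z) = \sigma(z) (1 - \sigma(z)) \\ \frac{dz}{dw} &= x\end{align*}

Because J in Equation (4) is a sum over the m observations, the full gradient is the sum of these products over all observations.

    \begin{equation*}\frac{dJ}{dw} = \frac{2}{m}\sum_{i=1}^{m}(\hat{y}_i-y_i)\,\sigma(z_i)(1-\sigma(z_i))\,x_i\end{equation*}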

We can do the same for the bias.

(7)   \begin{equation*}b := b - \alpha \frac{dJ}{db}\end{equation*}

(8)   \begin{equation*}\frac{dJ}{db} = \frac{dJ}{d\hat{y}} \frac{d\hat{y}}{dz} \frac{dz}{db}\end{equation*}

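    \begin{align*} \frac{dJ}{d\hat{y}} &= \frac{2}{m}(\hat{y}-y) \\ \frac{d\hat{y}}{dz} &= \sigma'(z) = \sigma(z) (1 - \sigma(z)) \\ \frac{dz}{db} &= 1\end{align*}

Only the last term differs from the weight case. Summed over the m observations of Equation (4), the gradient with respect to the bias is

    \begin{equation*}\frac{dJ}{db} = \frac{2}{m}\sum_{i=1}^{m}(\hat{y}_i-y_i)\,\sigma(z_i)(1-\sigma(z_i))\end{equation*}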
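One way to gain confidence in the chain-rule results of Equations (6) and (8) is to compare them with a numerical approximation. The sketch below sums the per-observation terms over the m observations of Equation (4) and checks the result against a central finite difference. The function names, the toy data set and the step size `eps` are all assumptions made for this example.

```python
import math


def sigmoid(z: float) -> float:
    # Plain form of Equation (1); fine for the moderate z used here.
    return 1.0 / (1.0 + math.exp(-z))


def cost(xs, ys, w, b):
    """Mean Square Error of the network, Equation (4)."""
    m = len(xs)
    return sum((sigmoid(x * w + b) - y) ** 2 for x, y in zip(xs, ys)) / m


def gradients(xs, ys, w, b):
    """Chain-rule gradients, Equations (6) and (8), summed over observations."""
    m = len(xs)
    dJ_dw = 0.0
    dJ_db = 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid(x * w + b)
        # dJ/dy_hat * dy_hat/dz, shared by the weight and the bias
        common = (2.0 / m) * (y_hat - y) * y_hat * (1.0 - y_hat)
        dJ_dw += common * x    # dz/dw = x
        dJ_db += common * 1.0  # dz/db = 1
    return dJ_dw, dJ_db


# Hypothetical toy data: negative inputs labelled 0, positive inputs labelled 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]
w, b = 0.3, -0.1

# Central finite differences as an independent check.
eps = 1e-6
num_dw = (cost(xs, ys, w + eps, b) - cost(xs, ys, w - eps, b)) / (2 * eps)
num_db = (cost(xs, ys, w, b + eps) - cost(xs, ys, w, b - eps)) / (2 * eps)

dJ_dw, dJ_db = gradients(xs, ys, w, b)
print("dJ/dw:", dJ_dw, "numerical:", num_dw)
print("dJ/db:", dJ_db, "numerical:", num_db)
```

If the analytical and numerical values agree to several decimal places, the derivation is very likely implemented correctly.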

That’s it! If we repeat these updates enough times with a suitably small learning rate, the cost function will decrease and the weight and bias will move towards values that minimise it.
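Putting the pieces together, the whole procedure fits in a few lines. This is only a sketch under assumed settings: the function name `train`, the toy data, the starting values w = b = 0, the learning rate of 0.5 and the 1000 iterations are illustrative choices, not values taken from the article.

```python
import math


def sigmoid(z: float) -> float:
    # Sign-branching form of Equation (1) to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train(xs, ys, w=0.0, b=0.0, alpha=0.5, epochs=1000):
    """Gradient descent on the weight and bias, Equations (5) and (7)."""
    m = len(xs)
    history = []  # cost J recorded before each update
    for _ in range(epochs):
        y_hats = [sigmoid(x * w + b) for x in xs]
        history.append(sum((p - t) ** 2 for p, t in zip(y_hats, ys)) / m)
        # Gradients from Equations (6) and (8), summed over observations
        dJ_dw = sum((2.0 / m) * (p - t) * p * (1.0 - p) * x
                    for p, t, x in zip(y_hats, ys, xs))
        dJ_db = sum((2.0 / m) * (p - t) * p * (1.0 - p)
                    for p, t in zip(y_hats, ys))
        w -= alpha * dJ_dw  # Equation (5)
        b -= alpha * dJ_db  # Equation (7)
    return w, b, history


# Hypothetical toy data: negative inputs labelled 0, positive inputs labelled 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]

w, b, history = train(xs, ys)
print("first cost:", history[0])
print("last cost:", history[-1])
print("w:", w, "b:", b)
```

On this toy data the recorded cost should fall from its initial value as the weight grows; with other data or a larger learning rate the behaviour may differ.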

In the next article, we will add a hidden layer between the input and output layers to illustrate how the weights and biases are adjusted using Backpropagation.