Part 3 – The Single-layer Perceptron

We propose to implement a simple artificial neural network (ANN) from scratch using only the NumPy library. Although deep learning libraries such as TensorFlow or PyTorch are more efficient, the goal here is to better understand how ANNs work.

We will implement an ANN with 3 input neurons, one per feature, connected directly to a single output neuron. Here is the architecture.

w^{(L)}_{ij} is the weight between the i^{th} neuron in layer L and the j^{th} neuron in layer L+1, so the weights leaving the input layer carry the superscript (0).


The activation of the predicted output \hat{y} can be written as follows,

(1)   \begin{equation*} \hat{y} = a^{(1)}_1 = \sigma(z^{(1)}_1) \end{equation*}


(2)   \begin{equation*} z^{(1)}_1 = a^{(0)}_1 w^{(0)}_{11} + a^{(0)}_2 w^{(0)}_{21} + a^{(0)}_3 w^{(0)}_{31} + b^{(0)} \end{equation*}
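As a minimal sketch of equations (1) and (2), the forward pass for a single observation can be written in NumPy. The function names `sigmoid` and `forward` and the weight and bias values are assumptions chosen for illustration; they are not taken from the original code.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(a0, w, b):
    """Equation (2) followed by equation (1): z = a0 . w + b, y_hat = sigma(z)."""
    z = np.dot(a0, w) + b
    return sigmoid(z)

# Illustrative values only (hypothetical, not from the article).
a0 = np.array([0.0, 1.0, 0.0])   # input activations a^{(0)}_1, a^{(0)}_2, a^{(0)}_3
w = np.array([0.1, 0.2, 0.3])    # weights w^{(0)}_{11}, w^{(0)}_{21}, w^{(0)}_{31}
b = 0.0                          # bias b^{(0)}

y_hat = forward(a0, w, b)        # equals sigmoid(0.2) for these values
print(y_hat)
```

Because the weighted sum is a dot product, the same function also accepts a whole matrix of observations (one row per observation) without modification.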


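The MSE cost function over m observations is defined as follows, where a^{(1)}_{1,i} is the predicted output for the i^{th} observation and y_i is its true label,

(3)   \begin{equation*}J = \frac{1}{m}\sum_{i=1}^{m}\left(a^{(1)}_{1,i}-y_i\right)^2\end{equation*}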
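Equation (3) maps to a one-line NumPy expression. The sketch below uses a hypothetical helper name `mse_cost` and made-up predictions and labels purely to show the arithmetic.

```python
import numpy as np

def mse_cost(y_hat, y):
    """Equation (3): mean of the squared differences over the m observations."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

# Made-up predictions and labels for three observations.
cost = mse_cost([0.9, 0.2, 0.4], [1, 0, 0])
print(cost)  # (0.01 + 0.04 + 0.16) / 3, approximately 0.07
```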

Let’s apply Gradient Descent to the first weight w^{(0)}_{11}.

(4)   \begin{equation*} w^{(0)}_{11} := w^{(0)}_{11} - \alpha \frac{\partial J}{\partial  w^{(0)}_{11}} \end{equation*}

Applying the chain rule to the partial derivative gives,

(5)   \begin{equation*} \frac{\partial J}{\partial w^{(0)}_{11}} = \frac{\partial J}{\partial a^{(1)}_1} \frac{\partial a^{(1)}_1}{\partial z^{(1)}_1} \frac{\partial z^{(1)}_1}{\partial w^{(0)}_{11}} \end{equation*}

Isolating each term, we have

    \begin{align*} \frac{\partial J}{\partial a^{(1)}_1} = \frac{2}{m}(a^{(1)}_1-y)  \\\frac{\partial a^{(1)}_1}{\partial z^{(1)}_1} = \sigma'(z^{(1)}_1) \\\frac{\partial z^{(1)}_1}{\partial w^{(0)}_{11}} = a^{(0)}_1 \end{align*}

If we repeat the same steps for the other 2 weights and the bias, we get

    \begin{align*} \frac{\partial J}{\partial w^{(0)}_{11}} = \frac{2}{m}(a^{(1)}_1-y)  \sigma'(z^{(1)}_1) a^{(0)}_1 \\\frac{\partial J}{\partial w^{(0)}_{21}} = \frac{2}{m}(a^{(1)}_1-y)  \sigma'(z^{(1)}_1) a^{(0)}_2 \\\frac{\partial J}{\partial w^{(0)}_{31}} = \frac{2}{m}(a^{(1)}_1-y)  \sigma'(z^{(1)}_1) a^{(0)}_3 \\\frac{\partial J}{\partial b^{(0)}} =  \frac{2}{m}(a^{(1)}_1-y)  \sigma'(z^{(1)}_1)\end{align*}
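One way to gain confidence in these expressions is to compare them with a numerical derivative. The sketch below assumes the sigmoid activation, for which \sigma'(z) = \sigma(z)(1-\sigma(z)), sums the per-observation terms over all m observations, and checks the result against central finite differences. The helper names and the random test data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    """MSE cost J for inputs X (m x 3), labels y (m,), weights w (3,), bias b."""
    return np.mean((sigmoid(X @ w + b) - y) ** 2)

def gradients(X, y, w, b):
    """Analytic gradients: (2/m) * (a - y) * sigma'(z) * a0, summed over observations."""
    m = X.shape[0]
    a1 = sigmoid(X @ w + b)
    delta = (2.0 / m) * (a1 - y) * a1 * (1.0 - a1)  # sigma'(z) = a1 * (1 - a1)
    return X.T @ delta, np.sum(delta)

# Random test data (hypothetical, only used to check the algebra).
rng = np.random.default_rng(0)
X = rng.random((5, 3))
y = rng.integers(0, 2, size=5).astype(float)
w = rng.normal(size=3)
b = 0.1

dw, db = gradients(X, y, w, b)

# Central finite differences for each weight and for the bias.
eps = 1e-6
dw_num = np.zeros(3)
for k in range(3):
    step = np.zeros(3)
    step[k] = eps
    dw_num[k] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * eps)
db_num = (cost(X, y, w, b + eps) - cost(X, y, w, b - eps)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-7), abs(db - db_num) < 1e-7)
```

If the analytic and numerical values disagree, the usual suspects are a missing 2/m factor or a derivative evaluated at the activation instead of at z.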

We are now ready to implement!

Numpy implementation

The implementation is inspired by this article.

The problem

Before tackling the implementation itself, we need to define a problem to solve. Let’s build a toy dataset for a simple classification problem. Suppose we have some information about obesity, smoking habits, and exercise habits of five people. We also know whether these people are diabetic or not. We can encode this information as follows:

Person     Smoking  Obesity  Exercise  Diabetic
Person 1   0        1        0         1
Person 2   0        0        1         0
Person 3   1        0        0         0
Person 4   1        1        0         1
Person 5   1        1        1         1

“In the above table, we have five columns: Person, Smoking, Obesity, Exercise, and Diabetic. Here 1 refers to true and 0 refers to false. For instance, the first person has values of 0, 1, 0 which means that the person doesn’t smoke, is obese, and doesn’t exercise. The person is also diabetic.

It is clearly evident from the dataset that a person’s obesity is indicative of him being diabetic. Our task is to create a neural network that is able to predict whether an unknown person is diabetic or not given data about his exercise habits, obesity, and smoking habits. This is a type of supervised learning problem where we are given inputs and corresponding correct outputs and our task is to find the mapping between the inputs and the outputs.”

from [1]
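The five records can be encoded as NumPy arrays, assuming the column order Smoking, Obesity, Exercise given in the quoted description. The variable names `X` and `y` are a common convention rather than something fixed by the source.

```python
import numpy as np

# One row per person; columns are smoking, obesity, exercise.
X = np.array([
    [0, 1, 0],   # Person 1
    [0, 0, 1],   # Person 2
    [1, 0, 0],   # Person 3
    [1, 1, 0],   # Person 4
    [1, 1, 1],   # Person 5
], dtype=float)

# Diabetic label for each person (1 = diabetic, 0 = not diabetic).
y = np.array([1, 0, 0, 1, 1], dtype=float)

print(X.shape, y.shape)  # (5, 3) (5,)
```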

The code

We will base our implementation on the neural network architecture described above.

We start by importing some libraries and defining the sigmoid function and its derivative. We then define our dataset and the hyperparameters of the model. During the training phase, we repeatedly perform a feedforward step followed by a backpropagation step. We can then plot the evolution of the cost, weights and bias against the number of iterations (epochs). Finally, we test our neural network on some example inputs.
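The steps just described can be sketched as follows. This is not the original listing: the learning rate, the number of epochs and the zero initialisation are assumptions chosen to keep the run deterministic, and the plotting step is replaced by recording the cost, weights and bias in plain lists that could be passed to any plotting library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Dataset: columns are smoking, obesity, exercise; y is the diabetic label.
X = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([1, 0, 0, 1, 1], dtype=float)

# Hyperparameters and initial values (assumed, not taken from the article).
learning_rate = 0.5
epochs = 10000
w = np.zeros(3)
b = 0.0

m = X.shape[0]
cost_history, weight_history, bias_history = [], [], []

for epoch in range(epochs):
    # Feedforward: equations (2) and (1).
    z = X @ w + b
    a1 = sigmoid(z)

    # Record the cost (equation (3)) and the current parameters.
    cost_history.append(np.mean((a1 - y) ** 2))
    weight_history.append(w.copy())
    bias_history.append(b)

    # Backpropagation: chain-rule gradients summed over the m observations.
    delta = (2.0 / m) * (a1 - y) * sigmoid_derivative(z)
    grad_w = X.T @ delta
    grad_b = np.sum(delta)

    # Gradient descent update: equation (4).
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

def predict(x):
    """Predicted probability of being diabetic for a feature vector x."""
    return sigmoid(np.dot(x, w) + b)

p_smoker = predict(np.array([1.0, 0.0, 0.0]))  # smokes, not obese, no exercise
p_obese = predict(np.array([0.0, 1.0, 0.0]))   # does not smoke, obese, no exercise
print(cost_history[0], cost_history[-1], p_smoker, p_obese)
```

A prediction above 0.5 can be read as "diabetic" and one below 0.5 as "not diabetic". Random initial weights are also common; zeros are used here only so that repeated runs give identical numbers.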

In example 1, a person who smokes, is not obese and does not exercise is classified as not diabetic. In example 2, a person who does not smoke, is obese and does not exercise is classified as diabetic.

The training error (MSE) keeps decreasing with the number of iterations, which is a good sign. We can also notice that the weight w_{21} becomes predominant after many iterations. This is because the 2nd feature (obesity) matches the output variable (diabetic) for every observation in this dataset.
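The link between the obesity feature and the label can be checked directly on the dataset. The sketch below computes the Pearson correlation of each feature column with the diabetic label using `np.corrcoef`; the dictionary name `correlations` is hypothetical.

```python
import numpy as np

# Dataset: columns are smoking, obesity, exercise; y is the diabetic label.
X = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([1, 0, 0, 1, 1], dtype=float)

feature_names = ["smoking", "obesity", "exercise"]

# Pearson correlation between each feature column and the label.
correlations = {
    name: np.corrcoef(X[:, k], y)[0, 1]
    for k, name in enumerate(feature_names)
}

for name, value in correlations.items():
    print(f"{name}: {value:+.3f}")

# The obesity column is identical to the label column.
print(np.array_equal(X[:, 1], y))
```

With only five observations this is a property of the toy dataset, not evidence about real-world risk factors.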


This article described the theory of a very simple neural network with one input layer and one output layer. It was implemented in plain Numpy and applied to a simple classification problem with 3 features and 5 observations.

This type of neural network is called a perceptron, and it is capable of classifying linearly separable data. However, most real-world problems require identifying non-linear decision boundaries. In the next article, I will describe how a multi-layer perceptron can be used to estimate non-linear decision boundaries.