Neural networks I

Data analysis and machine learning

Neural networks have become overwhelmingly popular in the last 30 years. That's right: the idea of using neural networks for supervised learning has existed for over thirty years!

The original idea for neural networks stems from a simplification of how we believe humans think. Our brains have on the order of 100 billion neurons. Neurons are connected in a complex web, and the connections between neurons have varying widths, which means that each connection differs slightly in its electrical conductivity. The simplification of how humans think is this: each time we think about something or remember something, a series of neurons send electrical signals between each other. For each neuron, the incoming electrical charge can be sufficient to 'activate' (or 'fire') that neuron, allowing it to send an output signal to other neurons. Naively, one might think that if you connected enough neurons in the right way, you might end up with something that can learn or think.

This view is too simplistic to describe the human brain, or even to describe a single neuron! Nevertheless, the mathematical description of how these simplified neurons are connected — which is what we refer to when we say "neural network" — has proven to be extremely flexible in learning structure from large data sets. Indeed, it can be shown that a sufficiently large neural network is a universal interpolator: if the training set is sufficiently large, a neural network can learn to predict any relationship between the training set inputs and their outputs. Being universal interpolators makes them extremely flexible, which is why we have seen their use emerging in nearly every field.

In this lesson we will cover a brief history of neural networks so that you understand the relevant terminology, and introduce the basic pros and cons. At the end of this lesson we will cover back-propagation: one of the crucial ingredients to why neural networks work at all! The history of neural networks is very interesting. When they were first described, the authors thought "well, these seem equally great and useless", because it was believed that you would never be able to train a neural network to do anything useful. To do anything useful you would need a huge neural network, and it seemed impossible to train such a network unless you had an unimaginably large computer and a clever way to determine the network weights, which seemed an equally impossible task! For these reasons, neural networks (or perceptrons, as they were known at the time) did not receive much attention for many years. It was only when computers became fast enough, and once the back-propagation algorithm was developed, that neural networks surged into the widespread use we see today. Interestingly, if you ask those early authors what they think the next big step in machine learning will be, they agree: "It won't be in deep learning. At least not as we currently describe it. It will be in unsupervised learning techniques."

\[ \newcommand\vectheta{\boldsymbol{\theta}} \newcommand\vecdata{\mathbf{y}} \newcommand\model{\boldsymbol{\mathcal{M}}} \newcommand\vecx{\boldsymbol{x}} \newcommand{\transpose}{^{\scriptscriptstyle \top}} \renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\cov}{\mathbf{C}} \]

A mathematical simplification of a neuron

The basic unit in a neural network is called a neuron (also often called a node, unit, or cell). A neuron receives inputs from other neurons, or from an external source, and it computes an output. The output of one neuron may be the final output of the network, or it may be the input to another neuron. Every input — to every neuron — has an associated weight \(w\), which is assigned based on the importance of that input relative to the other inputs to that neuron.

The neuron takes the inputs and applies a non-linear function (called the activation function) to the weighted sum of its inputs. Because the activation function is non-linear, a single neuron can learn a non-linear relationship in the data, and combinations of neurons can learn very complex non-linear relationships in the data. This is essentially the point of neural networks: as a physicist or astronomer you will know how certain parameters or phenomena affect each other, and you can write down a causal relationship between these properties. Neural networks have no such domain knowledge. Instead they take in huge amounts of training data, and their job is simply to learn relationships in the data. In other words: you can use a neural network to find features in the data, rather than 'engineering features' yourself. Every activation function takes in a single value, performs a mathematical operation, and returns the output of that operation.

Consider what happens to the inputs of this neuron above, before the activation function. The input values \(\vec{x}\) are those from our training set for one observation. In other words, we are showing the three properties \((x_1, x_2, x_3)\) of one observation in the training set (\(D = 3\)). Each input value has an associated weight \(\vec{w}\) that we need to determine. You can see that we are also putting in a fake input value of 1, where we label the weight of that input as \(b\). The value we provide to the activation function \(f\) is then \[ y = f\left(\vec{w}\vec{x} + b\right) \] which ought to be familiar to you by now! (If you're thinking: why don't we just write a design matrix that includes the inputs and the 1 value, and re-cast the problem as \(\vec{Aw}\)? Then you're right. It's the exact same thing — and a nicer description! But I digress.) Note the vector notation here: the inputs \(\vec{x}\) are a column vector and the weights \(\vec{w}\) are a row vector, such that the product \(\vec{wx}\) is a scalar. The bias \(b\) is also a scalar, as is the output of the activation function. So we will feed the scalar \(\vec{wx} + b\) to the activation function \(f\), and here we will adopt a sigmoid as the activation function (we will cover different activation functions later this week): \[ f(x) = S(x) = \frac{1}{1 + e^{-x}} \] You will recall that we covered the sigmoid function when we discussed logistic regression. A sigmoid function takes any input value and squishes the output to be between 0 and 1:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

x = np.linspace(-6, 6, 1000)
S = lambda x: 1 / (1 + np.exp(-x))

fig, ax = plt.subplots(figsize=(4, 4))
ax.plot(x, S(x), c='k', lw=2)
ax.set_xlim(*x[[0, -1]])
ax.axvline(0, c="#666666", ls=":", zorder=-1, lw=1)
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$S(x)$")
ax.xaxis.set_major_locator(MaxNLocator(5))
ax.yaxis.set_major_locator(MaxNLocator(5))
fig.tight_layout()
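To make this concrete, here is a minimal sketch of a single neuron in numpy. The weights, bias, and input values below are made up purely for illustration:

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = np.array([0.5, -1.2, 3.0])   # (x1, x2, x3) for one observation
w = np.array([0.1, 0.4, -0.2])   # the weights we would need to learn
b = 0.7                          # the weight on the fake input of 1

# The neuron output: the activation function applied to the weighted sum.
y = sigmoid(w @ x + b)           # y = f(wx + b)
print(y)                         # a single scalar between 0 and 1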

Now that we have a mathematical simplification of a single neuron, let's combine multiple neurons together into one layer of neurons. Here the term layer is used to describe the structure of how neurons are connected, and the term will come up a lot. Neurons in the same layer are typically not connected to each other. Let us consider that we have some input layer with two inputs, and then two neurons in the same (hidden) layer. The outputs from that layer are then fed into one neuron (in the output layer), and the output of that neuron is what gives us our predictions. This is what our network might look like, and it is what early neural networks looked like (back when they were called perceptrons):

You can see now that our input data are given as inputs to multiple neurons, and each of those neurons has its own weights. (If you're thinking: what's to stop each of those neurons finding the exact same weights? The answer, so far, is nothing.) You can also see that the outputs of these neurons go to another neuron before the final output is given. So in total there are three neurons in this network: two neurons in a hidden layer, and one neuron in the output layer. Given some training data, you would be able to figure out the values of all the weights in the neural network.
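Since every neuron in a layer performs the same kind of operation, just with its own weights, a whole layer can be computed as a single matrix product. Here is a sketch of the three-neuron network above (two inputs, two hidden neurons, one output neuron), with arbitrary example weights chosen only for illustration:

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = np.array([0.5, -1.2])            # the two inputs

# Hidden layer: one column of weights per neuron, plus one bias each.
W_hidden = np.array([[ 0.1, -0.3],
                     [ 0.4,  0.2]])  # shape (n_inputs, n_hidden)
b_hidden = np.array([0.0, 0.1])

# Output layer: one weight per hidden neuron, plus a bias.
w_output = np.array([0.7, -0.5])
b_output = 0.2

beta_hidden = sigmoid(x @ W_hidden + b_hidden)       # hidden layer outputs
y_pred = sigmoid(beta_hidden @ w_output + b_output)  # network prediction
print(y_pred)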

Neural network architecture

Even though each neuron can learn a non-linear mapping from the data, a three-neuron network is going to be too simplistic for most realistic problems. For the kinds of problems we're interested in, we will want a more extensive neural network. There are two ways you could immediately extend the existing network:

  1. increase the number of neurons in the same layer (go wider), and/or
  2. increase the number of layers (go deeper).

Even if you only increased the number of neurons in the same layer, an arbitrarily wide single-layer neural network can learn very complex non-linearities in the data! And if you also increased the number of layers, you can see that you would end up with an extremely flexible neural network. Since we are doing "machine learning" (and not "principled data analysis") and we don't care about interpretability, what is stopping us from creating a very wide (many neurons per layer) and very deep (many layers) neural network?

The first problem you will encounter will be optimisation. Even if you just have a very wide single (hidden) layer neural network, you will find this to be an optimisation nightmare if you just use naive strategies. This is because the dimensionality is high, even for a network with one hidden layer, and the problem is extremely non-convex. So you could end up in a bad valley of optimisation space. That also means that your initialisation for the weights will be extremely important. And not only is there no guarantee that you would converge to the global solution (if there is one), but there are no guarantees that you would converge to any solution!

Backpropagation

For all these reasons (and others, which you will learn about), neural networks were initially considered a cute idea but intractable for any real problem. Two things changed that:

  1. Computers became faster.
  2. The backpropagation algorithm was developed.

Backpropagation (or backprop, as the kids call it) is the process of comparing the predicted outputs of a neural network with the desired outputs in the training set, and propagating the error backwards through the network, layer by layer. The error is used to update the weights throughout the network: weights that contributed to making the worst predictions are updated the most, and weights that contributed to making reasonable predictions are only updated a small amount. This process is repeated until the total error is below some threshold.

Backpropagation itself is just a clever use of the chain rule for differentiation, but it is critically important for you to understand it because there are some subtle ways in which it can cause everything to fail spectacularly! If you don't understand backpropagation then you will find it much harder to debug why your neural network isn't working.

Let us imagine that we have a data set of house sales in some city. For each property we know the average price per square meter in that suburb, the number of bedrooms, and the price the property sold for. We will build a neural network to predict house prices for other properties, given just the number of bedrooms and the average price per square meter for the suburb. Here is what the training set looks like:

Average price per square meter    Number of bedrooms    Sold price
2,104                             3                     399,900
1,600                             3                     329,900
2,400                             3                     369,000
1,416                             2                     232,000
3,000                             4                     539,900
1,985                             4                     299,900
1,534                             3                     314,900
1,427                             3                     198,999
1,380                             3                     212,000
1,494                             3                     242,500

We only have ten observations. That might seem crazy, because you may have heard that you need massive data sets (e.g., thousands or hundreds of thousands of examples) to train a neural network. That's true for most situations, but we don't need a lot of data to demonstrate backpropagation. We will use the following neural network structure: one hidden layer with five neurons, and one neuron in the output layer (because there is only one thing we want to predict: the house price). The structure of our network might look something like this:

Here biases are shown in grey and inputs in blue. The weights for most edges are marked, so you can see how we will structure the math. Let's go through all the math before we code this up. First we will state that our original inputs and outputs are \(\vec{x}^\ast\) and \(\vec{y}^\ast\), and for good reasons that we will explain later, we will normalise these so that they have zero mean and unit variance. We will refer to the normalised inputs and outputs as \(\vec{x}\) and \(\vec{y}\). We will name neurons by their layer and their order from top to bottom: H1 for the first neuron in the hidden layer (e.g., the one connected to \(w_{17}\)), and O1 for the single neuron in the output layer. We will also define the net input to the activation function \(f\) of a neuron as \(\alpha\) (e.g., \(\alpha_{H1}\) for the net input to neuron H1), and the output of the activation function of a neuron as \(\beta\) (e.g., \(\beta_{H1} = f(\alpha_{H1})\)).

The inputs and outputs for each neuron in the hidden layer are: \[ \begin{eqnarray} \alpha_{H1} &=& w_1 + x_{1}w_{6} + x_{2}w_{11} \\ \beta_{H1} &=& f(\alpha_{H1}) \end{eqnarray} \] \[ \begin{eqnarray} \alpha_{H2} &=& w_2 + x_{1}w_{7} + x_{2}w_{12} \\ \beta_{H2} &=& f(\alpha_{H2}) \end{eqnarray} \] \[ \begin{eqnarray} \alpha_{H3} &=& w_3 + x_{1}w_{8} + x_{2}w_{13} \\ \beta_{H3} &=& f(\alpha_{H3}) \end{eqnarray} \] \[ \begin{eqnarray} \alpha_{H4} &=& w_4 + x_{1}w_{9} + x_{2}w_{14} \\ \beta_{H4} &=& f(\alpha_{H4}) \end{eqnarray} \] \[ \begin{eqnarray} \alpha_{H5} &=& w_5 + x_{1}w_{10} + x_{2}w_{15} \\ \beta_{H5} &=& f(\alpha_{H5}) \end{eqnarray} \]

And for the output layer: \[ \begin{eqnarray} \alpha_{O1} &=& w_{16} + \beta_{H1}w_{17} + \beta_{H2}w_{18} + \beta_{H3}w_{19} + \beta_{H4}w_{20} + \beta_{H5}w_{21}\\ \beta_{O1} &=& f(\alpha_{O1}) \end{eqnarray} \] where \(\beta_{O1}\) is also known as \(\vec{y}_\textrm{pred}\).

We will initialise the weights randomly. In reality we often do initialise the weights of a neural network randomly, but we don't do so uniformly — we normally want to draw from some distribution and clip values within ranges. You will learn more about that this week. For now let us assume that we have initialised the weights in The Right Way™. First let's code up our neural network. Later on you will see how to do this in PyTorch (by Facebook™) and TensorFlow (by Google™), but to get us started we will code everything in numpy.

import numpy as np

# Load in the data.
X = np.array([
    [2104, 3],
    [1600, 3],
    [2400, 3],
    [1416, 2],
    [3000, 4],
    [1985, 4],
    [1534, 3],
    [1427, 3],
    [1380, 3],
    [1494, 3],
])
Y = np.array([
    399900, 329900, 369000, 232000, 539900,
    299900, 314900, 198999, 212000, 242500,
]).reshape((-1, 1))

# For good reasons that we will explain later, we need to "normalise"
# the data so that it has zero mean and unit variance.
# (Just like what we did when we covered dimensionality reduction!)
X_mean, X_std = (np.mean(X, axis=0), np.std(X, axis=0))
x = (X - X_mean) / X_std

Y_mean, Y_std = (np.mean(Y), np.std(Y))
y = (Y - Y_mean) / Y_std

# Let's define some variables for the number of inputs and outputs.
N, D_in = x.shape
N, D_out = y.shape
H = 5  # The number of neurons in the hidden layer.

# Let's define our activation function.
sigmoid = lambda x: 1/(1 + np.exp(-x))

# Let's initialise all the weights randomly.
# (This is terrible code. I am only naming variables like this
# so you can see exactly how things work. You should use matrix
# multiplication!)
# Weights for the bias terms in the hidden layer:
w1, w2, w3, w4, w5 = np.random.randn(H)
# Weights for x1 to all neurons.
w6, w7, w8, w9, w10 = np.random.randn(H)
# Weights for x2 to all neurons.
w11, w12, w13, w14, w15 = np.random.randn(H)
# Weights for hidden layer outputs to output neuron (w16 is the bias).
w16, w17, w18, w19, w20, w21 = np.random.randn(H + 1)

# Ok, let's code up our neural network!

# Hidden layer. (The column of ones is the fake input for the biases;
# it must come first to match the first row of the weight matrix.)
hidden_layer_inputs = np.hstack([
    np.ones((N, 1)),
    x
])
hidden_layer_weights = np.array([
    [ w1,  w2,  w3,  w4,  w5],
    [ w6,  w7,  w8,  w9, w10],
    [w11, w12, w13, w14, w15]
])
hidden_layer_sums = hidden_layer_inputs @ hidden_layer_weights
hidden_layer_outputs = sigmoid(hidden_layer_sums)

# Output layer.
output_layer_inputs = np.hstack([
    np.ones((N, 1)),
    hidden_layer_outputs
])
output_layer_weights = np.array([
    [w16, w17, w18, w19, w20, w21]
]).T
output_layer_sums = output_layer_inputs @ output_layer_weights
y_pred = sigmoid(output_layer_sums)

# Calculate our loss function: the total error in our predictions
# compared to the target.
loss = 0.5 * np.sum((y_pred - y)**2)
print(f"Initial loss: {loss:.0e}")

Here we have initialised our weights randomly and completed our first forward pass through the network, where we are using the inputs and the randomised weights to make predictions \(\vec{y}_\textrm{pred}\) for house prices. These initial predictions will not be very good because our weights were randomly chosen. Indeed, the initial loss \(E\) \[ E = \frac{1}{2}\sum_{i}^{N}(y_i - y_{\textrm{pred},i})^2 \] is 6, which is a lot given that we have 10 observations that are all zero mean and unit variance! But this does show that we can feed inputs into a neural network and get it to provide us with an output! Now let's explore how we can use backpropagation to improve the weights, and improve our predictions.

The idea of backprop is to propagate the errors in our predictions back through the network and to iteratively improve the weights. Recall that in Stochastic Gradient Descent we use the derivative of the outputs with respect to the parameters to improve our estimates of those parameters. We are going to do the exact same thing here. First let's consider that we are going to update our estimate of \(w_{16}\). To do so we will need the gradient of the loss with respect to \(w_{16}\) \[ \frac{\partial{}E}{\partial{}w_{16}} \] which we know from the chain rule is \[ \frac{\partial{}E}{\partial{}w_{16}} = \frac{\partial{}E}{\partial{}\beta_{O1}} \frac{\partial{}\beta_{O1}}{\partial{}\alpha_{O1}} \frac{\partial{}\alpha_{O1}}{\partial{}w_{16}} \] where we will have to calculate each term. First let us recall that since \(y_{\textrm{pred},i} \equiv \beta_{O1,i}\), \[ E = \frac{1}{2}\sum_{i=1}^{N}\left(y_i - \beta_{O1,i}\right)^2 \] so \[ \frac{\partial{}E}{\partial{}\beta_{O1}} = -\left(y - \beta_{O1}\right) \quad . \] For the next term recall that \[ \beta_{O1} = f(\alpha_{O1}) \] where \(f\) is a sigmoid \[ \beta_{O1} = f(\alpha_{O1}) = \frac{1}{1 + e^{-\alpha_{O1}}} \] such that \[ \frac{\partial{}\beta_{O1}}{\partial{}\alpha_{O1}} = \beta_{O1}\left(1 - \beta_{O1}\right) \quad . \] And finally, \[ \alpha_{O1} = w_{16} + \beta_{H1}w_{17} + \beta_{H2}w_{18} + \beta_{H3}w_{19} + \beta_{H4}w_{20} + \beta_{H5}w_{21} \] such that \[ \frac{\partial{}\alpha_{O1}}{\partial{}w_{16}} = 1 \quad . \]

Bringing these terms together, \[ \begin{eqnarray} \frac{\partial{}E}{\partial{}w_{16}} &=& \frac{\partial{}E}{\partial{}\beta_{O1}} \frac{\partial{}\beta_{O1}}{\partial{}\alpha_{O1}} \frac{\partial{}\alpha_{O1}}{\partial{}w_{16}} \\ \frac{\partial{}E}{\partial{}w_{16}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right) \end{eqnarray} \] allows us to update our estimate of \(w_{16}\), after we adopt some learning rate \(\eta\) (we will use \(\eta = 10^{-3}\) in the code below): \[ w_{16}^{(t+1)} = w_{16}^{(t)} - \eta\frac{\partial{}E}{\partial{}w_{16}} \quad. \] But we don't want to update one weight at a time; we want to calculate all the gradients (during the backward pass) and then update all the weights at once before making another prediction. Let us suppose that we have carefully calculated all the gradients we need for the weights \(w_{16}\) to \(w_{21}\), and now we are at the stage of calculating the updated weights \(w_{1}\) to \(w_{15}\). For \(w_1\) the procedure is very similar: we flow gradients back through the graph such that \[ \frac{\partial{}E}{\partial{}w_{1}} = \frac{\partial{}E}{\partial{}\beta_{O1}} \frac{\partial{}\beta_{O1}}{\partial{}\alpha_{O1}} \frac{\partial{}\alpha_{O1}}{\partial{}\beta_{H1}} \frac{\partial{}\beta_{H1}}{\partial{}\alpha_{H1}} \frac{\partial{}\alpha_{H1}}{\partial{}w_{1}} \] where we already have \[ \begin{eqnarray} \frac{\partial{}E}{\partial{}\beta_{O1}} &=& -\left(y - \beta_{O1}\right) \\ \frac{\partial{}\beta_{O1}}{\partial{}\alpha_{O1}} &=& \beta_{O1}\left(1 - \beta_{O1}\right) \end{eqnarray} \] and for \(\frac{\partial{}\alpha_{O1}}{\partial{}\beta_{H1}}\), \[ \alpha_{O1} = w_{16} + \beta_{H1}w_{17} + \beta_{H2}w_{18} + \beta_{H3}w_{19} + \beta_{H4}w_{20} + \beta_{H5}w_{21} \] such that \[ \frac{\partial{}\alpha_{O1}}{\partial{}\beta_{H1}} = w_{17} \] and just like \(\frac{\partial{}\beta_{O1}}{\partial{}\alpha_{O1}}\), \[ \frac{\partial{}\beta_{H1}}{\partial{}\alpha_{H1}} = \beta_{H1}\left(1 - \beta_{H1}\right) \] which leads to the final term \(\frac{\partial{}\alpha_{H1}}{\partial{}w_{1}}\) where \[ \alpha_{H1} = w_1 + x_{1}w_{6} + x_{2}w_{11} \] such that \[ \frac{\partial{}\alpha_{H1}}{\partial{}w_{1}} = 1 \quad . \] Bringing it all together yields \[ \begin{eqnarray} \frac{\partial{}E}{\partial{}w_{1}} &=& \frac{\partial{}E}{\partial{}\beta_{O1}} \frac{\partial{}\beta_{O1}}{\partial{}\alpha_{O1}} \frac{\partial{}\alpha_{O1}}{\partial{}\beta_{H1}} \frac{\partial{}\beta_{H1}}{\partial{}\alpha_{H1}} \frac{\partial{}\alpha_{H1}}{\partial{}w_{1}} \\ \frac{\partial{}E}{\partial{}w_{1}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{17}\beta_{H1}\left(1 - \beta_{H1}\right) \end{eqnarray} \] such that the updated estimate for \(w_1^{(t+1)}\) is \[ w_1^{(t+1)} = w_1^{(t)} - \eta\frac{\partial{}E}{\partial{}w_{1}} \quad . \] And you can see that this process can be repeated (and needs to be) for all weight terms.
In doing so you should find: \[ \begin{eqnarray} \frac{\partial{}E}{\partial{}w_{1}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{17}\beta_{H1}\left(1 - \beta_{H1}\right) \\ \frac{\partial{}E}{\partial{}w_{2}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{18}\beta_{H2}\left(1 - \beta_{H2}\right) \\ \frac{\partial{}E}{\partial{}w_{3}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{19}\beta_{H3}\left(1 - \beta_{H3}\right) \\ \frac{\partial{}E}{\partial{}w_{4}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{20}\beta_{H4}\left(1 - \beta_{H4}\right) \\ \frac{\partial{}E}{\partial{}w_{5}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{21}\beta_{H5}\left(1 - \beta_{H5}\right) \\ \frac{\partial{}E}{\partial{}w_{6}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{17}\beta_{H1}\left(1 - \beta_{H1}\right)x_1 \\ \frac{\partial{}E}{\partial{}w_{7}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{18}\beta_{H2}\left(1 - \beta_{H2}\right)x_1 \\ \frac{\partial{}E}{\partial{}w_{8}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{19}\beta_{H3}\left(1 - \beta_{H3}\right)x_1 \\ \frac{\partial{}E}{\partial{}w_{9}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{20}\beta_{H4}\left(1 - \beta_{H4}\right)x_1 \\ \frac{\partial{}E}{\partial{}w_{10}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{21}\beta_{H5}\left(1 - \beta_{H5}\right)x_1 \\ \frac{\partial{}E}{\partial{}w_{11}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{17}\beta_{H1}\left(1 - \beta_{H1}\right)x_2 \\ \frac{\partial{}E}{\partial{}w_{12}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{18}\beta_{H2}\left(1 - \beta_{H2}\right)x_2 \\ \frac{\partial{}E}{\partial{}w_{13}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{19}\beta_{H3}\left(1 - \beta_{H3}\right)x_2 \\ \frac{\partial{}E}{\partial{}w_{14}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{20}\beta_{H4}\left(1 - \beta_{H4}\right)x_2 \\ \frac{\partial{}E}{\partial{}w_{15}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)w_{21}\beta_{H5}\left(1 - \beta_{H5}\right)x_2 \\ \frac{\partial{}E}{\partial{}w_{16}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right) \\ \frac{\partial{}E}{\partial{}w_{17}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)\beta_{H1} \\ \frac{\partial{}E}{\partial{}w_{18}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)\beta_{H2} \\ \frac{\partial{}E}{\partial{}w_{19}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)\beta_{H3} \\ \frac{\partial{}E}{\partial{}w_{20}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)\beta_{H4} \\ \frac{\partial{}E}{\partial{}w_{21}} &=& \left(\beta_{O1} - y\right)\beta_{O1}\left(1 - \beta_{O1}\right)\beta_{H5} \end{eqnarray} \]

Let's code this up.

num_epochs = 100000
losses = np.empty(num_epochs)
eta = 1e-3

for epoch in range(num_epochs):

    # Forward pass.
    # Hidden layer.
    hidden_layer_inputs = np.hstack([
        np.ones((N, 1)),
        x
    ])
    hidden_layer_weights = np.array([
        [ w1,  w2,  w3,  w4,  w5],
        [ w6,  w7,  w8,  w9, w10],
        [w11, w12, w13, w14, w15]
    ])
    alpha_h = hidden_layer_inputs @ hidden_layer_weights
    beta_h = sigmoid(alpha_h)

    # Output layer.
    output_layer_inputs = np.hstack([
        np.ones((N, 1)),
        beta_h
    ])
    output_layer_weights = np.array([
        [w16, w17, w18, w19, w20, w21]
    ]).T
    alpha_o = output_layer_inputs @ output_layer_weights
    beta_o = sigmoid(alpha_o)
    y_pred = beta_o

    # Calculate our loss function: half the sum of squared errors in our
    # predictions compared to the target.
    loss = 0.5 * np.sum((y_pred - y)**2)
    if not epoch % 100:
        print(epoch, loss)
    losses[epoch] = loss

    # Backward pass: calculate gradients.
    s = (beta_o - y) * beta_o * (1 - beta_o)
    dE_dw21 = s * beta_h[:, [4]]
    dE_dw20 = s * beta_h[:, [3]]
    dE_dw19 = s * beta_h[:, [2]]
    dE_dw18 = s * beta_h[:, [1]]
    dE_dw17 = s * beta_h[:, [0]]
    dE_dw16 = s
    dE_dw15 = s * w21 * beta_h[:, [4]] * (1 - beta_h[:, [4]]) * x[:, [1]]
    dE_dw14 = s * w20 * beta_h[:, [3]] * (1 - beta_h[:, [3]]) * x[:, [1]]
    dE_dw13 = s * w19 * beta_h[:, [2]] * (1 - beta_h[:, [2]]) * x[:, [1]]
    dE_dw12 = s * w18 * beta_h[:, [1]] * (1 - beta_h[:, [1]]) * x[:, [1]]
    dE_dw11 = s * w17 * beta_h[:, [0]] * (1 - beta_h[:, [0]]) * x[:, [1]]
    dE_dw10 = s * w21 * beta_h[:, [4]] * (1 - beta_h[:, [4]]) * x[:, [0]]
    dE_dw9  = s * w20 * beta_h[:, [3]] * (1 - beta_h[:, [3]]) * x[:, [0]]
    dE_dw8  = s * w19 * beta_h[:, [2]] * (1 - beta_h[:, [2]]) * x[:, [0]]
    dE_dw7  = s * w18 * beta_h[:, [1]] * (1 - beta_h[:, [1]]) * x[:, [0]]
    dE_dw6  = s * w17 * beta_h[:, [0]] * (1 - beta_h[:, [0]]) * x[:, [0]]
    dE_dw5  = s * w21 * beta_h[:, [4]] * (1 - beta_h[:, [4]])
    dE_dw4  = s * w20 * beta_h[:, [3]] * (1 - beta_h[:, [3]])
    dE_dw3  = s * w19 * beta_h[:, [2]] * (1 - beta_h[:, [2]])
    dE_dw2  = s * w18 * beta_h[:, [1]] * (1 - beta_h[:, [1]])
    dE_dw1  = s * w17 * beta_h[:, [0]] * (1 - beta_h[:, [0]])

    # Now update all the weights at once using gradient descent.
    # (We sum the per-observation gradients over the whole training set.)
    w1 = w1 - eta * np.sum(dE_dw1)
    w2 = w2 - eta * np.sum(dE_dw2)
    w3 = w3 - eta * np.sum(dE_dw3)
    w4 = w4 - eta * np.sum(dE_dw4)
    w5 = w5 - eta * np.sum(dE_dw5)
    w6 = w6 - eta * np.sum(dE_dw6)
    w7 = w7 - eta * np.sum(dE_dw7)
    w8 = w8 - eta * np.sum(dE_dw8)
    w9 = w9 - eta * np.sum(dE_dw9)
    w10 = w10 - eta * np.sum(dE_dw10)
    w11 = w11 - eta * np.sum(dE_dw11)
    w12 = w12 - eta * np.sum(dE_dw12)
    w13 = w13 - eta * np.sum(dE_dw13)
    w14 = w14 - eta * np.sum(dE_dw14)
    w15 = w15 - eta * np.sum(dE_dw15)
    w16 = w16 - eta * np.sum(dE_dw16)
    w17 = w17 - eta * np.sum(dE_dw17)
    w18 = w18 - eta * np.sum(dE_dw18)
    w19 = w19 - eta * np.sum(dE_dw19)
    w20 = w20 - eta * np.sum(dE_dw20)
    w21 = w21 - eta * np.sum(dE_dw21)

Before we run this, let's have a brief aside on why (and how!) you should check your derivatives when coding something like this up.

Check your derivatives!

Whenever you are coding up something that involves derivatives you should follow these rules:

  1. Don't calculate them yourself unless you have to. (Let a computer do it for you; see the sketch below!)
  2. If you have to calculate them yourself, check it numerically!
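For the first rule: automatic differentiation libraries compute exact derivatives of your code for you. As a tiny sketch using torch.autograd (the same machinery behind the PyTorch code later in this lesson), for a toy one-neuron loss:

import torch

# E = 0.5 * (y - f(w*x + b))**2, with a sigmoid activation f.
x, y = torch.tensor(0.5), torch.tensor(1.0)
w = torch.tensor(0.3, requires_grad=True)
b = torch.tensor(-0.1, requires_grad=True)

E = 0.5 * (y - torch.sigmoid(w * x + b))**2
E.backward()            # backpropagation through the expression graph

print(w.grad, b.grad)   # exact gradients, no hand-derived calculus needed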

Let's check our derivatives numerically:

import numpy as np
from scipy.optimize import check_grad

def f(p):
    w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, \
        w14, w15, w16, w17, w18, w19, w20, w21 = p

    # Hidden layer.
    hidden_layer_inputs = np.hstack([
        np.ones((N, 1)),
        x
    ])
    hidden_layer_weights = np.array([
        [ w1,  w2,  w3,  w4,  w5],
        [ w6,  w7,  w8,  w9, w10],
        [w11, w12, w13, w14, w15]
    ])
    alpha_h = hidden_layer_inputs @ hidden_layer_weights
    beta_h = sigmoid(alpha_h)

    # Output layer.
    output_layer_inputs = np.hstack([
        np.ones((N, 1)),
        beta_h
    ])
    output_layer_weights = np.array([
        [w16, w17, w18, w19, w20, w21]
    ]).T
    alpha_o = output_layer_inputs @ output_layer_weights
    beta_o = sigmoid(alpha_o)
    y_pred = beta_o
    return 0.5 * np.sum((y_pred - y)**2)

def g(p):
    w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, \
        w14, w15, w16, w17, w18, w19, w20, w21 = p

    # Hidden layer.
    hidden_layer_inputs = np.hstack([
        np.ones((N, 1)),
        x
    ])
    hidden_layer_weights = np.array([
        [ w1,  w2,  w3,  w4,  w5],
        [ w6,  w7,  w8,  w9, w10],
        [w11, w12, w13, w14, w15]
    ])
    alpha_h = hidden_layer_inputs @ hidden_layer_weights
    beta_h = sigmoid(alpha_h)

    # Output layer.
    output_layer_inputs = np.hstack([
        np.ones((N, 1)),
        beta_h
    ])
    output_layer_weights = np.array([
        [w16, w17, w18, w19, w20, w21]
    ]).T
    alpha_o = output_layer_inputs @ output_layer_weights
    beta_o = sigmoid(alpha_o)

    # Calculate gradients.
    s = (beta_o - y) * beta_o * (1 - beta_o)
    dE_dw21 = s * beta_h[:, [4]]
    dE_dw20 = s * beta_h[:, [3]]
    dE_dw19 = s * beta_h[:, [2]]
    dE_dw18 = s * beta_h[:, [1]]
    dE_dw17 = s * beta_h[:, [0]]
    dE_dw16 = s
    dE_dw15 = s * w21 * beta_h[:, [4]] * (1 - beta_h[:, [4]]) * x[:, [1]]
    dE_dw14 = s * w20 * beta_h[:, [3]] * (1 - beta_h[:, [3]]) * x[:, [1]]
    dE_dw13 = s * w19 * beta_h[:, [2]] * (1 - beta_h[:, [2]]) * x[:, [1]]
    dE_dw12 = s * w18 * beta_h[:, [1]] * (1 - beta_h[:, [1]]) * x[:, [1]]
    dE_dw11 = s * w17 * beta_h[:, [0]] * (1 - beta_h[:, [0]]) * x[:, [1]]
    dE_dw10 = s * w21 * beta_h[:, [4]] * (1 - beta_h[:, [4]]) * x[:, [0]]
    dE_dw9  = s * w20 * beta_h[:, [3]] * (1 - beta_h[:, [3]]) * x[:, [0]]
    dE_dw8  = s * w19 * beta_h[:, [2]] * (1 - beta_h[:, [2]]) * x[:, [0]]
    dE_dw7  = s * w18 * beta_h[:, [1]] * (1 - beta_h[:, [1]]) * x[:, [0]]
    dE_dw6  = s * w17 * beta_h[:, [0]] * (1 - beta_h[:, [0]]) * x[:, [0]]
    dE_dw5  = s * w21 * beta_h[:, [4]] * (1 - beta_h[:, [4]])
    dE_dw4  = s * w20 * beta_h[:, [3]] * (1 - beta_h[:, [3]])
    dE_dw3  = s * w19 * beta_h[:, [2]] * (1 - beta_h[:, [2]])
    dE_dw2  = s * w18 * beta_h[:, [1]] * (1 - beta_h[:, [1]])
    dE_dw1  = s * w17 * beta_h[:, [0]] * (1 - beta_h[:, [0]])

    return np.array([
        np.sum(dE_dw1), np.sum(dE_dw2), np.sum(dE_dw3),
        np.sum(dE_dw4), np.sum(dE_dw5), np.sum(dE_dw6),
        np.sum(dE_dw7), np.sum(dE_dw8), np.sum(dE_dw9),
        np.sum(dE_dw10), np.sum(dE_dw11), np.sum(dE_dw12),
        np.sum(dE_dw13), np.sum(dE_dw14), np.sum(dE_dw15),
        np.sum(dE_dw16), np.sum(dE_dw17), np.sum(dE_dw18),
        np.sum(dE_dw19), np.sum(dE_dw20), np.sum(dE_dw21)
    ])

assert 1e-6 > check_grad(f, g, np.random.normal(size=21))

If all goes well, no AssertionError will be raised, and we can trust our hand-calculated derivatives.

After we run our neural network we should see the loss decreasing with epoch. Let's see:

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

fig, ax = plt.subplots(figsize=(4, 4))
ax.plot(losses)
ax.set_xlabel(r"$\textrm{Epoch}$")
ax.set_ylabel(r"$\textrm{Loss}$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(6))
fig.tight_layout()

Nice! But is that it? Do we really have to code every neural network ourselves and calculate all those derivatives? Of course not. Once you understand something well enough to code it up yourself, you should use a package that is well-engineered for performance. Here is the equivalent code in PyTorch:

import torch
import numpy as np

X = np.array([
    [2104, 3],
    [1600, 3],
    [2400, 3],
    [1416, 2],
    [3000, 4],
    [1985, 4],
    [1534, 3],
    [1427, 3],
    [1380, 3],
    [1494, 3],
])
Y = np.atleast_2d([
    399900, 329900, 369000, 232000, 539900,
    299900, 314900, 198999, 212000, 242500,
]).T

Y_mean, Y_std = (np.mean(Y), np.std(Y))
y = (Y - Y_mean) / Y_std

# Normalise X -- we will explain why we do this in later classes.
X_mean, X_std = (np.mean(X, axis=0), np.std(X, axis=0))
x = (X - X_mean) / X_std

N, D_in = x.shape
N, D_out = y.shape
H = 5

x = torch.tensor(x, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)

# Construct the model.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.Sigmoid(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

epochs = 100000
learning_rate = 1e-4
losses = np.empty(epochs)

for t in range(epochs):
    # Forward pass.
    y_pred = model(x)

    # Compute loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
    losses[t] = loss.item()

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass.
    loss.backward()

    # Update the weights using gradient descent.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

Let's plot the loss as a function of epoch for this model:

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Plot the losses with time.
fig, ax = plt.subplots(figsize=(4, 4))
ax.plot(losses)
ax.set_xlabel(r"$\textrm{Epoch}$")
ax.set_ylabel(r"$\textrm{Loss}$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(6))
fig.tight_layout()

Much nicer! Now let's look at the actual predictions from the model, compared to the actual sold prices.

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(
    Y / 1000,
    (y_pred.detach().flatten() * Y_std + Y_mean) / 1000
)
lims = np.array([
    ax.get_xlim(),
    ax.get_ylim()
])
lims = (np.min(lims), np.max(lims))
ax.plot(
    lims,
    lims,
    lw=1, ls=":", c="#666666", zorder=-1
)
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel(r"$\textrm{Actual price}$ $(\textrm{k})$")
ax.set_ylabel(r"$\textrm{Predicted price}$ $(\textrm{k})$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(6))
fig.tight_layout()

Looks good! But why was the loss curve for our numpy implementation so different from that of the PyTorch model? There are many reasons for this, which — now that you know how backpropagation works — we will go through.

Backpropagation problems

Vanishing gradients

In our network we are using sigmoids as the activation function. That's fine, but let's consider what happens if we were to initialise the weights incorrectly, or if we didn't scale the input data properly. If our weights are too large then the net input \(\alpha\) into the sigmoid function \(f\) is going to be large, which means the output \(\beta\) is going to be near 1. In theory that is OK: the neuron has been activated. But consider the backpropagation step! The derivative of a sigmoid \[ \beta = f(\alpha) = \frac{1}{1 + e^{-\alpha}} \] is \[ \frac{\partial\beta}{\partial\alpha} = \beta(1-\beta) \quad . \] If the net input \(\alpha\) is too high then the output \(\beta\) saturates at 1, which means that the derivative across that neuron will be (nearly) zero! And depending on where that neuron is in the network, many of our gradients could end up vanishing because they get multiplied by that near-zero value. So we could lose many gradients simply because one neuron had weights that were too high! And because the gradient is zero, there will be no updates to the relevant weights. That means vanishing gradients like this — caused by bad initialisation, noisy optimisation, or an incorrectly scaled data set — can totally kill a large part of your network! So if you're using sigmoids (or tanh functions) then you should worry about saturating them.
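You can see this numerically. Even a moderately large net input drives the sigmoid derivative \(\beta(1-\beta)\) towards zero:

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

for alpha in (0, 2, 5, 10, 20):
    beta = sigmoid(alpha)
    print(alpha, beta, beta * (1 - beta))

# alpha = 0 gives the maximum derivative of 0.25; by alpha = 10 the
# derivative is ~5e-5, and at alpha = 20 it is ~2e-9: effectively zero.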

Dead neurons

This is a different, but related, problem. Let's introduce another activation function: the Rectified Linear Unit (ReLU). The function is defined by \[ f(x) = \max(0, x) \] and has the following activation profile:

That means if the input value is less than zero then the neuron won't fire, and the gradient through it is exactly zero. If a neuron's weights end up such that its input is always negative, you effectively have the same problem as a saturated sigmoid function: that neuron is permanently dead, and it will zero out every gradient in the same multiplicative chain. This can happen through bad initialisation, or through an update step during training where the learning rate was too high.
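A quick numerical sketch of the problem: for any negative input, a ReLU outputs zero and its gradient is exactly zero, so no gradient can flow back through that neuron to revive it.

import numpy as np

relu = lambda x: np.maximum(0, x)
relu_grad = lambda x: (x > 0).astype(float)  # derivative of max(0, x)

alpha = np.array([-5.0, -0.1, 0.5, 3.0])
print(relu(alpha))       # [0.  0.  0.5 3. ]
print(relu_grad(alpha))  # [0. 0. 1. 1.] -- zero gradient for negative inputs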

Explosive gradients

Hopefully you're starting to get the picture that gradients across one neuron can have a flow-on effect in the rest of the network. But zeroing out neurons is not the only problem. In Recurrent Neural Networks (RNNs; which we will cover next week) you can have neurons that feed information between states for many time steps. When that happens, any small multiplicative factor builds up over time. It is like repeatedly multiplying a number \(a\) by a factor \(b\): after \(n\) steps you have \(ab^n\). If \(|b| > 1\) then the gradients will eventually explode towards infinity, and if \(|b| < 1\) then they will eventually vanish to zero. In either case, you are going to have a bad time!
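A one-line numerical sketch of this effect, applying the same factor \(b\) over and over:

import numpy as np

steps = np.arange(0, 101, 25)
for b in (1.1, 0.9):
    print(b, b**steps)

# b = 1.1 grows by ~4 orders of magnitude over 100 steps (exploding),
# while b = 0.9 decays towards zero (vanishing).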

Noisy updates and aggressive learning rates

Often when you train a neural network you won't want to use all of the data to update the weights. Instead you might decide to use a single observation, or a small batch of data, to update the weights (a minimal sketch of this is below). When you do this you are performing fewer calculations and only approximating the gradient of the entire data set. If you are using a learning rate that is too high, then the error introduced by this approximate gradient can cause the weights to be updated too much (or too little), and send your neural network into a spiralling mess. If this happens you will see your loss function accelerating upwards with increasing epochs. (And this can happen even if you are using the entire data set!) Your learning rate needs to be set appropriately, otherwise backpropagation will fail you.
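Here is a minimal sketch of one epoch of mini-batch updates. The names here are hypothetical (in particular grad_fn, which stands in for whatever computes your gradients), not part of any library:

import numpy as np

def sgd_epoch(w, x, y, grad_fn, eta=1e-3, batch_size=2):
    # One epoch of mini-batch SGD: shuffle the data, then update the
    # weights once per batch. grad_fn is a hypothetical stand-in for
    # your gradient calculation (e.g., backpropagation).
    indices = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        # Each batch gradient only *approximates* the full-data gradient,
        # so if eta is too large these noisy updates can diverge.
        w = w - eta * grad_fn(w, x[batch], y[batch])
    return w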

Summary

Neural networks are a simplified mathematical description of how we think humans think. Although they are a significant simplification of how a neuron actually works, they are sufficiently flexible that they are often referred to as universal interpolators because they can learn any non-linear relationship from data. There are many choices you have to face when designing a neural network, including the structure of the network, initialisation, optimisation, activation functions, et cetera. Even if you get all those choices correct, it can still be difficult to train a neural network! Backpropagation was the key step that allowed us to train complex neural networks, but backpropagation will not solve all your problems! You must think carefully about how backpropagation will be executed on your network in order to understand how it will fail.

In the next class we are going to cover how to train neural networks in practice. This will include how to choose between activation functions, how to initialise your weights correctly, how to prepare your data correctly for neural networks, and techniques like dropout. After that we will go through neural network architecture design.

 
