Data analysis and machine learning
Optimisation refers to the process of finding the best set of parameters (as judged by some objective function) from the space of all possible parameters.
Optimisation is hard
Recall the objective function that we introduced in lesson one. This is a scalar-valued quantitative function: one that, if we maximise (or minimise, depending on context), means our model better describes the data. In the first class we showed how best-fitting model parameters can be found by linear algebra for the simplest case of a straight-line model. In later classes we largely omitted discussion of how to estimate the model parameters for more complex models.
Imagine you are tasked with finding the best-fitting model parameters for a complex, non-linear model. Without any knowledge about the model or the dependence between parameters, where would you start? It may surprise you to know that practitioners will frequently construct a grid in parameter space to estimate best-fitting parameters. Suppose you had a vector of unknown parameters \(\mathbf{\theta}\) and you roughly knew some physically plausible bounds for each parameter. If you only have two dimensions (\(D = 2\)) then you could construct an \(N \times N\) grid of \((\theta_1,\theta_2)\) points and evaluate the log likelihood \(\log\mathcal{L}\) at each point. In this fictitious example, taking \(N = 50\) points along each dimension means calculating the log likelihood at a total of 2,500 points. If it's quick to evaluate \(\log\mathcal{L}\) then perhaps that's no problem. We can simply look up the parameters of the function evaluation with the highest \(\log\mathcal{L}\) value, and we may be close enough to correct.
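To make this concrete, here is a minimal sketch of a brute-force grid evaluation for a two-parameter straight-line model. The data, the Gaussian log likelihood, and the parameter bounds are all placeholder assumptions for illustration.

```python
import numpy as np

# Placeholder data: a straight line with slope theta_1 and intercept theta_2,
# plus Gaussian noise (an assumed toy example).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.5 * x + 1.0 + rng.normal(0, 1, x.size)

def log_likelihood(theta_1, theta_2, sigma=1.0):
    residuals = y - (theta_1 * x + theta_2)
    return -0.5 * np.sum((residuals / sigma) ** 2)

# An N x N grid over (assumed) physically plausible bounds for each parameter.
N = 50
theta_1_grid = np.linspace(0, 5, N)
theta_2_grid = np.linspace(-5, 5, N)

log_L = np.array([
    [log_likelihood(t1, t2) for t2 in theta_2_grid]
    for t1 in theta_1_grid
])

# Look up the grid point with the highest log likelihood.
i, j = np.unravel_index(np.argmax(log_L), log_L.shape)
print(f"Best grid point: theta_1 = {theta_1_grid[i]:.2f}, theta_2 = {theta_2_grid[j]:.2f}")
```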
What if you were given a problem with \(D = 50\) dimensions? Using the same grid spacing per dimension leads to \(N^D\) evaluations of \(\log\mathcal{L}\), which is about \(10^{85}\), or more than the number of protons in the observable universe. Even if you took just \(N = 3\) points along each dimension of your grid, you would still end up with about one mole of \(\log\mathcal{L}\) evaluations to make (\(\sim 10^{23}\)). That's clearly intractable. Does that mean that optimisation with \(D \approx 50\) parameters is intractable, too?
Optimisation problems with dimensionality \(D = 50\) (and worse) are extremely common! Deep neural networks will often have billions of parameters. With thoughtful consideration, highly multi-variate optimisation problems can be efficiently solved. So how do we approach these?
Optimisation is simply a process of intelligently moving through parameter space to minimise (or maximise) the objective function.
Let's introduce you to some common optimisation algorithms.
The Nelder-Mead method is a simple (and often slow) optimisation method, otherwise known as the downhill simplex algorithm. It can be useful if you don't know anything about gradients and you want a direct search that only requires function evaluations. While Nelder-Mead can be good for some multivariate problems where no gradient information is available, there are many problems where it fails (e.g., it can converge to non-stationary points) and alternative methods do not.
Nelder-Mead uses a simplex (which you can think of as a geometric blob) to move through the parameter space. It uses the objective function values at each point of the blob to propose a new test point. If the new point is better than the worst existing point, then the worst point is replaced by the new one. To illustrate this I have included Ben Frederickson's excellent visualisation of Nelder-Mead optimisation below.
The contraction and expansion rates are hyperparameters that effectively vary the next step size. Usually these hyperparameters are not exposed for you to vary in standard implementations.
You can see more of the 'geometric blob' behaviour when considering a multivariate problem, like the ones below:
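If you want to experiment with this yourself, scipy ships an implementation of Nelder-Mead. A minimal sketch, using an assumed two-parameter toy objective:

```python
import numpy as np
from scipy.optimize import minimize

# An assumed toy objective: a bowl with its minimum at (3, -1).
def objective(theta):
    x, y = theta
    return (x - 3.0) ** 2 + (y + 1.0) ** 2

x0 = np.array([0.0, 0.0])   # the initial simplex is constructed around this point
result = minimize(objective, x0, method="Nelder-Mead")
print(result.x, result.fun, result.success)
```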
Gradient descent does exactly what it says on the tin. It estimates the gradient \(\nabla{}F(X_i)\) of the objective function \(F(x)\) at the current point, and steps in the direction of the negative gradient (i.e., downhill). You don't have to write down the derivative yourself: the gradient can be calculated by symbolic differentiation, or using automatic differentiation. With a fixed step size \(\alpha\) the current point \(X_i\) is updated by: \[ X_{i + 1} = X_{i} - \alpha\nabla{}F\left(X_i\right) \]
The step size \(\alpha\) (more commonly known as the learning rate) defines how gradient descent will reach the minimum. If the rate is too low then the algorithm will take too long to find the solution, taking tiny steps as it moves through parameter space. If the learning rate is too high it can jump around the minimum and never converge. In both cases you end up needing other optimisation hyperparameters (e.g., a maximum number of evaluations) to make the algorithm stop.
One modification that helps with this problem is to include a line search at every iteration: a mini-search along the step direction that adjusts the step size \(\alpha\) so that the objective function is always decreasing. That means larger steps through relatively flat parts of parameter space, and smaller steps when the gradient is changing rapidly. In practice that means fewer iterations (steps), but it could mean more evaluations of the objective function overall, in order to iteratively set the step size.
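Here is a minimal sketch of the fixed step size update rule above, written in plain numpy for an assumed quadratic objective (no line search):

```python
import numpy as np

# Assumed toy objective F(x) = ||x - target||^2, with its analytic gradient.
target = np.array([2.0, -3.0])

def grad_F(x):
    return 2.0 * (x - target)

alpha = 0.1                    # fixed step size (learning rate)
x = np.zeros(2)                # starting point X_0
for i in range(200):
    step = alpha * grad_F(x)
    x = x - step               # X_{i+1} = X_i - alpha * grad F(X_i)
    if np.linalg.norm(step) < 1e-8:   # crude convergence check on the step size
        break

print(x)   # should be close to target
```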
Knowing the gradient of the objective function is clearly useful for proposing the next step. But there are many functions where the gradient is insufficient for efficiently moving through the parameter space. Rosenbrock's function, \[ f(x, y) = (1 - x)^2 + 100(y - x^2)^2 \] is one such example, where gradient descent performs poorly. For this (and many other) kinds of problems, what we actually want is the curvature of the objective function.
There are many ways to approximate this. One popular method, dubbed conjugate gradient descent, is to use the gradient evaluations from successive iterations to improve the proposal step. That way you're not just using the instantaneous gradient at the point being evaluated: you're also using the instantaneous gradients from the nearby, previous evaluations, which gives you some estimate of the local curvature. See the example below where conjugate gradient descent is used with Rosenbrock's function, compared to the performance above using only gradient descent.
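As a concrete sketch, here is conjugate gradient descent applied to Rosenbrock's function through scipy; the starting point is an arbitrary choice, and the algorithm should converge to the minimum at \((1, 1)\):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.5, 2.0])                        # arbitrary starting point

# Conjugate gradient descent, supplying the analytic gradient of Rosenbrock's function.
result = minimize(rosen, x0, jac=rosen_der, method="CG")
print(result.x)                                   # approaches the minimum at (1, 1)
print(result.nit, result.nfev)                    # iterations and function evaluations
```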
Gradient descent, and conjugate gradient descent, can both be described as first order methods because they make use of gradients (first derivatives) to minimise the objective function. Rosenbrock's function shows that even if you have the gradient information, it is not always enough to estimate the curvature of parameter space. Even if you are doing conjugate gradient descent, where you have estimates of the gradient from successive iterations, you are still only using first order information at each step.
A second order method is one that makes use of the second derivatives (as well as the first derivatives and the objective function itself) to optimise an objective function. Formally, second order methods use the Hessian, the matrix of second partial derivatives of the objective function: \[ H_{jk} = \frac{\partial^2 F}{\partial x_j \, \partial x_k} \quad . \]
BFGS is actually a quasi-Newton method, which means you don't have to supply a function to compute the Hessian. Instead, the Hessian is approximated from successive gradient evaluations (or approximate gradient evaluations). It can be expensive to store these approximations, so there is a commonly used limited-memory variation of this method (L-BFGS-B) that only stores the last \(m\) updates.
Note that even with the Hessian you still have to take an appropriate step size, which means you need to decide on how far you are willing to step in parameter space.
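A minimal sketch of both variants through scipy, again using Rosenbrock's function as an assumed test problem. Note that neither the Hessian nor the gradient is supplied here; both are approximated from function evaluations:

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([-1.5, 2.0])

# BFGS: gradients are approximated numerically, and the (inverse) Hessian is
# built up from those gradient evaluations.
result_bfgs = minimize(rosen, x0, method="BFGS")

# L-BFGS-B: the limited-memory variant, which also accepts simple bound
# constraints on each parameter.
result_lbfgsb = minimize(rosen, x0, method="L-BFGS-B",
                         bounds=[(-5, 5), (-5, 5)])

print(result_bfgs.x, result_lbfgsb.x)   # both should approach (1, 1)
```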
Personal opinion: BFGS is the optimisation algorithm I use most commonly in my research, unless I am training a neural network.
The adaptive moment estimation (Adam) method is commonly used for optimising the parameters of deep neural networks. It has an adaptive learning rate (in the spirit of the line searches we saw above) and is specifically designed for training deep neural networks. It is also a relatively recent optimisation algorithm, first published in 2014.
Adam uses estimates of the first and second moments of the gradients to compute individual learning rates for every parameter in a neural network. By using moving averages that are calculated for each batch of data, Adam has a very simple way to adaptively update the weights of a neural network. Consider the following pseudo-code:
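```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """A minimal sketch of the standard Adam update for one batch;
    the hyperparameter defaults are the commonly quoted values."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of the squared gradient (second moment)
    m_hat = m / (1 - beta1 ** t)                # bias-correct both moving averages
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter update
    return theta, m, v
```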
That seems too simple to be true! Yet it is. The base version of Adam showed great potential to decrease training costs for deep neural networks. And it still does a reasonably good job for most neural networks. However, there are many slight variations that exist. If you experience unusual results with Adam (e.g., not converging to an optimal solution), you may want to consider trying alternative optimisation routines.
Now that you have an idea for how some of these commonly used optimisation algorithms work, let's consider some common problems with optimisation. Because there are many!
We will start off with an aside by saying that the problems below specifically refer to optimisation of continuous variables. Optimisation of other kinds of variables (e.g., integer optimisation) has its own set of curses, including the ones listed below. For the purposes of this course we will focus on optimisation of continuous parameters, rather than discrete ones.
This is probably the curse that people overlook the most, to their peril! Let's assume that you only have five unknown parameters that you need to optimise. You're not yet plagued by the curse of dimensionality. When do you tell your optimisation procedure to stop optimising?
In practice optimisation will continue until some threshold of convergence is met, or the optimisation fails. Optimisation can fail, and it often does so silently. That means anytime you optimise something you need to check the outputs of that optimisation procedure to establish whether it met a convergence threshold, or if the process failed.
It's also not enough to simply say that "the optimisation converged", because you have to specify the threshold for what qualifies as convergence. Even if you don't explicitly set it, there will be some default value already set for you. The exact definition of convergence varies depending on the algorithm being used, but normally you could think of it as something like \[ |\Delta{}f(x)| < \delta \] where \(\delta\) is some small value that we have specified, and \(\Delta{}f(x)\) is the change in the objective function between two successive iterations. That makes sense. But what if we were to start with infinitesimally small steps in \(x\)? The objective function \(f(x)\) will barely change between two such steps, which means our optimisation procedure could formally meet the convergence criterion without actually having moved anywhere. For this reason it is a good idea to check that the optimised parameters do not exactly match your input parameters; otherwise your optimisation algorithm may not have actually done anything! It is also important to be aware of the step size between iterations, and the relative scaling between different parameters.
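As a hedged sketch of what this looks like in practice with scipy: inspect the result object, and compare the optimised parameters against your starting point.

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([-1.5, 2.0])
result = minimize(rosen, x0, method="Nelder-Mead",
                  options={"xatol": 1e-8, "fatol": 1e-8})   # explicit convergence thresholds

# Did the optimiser report success, and why did it stop?
print(result.success, result.message)

# Did the parameters actually move away from the starting point?
if np.allclose(result.x, x0):
    print("Warning: optimised parameters equal the initial guess; nothing happened.")
```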
See the example we started off with in this class. Even if you can intelligently move through parameter space, volume plays funny tricks on you in higher dimensions: as you move through the billions of parameters you have, you will still be plagued by the curse of dimensionality. As you go to higher dimensions you will start to feel the pain of optimisation: there are simply too many parameters to even try to map out where to step next, or to understand which parameters are most important.
Gridding is always the dumbest thing one can do (if you can afford it)
If you don't have an efficient optimisation algorithm then you may also want to fix certain parameters (to values that you think are reasonable, or good guesses) and then try to optimise the other parameters. You may even need to optimise one dimension at a time, leaving all other dimensions fixed to the best value found so far. This is not a fool-proof method, however: if the parameters are strongly correlated, optimising them one at a time can still leave you stuck far from the best solution!
The relative scaling between parameters can be very important. Imagine that you had two parameters that you were going to optimise. One varied between \(-1\) and \(+1\), and the other could vary from \(1\) to \(10^8\). That's seven orders of magnitude difference in the peak-to-peak ranges of those two parameters. Why is this important?
Imagine if the stopping criterion for your optimisation algorithm didn't just depend on the change in the objective function \(\delta\), but also on how much the parameters changed between two successive iterations. In that case it is going to really matter how much parameter 1 changes compared to parameter 2: if parameter 1 were to change by 0.001, the equivalent change in parameter 2 is about 100,000! Would you want your optimisation algorithm to stop when the difference between successive steps is 0.001 in both parameter 1 and parameter 2? Assuming the same scaling, by the time the step size in parameter 2 reaches 0.001, the step size in parameter 1 would be \(10^{-11}\)!
For these reasons, you should carefully consider the relative scalings between different parameters. You want the relative scalings between parameters to be about unity.
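One simple way to achieve this is to let the optimiser work in scaled units and undo the scaling inside the objective function. A minimal sketch, with an assumed toy objective whose two parameters differ in scale by a factor of \(10^8\):

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy objective with badly mismatched parameter scales:
# theta[0] is of order 1, theta[1] is of order 1e8.
def objective(theta):
    return (theta[0] - 0.5) ** 2 + ((theta[1] - 3e7) / 1e8) ** 2

scale = np.array([1.0, 1e8])             # rough peak-to-peak range of each parameter

def scaled_objective(u):
    return objective(u * scale)           # the optimiser sees parameters of order unity

u0 = np.array([0.0, 0.1])                 # initial guess in scaled units
result = minimize(scaled_objective, u0, method="Nelder-Mead")
theta_best = result.x * scale             # transform back to physical units
print(theta_best)
```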
If your objective function is convex, that's a good thing. Convex optimisation is a special subset of optimisation, and in short you could say that if your problem is convex then the parameters you find through optimisation are mathematically guaranteed to be the global minimum. Why is that special? Isn't optimisation always going to bring me to the global minimum?
The answer is obviously no. You may have an objective function that looks like a hilly landscape, with peaks and troughs spanning a high dimensional parameter space. For these purposes we will assume you are trying to maximise the function. If you try to optimise an objective function like that then most likely you will end up at a local maximum: a local hill in parameter space. And then your optimisation procedure will stop. But there may be nearby hills that are far taller than the one the optimisation has stopped at! Basin-hopping or simulated annealing can be good for these kinds of problems.
But if your problem is convex, you never have to worry about these kinds of problems: any local minimum is guaranteed to be the global minimum.
If your problem is convex then initialisation doesn't matter. If your problem is non-convex then initialisation really matters. Optimisation isn't guaranteed to give you the global minimum, even if your relative parameter scaling is good and even if you have a sensible convergence threshold. You can still find that it optimises to some local solution, even though a better solution exists elsewhere in parameter space.
Consider the function
\[
f(x, y) = x^2 + y^2 - a\exp{\left(-\frac{(x - 1)^2 + y^2}{c}\right)} - b\exp{\left(-\frac{(x + 1)^2 + y^2}{d}\right)}
\]
where you can see there are two minima: near \((1, 0)\) and \((-1, 0)\). Four common optimisation algorithms used in deep learning are indicated: stochastic gradient descent (SGD), momentum, RMSProp, and adaptive moment estimation (Adam).
What decides whether the optimisation algorithm will find a local minimum (there can be many) or the global one? The initialisation, and the algorithm itself. You should always try to initialise at a sensible value. If you don't know what that value is, you may want to initialise at many different locations, run multiple optimisation procedures, and only keep the best result. This is a very common thing to do in practice.
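A minimal sketch of that multi-start strategy, applied to the double-well function above with assumed values for the constants \(a\), \(b\), \(c\), and \(d\):

```python
import numpy as np
from scipy.optimize import minimize

# The double-well function above, with assumed constants (for illustration only).
a, b, c, d = 1.0, 2.0, 0.2, 0.2

def f(theta):
    x, y = theta
    return (x**2 + y**2
            - a * np.exp(-((x - 1)**2 + y**2) / c)
            - b * np.exp(-((x + 1)**2 + y**2) / d))

# Run the optimiser from many random starting points and keep only the best result.
rng = np.random.default_rng(42)
best = None
for _ in range(20):
    x0 = rng.uniform(-2, 2, size=2)
    result = minimize(f, x0, method="Nelder-Mead")
    if best is None or result.fun < best.fun:
        best = result

print(best.x, best.fun)   # with b > a here, the deeper (global) minimum is near (-1, 0)
```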
There are many options available when you need to optimise some objective function. The choice of algorithm, the hyperparameters that define the step size, the initialisation value, and the convergence thresholds are all options that demand your decision. You should do your best to choose the most appropriate tool for the job, and you should always be skeptical that optimisation has quietly failed!
In the end, optimisation only provides you with a point estimate of the minimum of the given objective function. But in some scenarios a point estimate is not enough! We sometimes want to draw samples from a distribution and use those samples to quote percentiles (e.g., the 5th, 50th, and 95th percentile). In the next class we will explore how to draw samples from a distribution using Markov-Chain Monte Carlo sampling.
The optimisation visualisations for Nelder Mead, Gradient Descent, Conjugate Gradient Descent, and Adam were developed by Ben Frederickson. See his excellent fmin javascript library.
The visualization of different optimization algorithms used in deep learning was created by Emilien Dupont (block aaf429).