Data analysis and machine learning
Optimisation refers to the process of finding the best set of parameters (as judged by some objective function) from the space of all possible parameters.
Optimisation is hard
Recall the objective function that we introduced in lesson one. This is a scalar-valued quantitative function: one that, if we maximise (or minimise, depending on context), means our model better describes the data. In the first class we showed how best-fitting model parameters can be found by linear algebra for the simplest case of a straight-line model. In later classes we largely omitted discussion of how to estimate the model parameters for more complex models.
Imagine you are tasked with finding the best-fitting model parameters for a complex, non-linear model. Without any knowledge about the model or the dependence between parameters, where would you start? It may surprise you to know that practitioners will frequently construct a grid in parameter space to estimate best-fitting parameters. Suppose you had a vector of unknown parameters \(\mathbf{\theta}\) and you roughly knew some physically plausible bounds for each parameter. If you only have two dimensions (\(D = 2\)) then you could construct an \(N \times N\) grid of \((\theta_1,\theta_2)\) points and evaluate the log likelihood \(\log\mathcal{L}\) at each point. In this fictitious example, taking \(N = 50\) points along each dimension means calculating the log likelihood at a total of 2,500 points. If it's quick to evaluate \(\log\mathcal{L}\) then perhaps that's no problem. We can simply look up the parameters of the function evaluation with the highest \(\log\mathcal{L}\) value, and we may be close enough to correct.
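To make this concrete, here is a minimal sketch of a brute-force grid evaluation for a two-parameter straight-line model. The data, the Gaussian log likelihood, and the parameter bounds are all placeholder assumptions for illustration.

```python
import numpy as np

# Placeholder data: a straight line with slope theta_1 and intercept theta_2,
# plus Gaussian noise (an assumed toy example).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.5 * x + 1.0 + rng.normal(0, 1, x.size)

def log_likelihood(theta_1, theta_2, sigma=1.0):
    residuals = y - (theta_1 * x + theta_2)
    return -0.5 * np.sum((residuals / sigma) ** 2)

# An N x N grid over (assumed) physically plausible bounds for each parameter.
N = 50
theta_1_grid = np.linspace(0, 5, N)
theta_2_grid = np.linspace(-5, 5, N)

log_L = np.array([
    [log_likelihood(t1, t2) for t2 in theta_2_grid]
    for t1 in theta_1_grid
])

# Look up the grid point with the highest log likelihood.
i, j = np.unravel_index(np.argmax(log_L), log_L.shape)
print(f"Best grid point: theta_1 = {theta_1_grid[i]:.2f}, theta_2 = {theta_2_grid[j]:.2f}")
```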
What if you were given a problem with \(D = 50\) dimensions? Using the same grid spacing per dimension leads to \(N^D\) evaluations of \(\log\mathcal{L}\), which is about \(10^{85}\), or more than the number of protons in the observable universe. Even if you took just \(N = 3\) points along each dimension of your grid, you would still end up with about one mole of \(\log\mathcal{L}\) evaluations to make (\(\sim 10^{23}\)). That's clearly intractable. Does that mean that optimisation with \(D \approx 50\) parameters is intractable, too?
Optimisation problems with dimensionality \(D = 50\) (and worse) are extremely common! Deep neural networks will often have billions of parameters. With thoughtful consideration, highly multi-variate optimisation problems can be efficiently solved. So how do we approach these?
Optimisation is simply a process of intelligently moving through parameter space to minimise (or maximise) the objective function.
Let's introduce you to some common optimisation algorithms.
The Nelder-Mead method is a simple (and often slow) optimisation method, otherwise known as the downhill simplex algorithm. It can be useful if you don't know anything about gradients and you want a direct search that only requires function evaluations. While Nelder-Mead can be good for some multivariate problems where no gradient information is available, there are many problems where it fails (e.g., it can converge to non-stationary points) and alternative methods do not.
Nelder-Mead uses a simplex (which you can think of as a geometric blob) to move through the parameter space. It uses the objective function values at each point of the blob to propose a new test point. If the new point is better than the worst existing point, then the worst point is replaced by the new one. To illustrate this I have included Ben Frederickson's excellent visualisation of Nelder-Mead optimisation below.
The contraction and expansion rates are hyperparameters that effectively vary the next step size. Usually these hyperparameters are not exposed for you to vary in standard implementations.
You can see more of the 'geometric blob' behaviour when considering a multivariate problem, like the ones below:
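If you want to experiment with this yourself, scipy ships an implementation of Nelder-Mead. A minimal sketch, using an assumed two-parameter toy objective:

```python
import numpy as np
from scipy.optimize import minimize

# An assumed toy objective: a bowl with its minimum at (3, -1).
def objective(theta):
    x, y = theta
    return (x - 3.0) ** 2 + (y + 1.0) ** 2

x0 = np.array([0.0, 0.0])   # the initial simplex is constructed around this point
result = minimize(objective, x0, method="Nelder-Mead")
print(result.x, result.fun, result.success)
```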
Gradient descent does exactly what it says on the tin. It estimates the gradient \(\nabla{}F(X_i)\) of the objective function \(F(x)\) at the current point, and steps in the direction of the negative gradient (i.e., downhill). You don't have to write down the derivative yourself: the gradient can be calculated by symbolic differentiation, or using automatic differentiation. With a fixed step size \(\alpha\) the current point \(X_i\) is updated by: \[ X_{i + 1} = X_{i} - \alpha\nabla{}F\left(X_i\right) \]
The step size \(\alpha\) (more commonly known as the learning rate) defines how gradient descent will reach the minimum. If the rate is too low then the algorithm will take too long to find the solution, taking tiny steps as it moves through parameter space. If the learning rate is too high it can jump around the minimum and never converge. In both cases you end up needing other optimisation hyperparameters (e.g., a maximum number of evaluations) to make the algorithm stop.
One modification that helps with this problem is to include a line search at every iteration: a mini-search along the step direction that adjusts the step size \(\alpha\) so that the objective function is always decreasing. That means larger steps through relatively flat parts of parameter space, and smaller steps when the gradient is changing rapidly. In practice that means fewer iterations (steps), but it could mean more evaluations of the objective function overall, in order to iteratively set the step size.
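Here is a minimal sketch of the fixed step size update rule above, written in plain numpy for an assumed quadratic objective (no line search):

```python
import numpy as np

# Assumed toy objective F(x) = ||x - target||^2, with its analytic gradient.
target = np.array([2.0, -3.0])

def grad_F(x):
    return 2.0 * (x - target)

alpha = 0.1                    # fixed step size (learning rate)
x = np.zeros(2)                # starting point X_0
for i in range(200):
    step = alpha * grad_F(x)
    x = x - step               # X_{i+1} = X_i - alpha * grad F(X_i)
    if np.linalg.norm(step) < 1e-8:   # crude convergence check on the step size
        break

print(x)   # should be close to target
```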
Knowing the gradient of the objective function is clearly useful for proposing the next step. But there are many functions where the gradient is insufficient for efficiently moving through the parameter space. Rosenbrock's function, \[ f(x, y) = (1 - x)^2 + 100(y - x^2)^2 \] is one such example, where gradient descent performs poorly. For this (and many other) kinds of problems, what we actually want is the curvature of the objective function.
There are many ways to approximate this. One popular method, dubbed conjugate gradient descent, is to use the gradient evaluations from successive iterations to improve the proposal step. That way you're not just using the instantaneous gradient at the point being evaluated: you're also using the instantaneous gradients from the nearby, previous evaluations, which gives you some estimate of the local curvature. See the example below where conjugate gradient descent is used with Rosenbrock's function, compared to the performance above using only gradient descent.
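As a concrete sketch, here is conjugate gradient descent applied to Rosenbrock's function through scipy; the starting point is an arbitrary choice, and the algorithm should converge to the minimum at \((1, 1)\):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.5, 2.0])                        # arbitrary starting point

# Conjugate gradient descent, supplying the analytic gradient of Rosenbrock's function.
result = minimize(rosen, x0, jac=rosen_der, method="CG")
print(result.x)                                   # approaches the minimum at (1, 1)
print(result.nit, result.nfev)                    # iterations and function evaluations
```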
Gradient descent, and conjugate gradient descent, can both be described as first order methods because they make use of gradients (first derivatives) to minimise the objective function. Rosenbrock's function shows that even if you have the gradient information, it is not always enough to estimate the curvature of parameter space. Even if you are doing conjugate gradient descent, where you have estimates of the gradient from successive iterations, you are still only using first order information at each step.
A second order method is one that makes use of the second derivatives (as well as the first derivatives and the objective function itself) to optimise an objective function. Formally, second order methods use the Hessian, the matrix of second partial derivatives of the objective function: \[ H_{jk} = \frac{\partial^2 F}{\partial x_j \, \partial x_k} \quad . \]
BFGS is actually a quasi-Newton method, which means you don't have to supply a function to compute the Hessian. Instead, the Hessian is approximated from successive gradient evaluations (or approximate gradient evaluations). It can be expensive to store these approximations, so there is a commonly used limited-memory variation of this method (L-BFGS-B) that only stores the last \(m\) updates.
Note that even with the Hessian you still have to take an appropriate step size, which means you need to decide on how far you are willing to step in parameter space.
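A minimal sketch of both variants through scipy, again using Rosenbrock's function as an assumed test problem. Note that neither the Hessian nor the gradient is supplied here; both are approximated from function evaluations:

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([-1.5, 2.0])

# BFGS: gradients are approximated numerically, and the (inverse) Hessian is
# built up from those gradient evaluations.
result_bfgs = minimize(rosen, x0, method="BFGS")

# L-BFGS-B: the limited-memory variant, which also accepts simple bound
# constraints on each parameter.
result_lbfgsb = minimize(rosen, x0, method="L-BFGS-B",
                         bounds=[(-5, 5), (-5, 5)])

print(result_bfgs.x, result_lbfgsb.x)   # both should approach (1, 1)
```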
Personal opinion: BFGS is the optimisation algorithm I use most commonly in my research, unless I am training a neural network.
The adaptive moment estimation (Adam) method is commonly used for optimising the parameters of deep neural networks. It has an adaptive learning rate (in the spirit of the line searches we saw above) and is specifically designed for training deep neural networks. It is also a relatively recent optimisation algorithm, first published in 2014.
Adam uses estimates of the first and second moments of the gradients to compute individual learning rates for every parameter in a neural network. By using moving averages that are calculated for each batch of data, Adam has a very simple way to adaptively update the weights of a neural network. Consider the following pseudo-code:
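```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """A minimal sketch of the standard Adam update for one batch;
    the hyperparameter defaults are the commonly quoted values."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of the squared gradient (second moment)
    m_hat = m / (1 - beta1 ** t)                # bias-correct both moving averages
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter update
    return theta, m, v
```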
That seems too simple to be true! Yet it is. The base version of Adam showed great potential to decrease training costs for deep neural networks. And it still does a reasonably good job for most neural networks. However, there are many slight variations that exist. If you experience unusual results with Adam (e.g., not converging to an optimal solution), you may want to consider trying alternative optimisation routines.
Now that you have an idea for how some of these commonly used optimisation algorithms work, let's consider some common problems with optimisation. Because there are many!
We will start off with an aside by saying that the problems below specifically refer to optimisation of continuous variables. Optimisation of other kinds of variables (e.g., integer optimisation) has its own set of curses, including the ones listed below. For the purposes of this course we will focus on optimisation of continuous parameters, rather than discrete ones.
This is probably the curse that people overlook the most, to their peril! Let's assume that you only have five unknown parameters that you need to optimise. You're not yet plagued by the curse of dimensionality. When do you tell your optimisation procedure to stop optimising?
In practice optimisation will continue until some threshold of convergence is met, or the optimisation fails. Optimisation can fail, and it often does so silently. That means anytime you optimise something you need to check the outputs of that optimisation procedure to establish whether it met a convergence threshold, or if the process failed.
It's also not enough to simply say that "the optimisation converged", because you have to specify the threshold for what qualifies as convergence. Even if you don't explicitly set it, there will be some default value already set for you. The exact definition of convergence varies depending on the algorithm being used, but normally you could think of it as something like \[ |\Delta{}f(x)| < \delta \] where \(\delta\) is some small value that we have specified, and \(\Delta{}f(x)\) is the change in the objective function between two successive iterations. That makes sense. But what if we were to start with infinitesimally small steps in \(x\)? The objective function \(f(x)\) will barely change between two such steps, which means our optimisation procedure could formally meet the convergence criterion without actually having moved anywhere. For this reason it is a good idea to check that the optimised parameters do not exactly match your input parameters; otherwise your optimisation algorithm may not have actually done anything! It is also important to be aware of the step size between iterations, and the relative scaling between different parameters.
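As a hedged sketch of what this looks like in practice with scipy: inspect the result object, and compare the optimised parameters against your starting point.

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([-1.5, 2.0])
result = minimize(rosen, x0, method="Nelder-Mead",
                  options={"xatol": 1e-8, "fatol": 1e-8})   # explicit convergence thresholds

# Did the optimiser report success, and why did it stop?
print(result.success, result.message)

# Did the parameters actually move away from the starting point?
if np.allclose(result.x, x0):
    print("Warning: optimised parameters equal the initial guess; nothing happened.")
```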
See the example we started off with in this class. Even if you can intelligently move through parameter space, volume plays funny tricks on you in higher dimensions: as you move through the billions of parameters you have, you will still be plagued by the curse of dimensionality. As you go to higher dimensions you will start to feel the pain of optimisation: there are simply too many parameters to even try to map out where to step next, or to understand which parameters are most important.
Gridding is always the dumbest thing one can do (if you can afford it)
If you don't have an efficient optimisation algorithm then you may also want to fix certain parameters (to values that you think are reasonable, or good guesses) and then try to optimise the other parameters. You may even need to optimise one dimension at a time, leaving all other dimensions fixed to the best value found so far. This is not a fool-proof method, however: if the parameters are strongly correlated, optimising them one at a time can still leave you stuck far from the best solution!
The relative scaling between parameters can be very important. Imagine that you had two parameters that you were going to optimise. One varied between \(-1\) and \(+1\), and the other could vary from \(1\) to \(10^8\). That's seven orders of magnitude difference in the peak-to-peak ranges of those two parameters. Why is this important?
Imagine if the stopping criterion for your optimisation algorithm didn't just depend on the change in the objective function \(\delta\), but also on how much the parameters changed between two successive iterations. In that case it is going to really matter how much parameter 1 changes compared to parameter 2: if parameter 1 were to change by 0.001, the equivalent change in parameter 2 is about 100,000! Would you want your optimisation algorithm to stop when the difference between successive steps is 0.001 in both parameter 1 and parameter 2? Assuming the same scaling, by the time the step size in parameter 2 reaches 0.001, the step size in parameter 1 would be \(10^{-11}\)!
For these reasons, you should carefully consider the relative scalings between different parameters. You want the relative scalings between parameters to be about unity.
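One simple way to achieve this is to let the optimiser work in scaled units and undo the scaling inside the objective function. A minimal sketch, with an assumed toy objective whose two parameters differ in scale by a factor of \(10^8\):

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy objective with badly mismatched parameter scales:
# theta[0] is of order 1, theta[1] is of order 1e8.
def objective(theta):
    return (theta[0] - 0.5) ** 2 + ((theta[1] - 3e7) / 1e8) ** 2

scale = np.array([1.0, 1e8])             # rough peak-to-peak range of each parameter

def scaled_objective(u):
    return objective(u * scale)           # the optimiser sees parameters of order unity

u0 = np.array([0.0, 0.1])                 # initial guess in scaled units
result = minimize(scaled_objective, u0, method="Nelder-Mead")
theta_best = result.x * scale             # transform back to physical units
print(theta_best)
```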
If your objective function is convex, that's a good thing. Convex optimisation is a special subset of optimisation, and in short you could say that if your problem is convex then the parameters you find through optimisation are mathematically guaranteed to be the global minimum. Why is that special? Isn't optimisation always going to bring me to the global minimum?
The answer is obviously no. You may have an objective function that looks like a hilly landscape, with peaks and troughs spanning a high dimensional parameter space. For these purposes we will assume you are trying to maximise the function. If you try to optimise an objective function like that then most likely you will end up at a local maximum: a local hill in parameter space. And then your optimisation procedure will stop. But there may be nearby hills that are far taller than the one the optimisation has stopped at! Basin-hopping or simulated annealing can be good for these kinds of problems.
But if your problem is convex, you never have to worry about these kinds of problems: any local minimum is guaranteed to be the global minimum.
If your problem is convex then initialisation doesn't matter. If your problem is non-convex then initialisation really matters. Optimisation isn't guaranteed to give you the global minimum, even if your relative parameter scaling is good and even if you have a sensible convergence threshold. You can still find that it optimises to some local solution, even though a better solution exists elsewhere in parameter space.
Consider the function
\[
f(x, y) = x^2 + y^2 - a\exp{\left(-\frac{(x - 1)^2 + y^2}{c}\right)} - b\exp{\left(-\frac{(x + 1)^2 + y^2}{d}\right)}
\]
where you can see there are two minima: near \((1, 0)\) and \((-1, 0)\). Four common optimisation algorithms used in deep learning are indicated: stochastic gradient descent (SGD), momentum, RMSProp, and adaptive moment estimation (Adam).
What decides whether the optimisation algorithm will find a local minimum (there can be many) or the global one? The initialisation, and the algorithm itself. You should always try to initialise at a sensible value. If you don't know what that value is, you may want to initialise at many different locations, run multiple optimisation procedures, and only keep the best result. This is a very common thing to do in practice.
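A minimal sketch of that multi-start strategy, applied to the double-well function above with assumed values for the constants \(a\), \(b\), \(c\), and \(d\):

```python
import numpy as np
from scipy.optimize import minimize

# The double-well function above, with assumed constants (for illustration only).
a, b, c, d = 1.0, 2.0, 0.2, 0.2

def f(theta):
    x, y = theta
    return (x**2 + y**2
            - a * np.exp(-((x - 1)**2 + y**2) / c)
            - b * np.exp(-((x + 1)**2 + y**2) / d))

# Run the optimiser from many random starting points and keep only the best result.
rng = np.random.default_rng(42)
best = None
for _ in range(20):
    x0 = rng.uniform(-2, 2, size=2)
    result = minimize(f, x0, method="Nelder-Mead")
    if best is None or result.fun < best.fun:
        best = result

print(best.x, best.fun)   # with b > a here, the deeper (global) minimum is near (-1, 0)
```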
There are many options available when you need to optimise some objective function. The choice of algorithm, the hyperparameters that define the step size, the initialisation value, and the convergence thresholds are all options that demand your decision. You should do your best to choose the most appropriate tool for the job, and you should always be skeptical that optimisation has quietly failed!
In the end, optimisation only provides you with a point estimate of the minimum of the given objective function. But in some scenarios a point estimate is not enough! We sometimes want to draw samples from a distribution and use those samples to quote percentiles (e.g., the 5th, 50th, and 95th percentile). In the next class we will explore how to draw samples from a distribution using Markov-Chain Monte Carlo sampling.
The optimisation visualisations for Nelder Mead, Gradient Descent, Conjugate Gradient Descent, and Adam were developed by Ben Frederickson. See his excellent fmin javascript library.
The visualization of different optimization algorithms used in deep learning was created by Emilien Dupont (block aaf429).