Optimisation

In previous lectures we used scipy.optimize.minimize as a black box to find MAP estimates. We called it, it returned a set of best-fit parameters, and we moved on. This lecture opens the box. Understanding how optimisation algorithms work is essential because the choice of algorithm, learning rate, and starting point can mean the difference between a correct answer and a silently wrong one.

The problem is general: given an objective function $f(\boldsymbol{\theta})$, find the parameter vector $\boldsymbol{\theta}^*$ that minimises $f$. In our context, $f$ is almost always the negative log-posterior:

$$f(\boldsymbol{\theta}) = -\log p(\boldsymbol{\theta} \mid \text{data}) = -\log \mathcal{L}(\boldsymbol{\theta}) - \log p(\boldsymbol{\theta}) + \text{const}$$

Minimising $f$ is equivalent to maximising the posterior — the MAP estimate. The same algorithms apply to minimising $\chi^2$, minimising a loss function in machine learning, or any other scalar objective.


1. Why Not Just Grid Search?

The simplest conceivable optimisation strategy is to evaluate $f$ on a grid over all parameters and pick the point with the lowest value. With $N$ grid points per dimension and $D$ parameters, this requires $N^D$ evaluations.

Figure 1 shows why this fails immediately.


Figure 1: Grid search scaling with $N = 10$ grid points per dimension. At $D = 6$ parameters (a simple model by astronomical standards), the grid has $10^6$ points — feasible, but crude. At $D = 10$, it requires $10^{10}$ evaluations, beyond what a desktop can do. Real problems in spectral fitting, photometric redshifts, or stellar population synthesis routinely have $D = 10$–$50$ parameters.

To put this in perspective: a galaxy SED fitting code like Prospector has $\sim$14 free parameters. A grid with just 10 points per dimension would need $10^{14}$ evaluations — at 1 microsecond each, that is 3 years of continuous computation. Yet modern optimisers solve this in minutes. The rest of this lecture explains how.
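The arithmetic above is easy to check yourself. A short sketch, using the same illustrative assumption of 1 microsecond per evaluation:

```python
SECONDS_PER_YEAR = 3.156e7

N = 10                     # grid points per dimension
for D in (3, 6, 10, 14):   # number of parameters
    n_evals = N**D
    years = n_evals * 1e-6 / SECONDS_PER_YEAR   # at 1 microsecond per evaluation
    print(f"D = {D:2d}: {n_evals:.1e} evaluations, {years:.2g} years")
```

The $D = 14$ row reproduces the $\sim$3 years quoted for the Prospector-sized problem.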

ℹ️
The curse of dimensionality. The volume of parameter space grows exponentially with dimension. Grid search, which scales as $N^D$, is only viable for $D \lesssim 3$. Every method discussed below is designed to avoid this exponential scaling.

2. Gradient-Free Methods: Nelder-Mead

Not every objective function has an easily computable gradient. The function might be a black-box simulation, or the code that evaluates it might not be differentiable. In these cases, gradient-free (or derivative-free) methods are the only option.

2.1 The Downhill Simplex

The Nelder-Mead algorithm (also called the downhill simplex method) maintains a simplex — a geometric figure with $D + 1$ vertices in $D$-dimensional parameter space. In 2D, the simplex is a triangle; in 3D, a tetrahedron.

At each iteration, the algorithm evaluates $f$ at all vertices, identifies the worst vertex (highest $f$), and attempts to replace it with a better point using a sequence of geometric operations:

  1. Reflection: reflect the worst vertex through the centroid of the remaining vertices.
  2. Expansion: if the reflected point is the new best, try going further in the same direction.
  3. Contraction: if the reflected point is still bad, try a point between the worst vertex and the centroid.
  4. Shrink: if nothing works, shrink the entire simplex toward the best vertex.

The simplex deforms and crawls downhill through parameter space like an amoeba, adapting its shape to the local landscape.


Figure 2: Nelder-Mead iterates on a modified Rosenbrock function. The dashed purple triangles illustrate how the simplex deforms as it moves through parameter space. The algorithm requires no gradient information — only function evaluations at the vertices.

2.2 When to Use Nelder-Mead

Nelder-Mead is the method='Nelder-Mead' option in scipy.optimize.minimize. Use it when:

  • The objective function has no available gradient (e.g., it calls an external simulation).
  • The number of parameters is small ($D \lesssim 10$).
  • You need a quick-and-dirty first answer.

For problems with available gradients, gradient-based methods are far more efficient. Nelder-Mead makes no convergence guarantees in general and can stall on flat or noisy landscapes.

import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(theta):
    # Toy objective for illustration: a standard-Gaussian negative log-posterior.
    # Replace this with your own model's negative log-posterior.
    return 0.5 * np.sum(np.asarray(theta) ** 2)

result = minimize(neg_log_posterior, x0=[1.0, 2.0], method='Nelder-Mead')
print(result.x, result.fun, result.success)

3. First-Order Methods: Gradient Descent

3.1 The Update Rule

If $f$ is differentiable, we can compute the gradient $\nabla f(\boldsymbol{\theta})$, which points in the direction of steepest ascent. Moving in the opposite direction decreases $f$ most rapidly. This gives the gradient descent update rule:

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \alpha \nabla f(\boldsymbol{\theta}_k)$$

where $\alpha > 0$ is the learning rate (or step size).

3.2 Derivation from Taylor Expansion

The update rule follows from a first-order Taylor expansion. Near $\boldsymbol{\theta}_k$:

$$f(\boldsymbol{\theta}_k + \boldsymbol{\delta}) \approx f(\boldsymbol{\theta}_k) + \nabla f(\boldsymbol{\theta}_k)^\top \boldsymbol{\delta}$$

We want to choose $\boldsymbol{\delta}$ to decrease $f$ as much as possible. For a fixed step length $\|\boldsymbol{\delta}\| = \alpha$, the choice that minimises $\nabla f^\top \boldsymbol{\delta}$ is $\boldsymbol{\delta} = -\alpha\, \nabla f / \|\nabla f\|$, i.e., the negative gradient direction. Absorbing the normalisation into the learning rate recovers the update rule above.

3.3 Gradient Descent on a Quadratic

On a simple quadratic bowl, gradient descent converges smoothly to the minimum.


Figure 3: Gradient descent on $f(\theta_1, \theta_2) = \frac{1}{2}(\theta_1^2 + 3\theta_2^2)$. The red arrows show the negative gradient direction at selected points. The iterates converge steadily toward the minimum.
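The update rule is only a few lines of code. A minimal sketch on the same bowl as Figure 3 (the step size and starting point are chosen for illustration):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha, n_steps):
    """Plain gradient descent: theta <- theta - alpha * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
    return theta

# f(theta) = 0.5 * (theta1^2 + 3*theta2^2), so grad f = (theta1, 3*theta2)
grad = lambda t: np.array([t[0], 3.0 * t[1]])
theta_star = gradient_descent(grad, theta0=[2.0, 2.0], alpha=0.1, n_steps=200)
print(theta_star)   # converges to the minimum at (0, 0)
```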

3.4 The Learning Rate Problem

The learning rate $\alpha$ is the most important hyperparameter in gradient descent, and getting it wrong is catastrophic.


Figure 4: Effect of learning rate on gradient descent. Left: $\alpha$ too small — progress is agonisingly slow. Centre: good $\alpha$ — smooth convergence. Right: $\alpha$ too large — the iterates oscillate wildly and may diverge.

⚠️
There is no universally correct learning rate. The optimal $\alpha$ depends on the curvature of $f$, which varies across parameter space. This is the fundamental limitation of plain gradient descent, and it motivates the more sophisticated methods below.
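The three regimes in Figure 4 can be reproduced in one dimension. For $f(\theta) = \frac{1}{2} c\,\theta^2$, each update multiplies $\theta$ by $(1 - \alpha c)$, so the iterates diverge once $\alpha > 2/c$:

```python
c = 3.0   # curvature of f(theta) = 0.5 * c * theta**2
for alpha, label in [(0.01, 'too small'), (0.3, 'good'), (0.7, 'too large')]:
    theta = 1.0
    for _ in range(50):
        theta -= alpha * c * theta          # gradient descent step
    print(f"alpha = {alpha} ({label}): |theta| = {abs(theta):.3g}")
```

With $\alpha = 0.7$ the update factor is $1 - 2.1 = -1.1$, so $|\theta|$ grows by 10% per step.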

3.5 The Elongated Valley Problem

When the objective function has very different curvatures along different directions — as happens whenever parameters have different physical scales — gradient descent zigzags along the narrow direction and makes almost no progress along the broad direction.


Figure 5: Gradient descent on $f = \frac{1}{2}(\theta_1^2 + 50\theta_2^2)$, an elongated valley with a 50:1 aspect ratio. The gradient points mostly across the valley (the steep direction) rather than along it (the direction we need to go). The result is a zigzag path that wastes most of its iterations.

This pathology is not contrived. Any time your model has parameters on very different scales — say, a flux amplitude in janskys and a redshift — the negative log-posterior will have exactly this elongated structure.


4. Conjugate Gradient

The conjugate gradient method fixes the zigzagging problem of steepest descent by choosing search directions that are conjugate with respect to the Hessian matrix $\mathbf{H}$. Two directions $\mathbf{d}_i$ and $\mathbf{d}_j$ are conjugate if $\mathbf{d}_i^\top \mathbf{H} \mathbf{d}_j = 0$.

The key idea: once the algorithm has minimised along direction $\mathbf{d}_i$, it will never need to revisit that direction. Steepest descent violates this by constantly revisiting the same directions (the zigzag), but conjugate gradient does not.

For a $D$-dimensional quadratic, conjugate gradient converges in at most $D$ steps (in exact arithmetic). Steepest descent, by contrast, needs a number of iterations that grows with the condition number $\kappa$ (the ratio of the largest to smallest eigenvalue of the Hessian), which can be enormous for ill-conditioned problems.

In scipy.optimize, this is method='CG':

result = minimize(neg_log_posterior, x0, method='CG', jac=gradient_function)
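As a concrete check, here is CG on the same 50:1 valley from Figure 5 (the starting point is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

# Elongated quadratic with a 50:1 aspect ratio, as in Figure 5
f = lambda t: 0.5 * (t[0]**2 + 50.0 * t[1]**2)
grad = lambda t: np.array([t[0], 50.0 * t[1]])

result = minimize(f, x0=[5.0, 1.0], method='CG', jac=grad)
print(result.x, result.nit)   # reaches the minimum in very few iterations
```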

5. Second-Order Methods: Newton’s Method

5.1 The Quadratic Approximation

Gradient descent uses a first-order (linear) approximation to $f$. Newton’s method uses a second-order (quadratic) approximation:

$$f(\boldsymbol{\theta}_k + \boldsymbol{\delta}) \approx f(\boldsymbol{\theta}_k) + \nabla f_k^\top \boldsymbol{\delta} + \frac{1}{2} \boldsymbol{\delta}^\top \mathbf{H}_k \boldsymbol{\delta}$$

where $\mathbf{H}_k = \nabla^2 f(\boldsymbol{\theta}_k)$ is the Hessian matrix of second derivatives. Setting the gradient of this quadratic to zero:

$$\nabla f_k + \mathbf{H}_k \boldsymbol{\delta} = \mathbf{0} \quad \Rightarrow \quad \boldsymbol{\delta} = -\mathbf{H}_k^{-1} \nabla f_k$$

The Newton update is:

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \mathbf{H}_k^{-1} \nabla f(\boldsymbol{\theta}_k)$$

5.2 Why It Is Fast

For a pure quadratic function, Newton’s method converges in one step — the quadratic approximation is exact, so solving it gives the true minimum. For non-quadratic functions near the minimum (where the quadratic approximation is good), Newton’s method converges quadratically — the error squares at each iteration.
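The one-step claim is easy to verify on the elongated quadratic from Figure 5, whose Hessian is constant. A sketch (the linear solve replaces explicit inversion, which is both cheaper and more stable):

```python
import numpy as np

def newton_step(grad, hess, theta):
    """One Newton update: solve H delta = -grad f rather than inverting H."""
    return theta - np.linalg.solve(hess(theta), grad(theta))

# f = 0.5*(theta1^2 + 50*theta2^2): grad = (theta1, 50*theta2), H = diag(1, 50)
grad = lambda t: np.array([t[0], 50.0 * t[1]])
hess = lambda t: np.diag([1.0, 50.0])

theta = newton_step(grad, hess, np.array([5.0, 1.0]))
print(theta)   # lands exactly on the minimum (0, 0) in a single step
```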


Figure 6: Newton’s method (red squares) vs gradient descent (blue dots) on the same elongated quadratic. Newton’s method reaches the minimum in a single step because it uses curvature information from the Hessian to correct for the elongation. Gradient descent takes 80+ steps and still zigzags.

5.3 The Cost

Newton’s method has two major costs:

  1. Computing the Hessian. For $D$ parameters, the Hessian is a $D \times D$ matrix requiring $\mathcal{O}(D^2)$ second derivatives. For many problems, these are expensive or unavailable.

  2. Inverting the Hessian. Solving the linear system $\mathbf{H} \boldsymbol{\delta} = -\nabla f$ costs $\mathcal{O}(D^3)$ operations.

For $D \lesssim 10$, Newton’s method is practical and very fast. For $D \gtrsim 100$, the cost of the Hessian becomes prohibitive.


6. Quasi-Newton Methods: BFGS and L-BFGS-B

6.1 The Idea

Quasi-Newton methods approximate the Hessian (or its inverse) using only gradient evaluations, avoiding the $\mathcal{O}(D^2)$ cost of computing second derivatives. The most widely used is BFGS (Broyden-Fletcher-Goldfarb-Shanno).

BFGS maintains an approximation $\mathbf{B}_k \approx \mathbf{H}_k^{-1}$ and updates it at each step using the change in gradient between successive iterations. The update satisfies the secant condition: the approximate Hessian must be consistent with the observed change in gradient.
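In symbols: writing $\mathbf{s}_k = \boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k$ for the step and $\mathbf{y}_k = \nabla f_{k+1} - \nabla f_k$ for the change in gradient, the secant condition on the inverse-Hessian approximation reads

$$\mathbf{B}_{k+1} \mathbf{y}_k = \mathbf{s}_k$$

which is the multidimensional analogue of estimating a second derivative from the difference of two first derivatives.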

6.2 L-BFGS-B: The Practical Workhorse

For large problems, storing the full $D \times D$ matrix $\mathbf{B}_k$ is impractical. L-BFGS-B (limited-memory BFGS with box constraints) stores only the last $m$ gradient differences (typically $m = 10$) and reconstructs the search direction on the fly. It also supports bound constraints on individual parameters (e.g., requiring a variance to be positive).

result = minimize(
    neg_log_posterior, x0,
    method='L-BFGS-B',
    jac=gradient_function,      # supply the gradient for best performance
    bounds=[(0, None), (None, None)],  # e.g., first param >= 0
)
ℹ️
L-BFGS-B is the optimisation algorithm I use most commonly in my research. It combines the fast convergence of Newton-type methods with $\mathcal{O}(D)$ memory, handles bound constraints naturally, and works well for $D$ up to thousands. For most astronomy parameter estimation problems, this should be your first choice.

6.3 Providing Gradients

All gradient-based methods benefit from an analytically computed gradient rather than a finite-difference approximation. If you do not provide a jac argument, scipy.optimize will estimate the gradient by evaluating $f$ at $D$ additional points per gradient evaluation, multiplying the cost by roughly a factor of $D$ and introducing numerical noise.

For a negative log-posterior, the gradient is often straightforward to derive:

import numpy as np

def neg_log_post_and_grad(theta, x, y, y_err):
    """Return (f, grad_f) for use with jac=True."""
    m, b = theta
    model = m * x + b
    residual = y - model
    w = 1.0 / y_err**2

    f = 0.5 * np.sum(w * residual**2)           # negative log-likelihood (up to const)
    dfdm = -np.sum(w * residual * x)
    dfdb = -np.sum(w * residual)

    return f, np.array([dfdm, dfdb])

result = minimize(neg_log_post_and_grad, x0=[1.0, 200.0],
                  args=(x, y, y_err), method='L-BFGS-B', jac=True)

7. Stochastic Methods and Adam

When the dataset is very large (millions of data points), evaluating the full likelihood at every step is expensive. Stochastic gradient descent (SGD) approximates the gradient using a random subset (mini-batch) of the data:

$$\nabla f \approx \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla f_i$$

This introduces noise into the gradient estimate, but the expected direction is still correct, and each step is much cheaper.

Adam (Adaptive Moment Estimation) extends SGD by maintaining per-parameter adaptive learning rates using exponential moving averages of the gradient and its square:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates. Adam effectively gives each parameter its own learning rate, adapting to the local curvature.
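The two moment updates translate directly into code. A minimal, deterministic sketch (hyperparameter defaults follow Kingma & Ba; the elongated-quadratic test function is illustrative, not part of the algorithm):

```python
import numpy as np

def adam(grad, theta0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=2000):
    """Minimal Adam sketch with a deterministic gradient (no mini-batching)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradient)
        v = beta2 * v + (1 - beta2) * g**2       # second moment (uncentred variance)
        m_hat = m / (1 - beta1**t)               # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# On the 50:1 elongated quadratic, the per-parameter scaling absorbs the
# mismatched curvatures that cripple plain gradient descent
theta = adam(lambda t: np.array([t[0], 50.0 * t[1]]), [5.0, 1.0])
print(theta)
```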

ℹ️
When to use Adam vs L-BFGS-B. Adam dominates in machine learning (neural networks, large datasets, stochastic objectives). For the kind of problems common in astronomical data analysis — moderate $D$, exact gradients, the full dataset fits in memory — L-BFGS-B is almost always faster and more reliable. Use Adam when the dataset is too large to evaluate the full gradient at each step.

8. Practical Problems: The Curses of Optimisation

8.1 The Curse of Silent Failure

Optimisers can fail without raising an error. The result object from scipy.optimize.minimize contains a success flag and a message — always check them.

result = minimize(neg_log_posterior, x0, method='L-BFGS-B')

if not result.success:
    print(f"WARNING: optimisation failed! {result.message}")
else:
    print(f"Converged to f = {result.fun:.4f} at theta = {result.x}")
⚠️
Never trust an optimisation result without checking convergence. A common failure mode: the algorithm hits maxiter and stops at whatever point it reached. The result looks plausible but is not a minimum. Check result.success, result.message, and verify that the gradient norm at the solution is small.
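A quick way to see this failure mode is to cap the iteration count deliberately (rosen and rosen_der are scipy's built-in Rosenbrock function and its gradient):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Stop after 2 iterations: the result contains plausible-looking numbers,
# but it is not a minimum, and only the flags reveal that
result = minimize(rosen, x0=[-1.5, 2.0], method='L-BFGS-B', jac=rosen_der,
                  options={'maxiter': 2})
print(result.success, result.message)
print(np.linalg.norm(result.jac))   # gradient norm at the "solution" is far from zero
```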

8.2 The Curse of Scaling

When parameters live on very different scales, the contours of $f$ are extremely elongated, and gradient-based methods struggle (as we saw in Figure 5).


Figure 7: Left: objective function with parameters on very different scales (e.g., flux in Jy vs redshift). The contours are elongated by a factor of $\sim$1000:1, and gradient descent zigzags. Right: after reparameterising to standardised coordinates, the contours are roughly circular and gradient descent converges in a few steps.

The fix: reparameterise. Before optimising, transform each parameter so that a unit change in the transformed parameter corresponds to a “typical” change in the objective function:

# Bad: parameters on different scales
theta = [flux_jy, redshift]     # flux ~ 1e-3, redshift ~ 1.0

# Better: standardise
theta_scaled = [flux_jy / flux_scale, redshift / z_scale]

Alternatively, pass the options={'disp': True} flag to see whether the algorithm is making progress, or use L-BFGS-B which partially adapts to different scales through its Hessian approximation.

8.3 The Curse of Non-Convexity

For a convex function, any local minimum is a global minimum, so a local optimiser is guaranteed to find the best answer. Most real objective functions are not convex and have multiple local minima. An optimiser that starts near one minimum will converge to it, never knowing that a deeper minimum exists elsewhere.


Figure 8: A non-convex objective function with two local minima. Start A (blue) converges to the global minimum near the origin. Start B (orange) converges to the local minimum near $(2.5, 1.5)$. Neither run is aware of the other minimum.

Strategies for non-convex problems:

  1. Multi-start: run the optimiser from many different starting points and keep the best result.
  2. Basin-hopping: scipy.optimize.basinhopping combines local optimisation with random perturbations to escape local minima.
  3. Physical intuition: use your understanding of the problem to choose a starting point near the global minimum.
  4. Simulated annealing: scipy.optimize.dual_annealing for particularly rugged landscapes.
from scipy.optimize import basinhopping

result = basinhopping(neg_log_posterior, x0, minimizer_kwargs={'method': 'L-BFGS-B'},
                      niter=100, seed=42)

8.4 The Curse of Initialisation

The starting point $\boldsymbol{\theta}_0$ matters more than most people realise. For non-convex problems, it determines which minimum you find. Even for convex problems, a poor starting point means more iterations.

Best practice: use a physically motivated initial guess. For a spectral line fit, start near the observed peak. For a light curve model, start near the period found by a Lomb-Scargle periodogram. For a linear model, start at the ordinary least-squares solution.
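For the linear case, the least-squares starting point costs one call to np.linalg.lstsq. A sketch with synthetic straight-line data (true slope 2, intercept 1):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)   # noisy line

# Ordinary least squares: design matrix [x, 1], solve for (m, b)
A = np.vstack([x, np.ones_like(x)]).T
theta0, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta0)   # near (2, 1); a good x0 for any nonlinear refinement
```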


9. Which Optimiser Should I Use?

For most problems in astronomical data analysis, the decision is straightforward:

| Situation | Recommended method | scipy call |
| --- | --- | --- |
| Small $D$, no gradient available | Nelder-Mead | method='Nelder-Mead' |
| Moderate $D$, gradient available | L-BFGS-B | method='L-BFGS-B' |
| Moderate $D$, bound constraints | L-BFGS-B | method='L-BFGS-B', bounds=... |
| Non-convex, multiple local minima | Basin-hopping | basinhopping(..., minimizer_kwargs=...) |
| Very large dataset, stochastic | Adam (via PyTorch/JAX) | Not in scipy |
| Quadratic / linear model | Normal equations | np.linalg.lstsq |

The convergence comparison in Figure 9 shows the practical differences.


Figure 9: Convergence of five algorithms on the Rosenbrock function $f(x,y) = (1-x)^2 + 100(y-x^2)^2$, starting from $(-1.5, 2.0)$. BFGS and L-BFGS-B converge rapidly by approximating the Hessian. Conjugate gradient is somewhat slower. Nelder-Mead (no gradient) converges but takes many more iterations. Plain gradient descent with a fixed learning rate makes slow progress.

ℹ️
Default recommendation. Unless you have a specific reason not to, start with L-BFGS-B. If the gradient is not available, use Nelder-Mead. If the problem is non-convex, wrap L-BFGS-B inside basinhopping. If the problem is linear, use the normal equations from Lecture 7.

10. Summary

Optimisation is how we find best-fit parameters. Given an objective function $f(\boldsymbol{\theta})$ — typically the negative log-posterior — we seek $\boldsymbol{\theta}^*$ that minimises $f$.

Grid search fails exponentially. With $D$ parameters and $N$ grid points per dimension, the cost is $N^D$. For $D > 5$, this is infeasible.

Gradient-free methods like Nelder-Mead work without derivatives but are slow and unreliable for high dimensions. Use them only when no gradient is available.

Gradient descent moves in the direction of steepest descent: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \alpha \nabla f_k$. It suffers from learning rate sensitivity and zigzagging in elongated valleys.

Newton’s method uses the Hessian for a quadratic approximation: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \mathbf{H}^{-1} \nabla f_k$. It converges quadratically near the minimum but is expensive for large $D$.

Quasi-Newton methods (BFGS, L-BFGS-B) approximate the Hessian from gradient information alone. L-BFGS-B uses $\mathcal{O}(D)$ memory and is the workhorse for most problems.

Practical curses will bite you: silent convergence failures (check result.success), parameter scaling (reparameterise), non-convexity (multi-start or basin-hopping), and initialisation (use physical intuition).

Optimisation gives a point estimate. It finds $\boldsymbol{\theta}^*$ but does not tell you how uncertain that estimate is. For uncertainty quantification, you need the full posterior distribution — which is the subject of the next lecture on MCMC.


Further Reading

  • Nocedal, J. & Wright, S. J. (2006). Numerical Optimization, 2nd ed. Springer. [The definitive reference — Chapter 3 (line search), Chapter 6 (quasi-Newton), Chapter 7 (conjugate gradient)]
  • Press, W. H., et al. (2007). Numerical Recipes, Chapter 10. [Practical algorithms with code, including Nelder-Mead and conjugate gradient]
  • Scipy documentation: scipy.optimize.minimize. [Lists all available methods with their requirements and options]
  • Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. [The Adam paper — short and widely cited]