Gaussian Processes I

From Parametric Models to Infinite-Dimensional Priors

Every model we have built so far has been parametric: you specify a functional form — a line, a polynomial, a mixture of Gaussians — and the data determine the values of a finite set of parameters. The functional form itself is fixed by assumption. Gaussian processes (GPs) remove this constraint. A GP places a prior probability over functions, not over parameters, and lets the data decide which functions are plausible. The result is a flexible, probabilistic model that provides calibrated uncertainty estimates everywhere — including in regions far from any observation.

This lecture builds up the GP framework from first principles. We start from the familiar territory of linear models (lecture 7), show how marginalising over weights leads naturally to the kernel trick and the GP, then develop the function-space view that provides the geometric intuition. We introduce kernels, draw samples from the prior, and derive the conditioning equations that produce predictions given data. The next lecture covers practical implementation: hyperparameter optimisation, computational costs, and real astronomical applications.

Regression is used to find a function that represents a set of data points as closely as possible. A Gaussian process lets you do this without committing to a specific functional form — click the data points above to see how. Interactive visualisation from Görtler, Kehlbeck & Deussen (2019).

1. Why Go Non-Parametric?

1.1 The Limits of Parametric Models

In lectures 5 and 7, we fitted straight lines and polynomial models to data. These are powerful when the true underlying relationship really is (approximately) a polynomial. But consider three situations where they break down:

  1. Unknown functional form. A stellar light curve during a flare has no simple closed-form expression. You could try a sum of exponentials, but how many? With what constraints?

  2. Nuisance signals. Stellar activity imprints quasi-periodic, slowly-varying signals on radial velocity data. You want to model this correlated noise to remove it — but writing down a parametric model for “stuff the star does” is, at best, fragile.

  3. Interpolation with honest uncertainties. You observe a supernova at irregular cadence and need to interpolate the light curve to estimate peak brightness. A polynomial can interpolate, but its uncertainty estimates away from the data are controlled by the polynomial degree, not by the data themselves.

In each case, the problem is the same: a parametric model forces you to commit to a functional form before seeing the data. If that form is wrong, the uncertainties you derive are wrong too — not just in magnitude, but in kind.

1.2 The Non-Parametric Alternative

A Gaussian process sidesteps the functional-form problem entirely. Instead of choosing $f(x) = mx + b$ or $f(x) = \sum_k w_k \phi_k(x)$ and fitting coefficients, a GP defines a probability distribution over functions. The “prior” is not over a parameter vector $\boldsymbol{\theta}$ but over the space of all smooth functions $f: \mathbb{R} \to \mathbb{R}$ (subject to regularity conditions encoded by the kernel).

ℹ️
“Non-parametric” is a misnomer. A GP does have parameters — the kernel hyperparameters that control smoothness, amplitude, and correlation length. The term means that the number of effective parameters grows with the data, rather than being fixed in advance. A GP with $N$ training points uses an $N \times N$ covariance matrix, so the model complexity adapts to the dataset size.

2. The Weight-Space View

We arrive at Gaussian processes by starting from something you already know: a parametric linear model with a prior on its weights. The path is: linear model → Bayesian linear model → marginalise over weights → kernel trick → Gaussian process.

2.1 From Linear Models to Feature Spaces

In lecture 7 we wrote the linear model as:

$$f(\mathbf{x}) = \mathbf{A}\mathbf{w}$$

where $\mathbf{A}$ is the $N \times M$ design matrix of basis functions and $\mathbf{w}$ is the $M$-dimensional weight vector. The basis functions could be polynomials ($1, x, x^2, \ldots$), Fourier modes, radial basis functions, or anything else.

Place a Gaussian prior on the weights: $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Lambda})$, where $\boldsymbol{\Lambda}$ is the prior covariance. The log-probability of the data is:

$$\log p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}, \mathbf{w}) = -\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu}(\boldsymbol{\theta}) - \mathbf{A}\mathbf{w})^\top \mathbf{C}^{-1} (\mathbf{y} - \boldsymbol{\mu}(\boldsymbol{\theta}) - \mathbf{A}\mathbf{w}) - \frac{1}{2}\log|2\pi\mathbf{C}|$$

where $\mathbf{C}$ is the data covariance and $\boldsymbol{\mu}(\boldsymbol{\theta})$ is a mean model with parameters $\boldsymbol{\theta}$.

2.2 Marginalising Over the Weights

We do not want to infer $\mathbf{w}$ — it is a nuisance parameter. Marginalise it out:

$$p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}) = \int p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}, \mathbf{w}) p(\mathbf{w}) d\mathbf{w}$$

Both the likelihood and the prior are Gaussian, so the integral is analytic (a Gaussian integrated against a Gaussian is a Gaussian — the same trick from lecture 5, Section 9). The result is:

$$p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}) = \mathcal{N}\big(\mathbf{y} \,\big|\, \boldsymbol{\mu}(\boldsymbol{\theta}),\; \mathbf{C} + \mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top\big)$$

The key quantity is the effective covariance:

$$\mathbf{K}_{\text{eff}} = \mathbf{C} + \mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top$$

The matrix $\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top$ is an $N \times N$ matrix whose $(i,j)$ entry is:

$$[\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top]_{ij} = \sum_{m=1}^{M} \sum_{m'=1}^{M} A_{im} \Lambda_{mm'} A_{jm'} = \boldsymbol{\phi}(x_i)^\top \boldsymbol{\Lambda} \boldsymbol{\phi}(x_j)$$

where $\boldsymbol{\phi}(x_i)$ is the vector of basis functions evaluated at $x_i$.

2.3 The Kernel Trick

Here is the insight that turns a finite-dimensional linear model into a Gaussian process. The entry $[\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top]_{ij}$ depends on the inputs $x_i$ and $x_j$ only through the inner product $\boldsymbol{\phi}(x_i)^\top \boldsymbol{\Lambda} \boldsymbol{\phi}(x_j)$. Define:

$$k(x_i, x_j) \equiv \boldsymbol{\phi}(x_i)^\top \boldsymbol{\Lambda} \boldsymbol{\phi}(x_j)$$

This is a kernel function — it takes two inputs and returns a scalar measuring their “similarity” in feature space. Now ask: what happens as $M \to \infty$? As the number of basis functions grows, you cannot explicitly construct $\mathbf{A}$ or $\boldsymbol{\Lambda}$. But you never need to — you only need the kernel $k(x_i, x_j)$, which replaces the inner product of two infinite-dimensional feature vectors with a single function evaluation.

$$[\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top]_{ij} \to k(x_i, x_j)$$

This is the kernel trick: replace an inner product of (potentially infinite-dimensional) feature vectors with a kernel function. The GP is the infinite-basis-function limit of a Bayesian linear model.

ℹ️
Not every function is a valid kernel. The matrix $\mathbf{K}$ with entries $K_{ij} = k(x_i, x_j)$ must be positive semi-definite for every possible set of inputs. Any function that can be written as an inner product in some feature space satisfies this automatically, and Mercer's theorem guarantees the converse: every positive semi-definite kernel corresponds to an inner product in some (possibly infinite-dimensional) feature space. The squared exponential, Matern, and periodic kernels all satisfy this condition.
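Positive semi-definiteness is easy to check numerically for any given set of inputs. A minimal sketch for the squared exponential kernel (the input values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)

# Squared exponential kernel matrix on a random set of inputs
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)

# A valid kernel matrix is positive semi-definite: no genuinely negative
# eigenvalues (tiny negatives can appear from floating-point rounding)
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())
```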

2.4 A Concrete Example: From Polynomials to Kernels

To make the kernel trick tangible, consider a quadratic basis: $\boldsymbol{\phi}(x) = (1, x, x^2)^\top$ with prior $\boldsymbol{\Lambda} = \text{diag}(\sigma_0^2, \sigma_1^2, \sigma_2^2)$. The kernel is:

$$k(x, x') = \sigma_0^2 + \sigma_1^2 x x' + \sigma_2^2 x^2 x'^2$$

This is just the covariance between $f(x)$ and $f(x')$ that arises from the weight prior. For three basis functions, you could write out $\mathbf{A}$ and compute $\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top$ explicitly. For 1000 RBF basis functions, you still could. For infinitely many? You cannot — but the kernel $k(x, x')$ still exists and can be evaluated.

import numpy as np

def polynomial_kernel(x1, x2, sigmas=(1.0, 1.0, 0.5)):
    """Kernel from a quadratic basis with diagonal weight prior."""
    s0, s1, s2 = sigmas
    return s0**2 + s1**2 * np.outer(x1, x2) + s2**2 * np.outer(x1**2, x2**2)

# Verify: compare kernel matrix to A @ Lambda @ A.T
x = np.array([0.5, 1.0, 2.0, 3.0])
A = np.column_stack([np.ones_like(x), x, x**2])
Lambda = np.diag([1.0, 1.0, 0.25])

K_from_basis = A @ Lambda @ A.T
K_from_kernel = polynomial_kernel(x, x)
print("Max difference:", np.max(np.abs(K_from_basis - K_from_kernel)))
# effectively zero: identical up to floating-point rounding

2.5 The Marginalised Log-Probability

After the kernel trick, the marginalised probability becomes:

$$\log p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}) = -\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top (\mathbf{C} + \mathbf{K})^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \frac{1}{2}\log|\mathbf{C} + \mathbf{K}| - \frac{N}{2}\log 2\pi$$

where $\mathbf{K}$ is the $N \times N$ kernel matrix with entries $K_{ij} = k(x_i, x_j)$. This expression has three terms:

  1. Data fit: $(\mathbf{y} - \boldsymbol{\mu})^\top (\mathbf{C} + \mathbf{K})^{-1} (\mathbf{y} - \boldsymbol{\mu})$ — how well the model fits the data, penalised by the total covariance.
  2. Complexity penalty: $\log|\mathbf{C} + \mathbf{K}|$ — penalises complex models that require a large covariance to fit the data. This is the built-in Occam’s razor.
  3. Normalisation: $\frac{N}{2}\log 2\pi$ — a constant.
ℹ️
You have already seen this expression. The marginalised log-probability is exactly the log-evidence (marginal likelihood) that appears in Bayesian model comparison. The complexity penalty term is the Occam factor — it automatically penalises models that are more complex than the data require. We will return to this in lecture 15.

3. The Function-Space View

The weight-space derivation shows where GPs come from. The function-space view provides the intuition for how they work.

3.1 A GP Is a Distribution Over Functions

The defining statement is:

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Concretely, suppose we have inputs $x_1, x_2, \ldots, x_N$. A GP says that the function values $(f(x_1), f(x_2), \ldots, f(x_N))$ are jointly Gaussian:

$$\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_N) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_N) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{bmatrix} \right)$$

The GP is fully specified by two things:

  • A mean function $m(x) = \mathbb{E}[f(x)]$, which is often set to zero (or to a simple parametric trend subtracted before fitting).
  • A covariance function (or kernel) $k(x, x') = \text{Cov}[f(x), f(x')]$, which encodes the prior belief about how function values at different inputs are correlated.

We write this compactly as:

$$f \sim \mathcal{GP}\big(m(x), k(x, x')\big)$$

3.2 Intuition: Each Data Point Is a Dimension

Here is the key mental model. Think of $f(x_1)$ and $f(x_2)$ as two random variables — two axes of a multivariate Gaussian. If $x_1$ and $x_2$ are close together, the kernel $k(x_1, x_2)$ is large, meaning $f(x_1)$ and $f(x_2)$ are strongly correlated: knowing one tightly constrains the other. If $x_1$ and $x_2$ are far apart, the kernel is small, meaning the function values are nearly independent.

This is exactly how a multivariate Gaussian works — but instead of 2 or 10 dimensions, you have as many dimensions as you have input points. And the GP definition guarantees that this works for any finite collection, no matter how large.
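A quick simulation of this two-point picture (the kernel and input values are illustrative):

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    """Prior correlation between f(x1) and f(x2) under a unit-amplitude SE kernel."""
    return np.exp(-0.5 * (x1 - x2)**2 / ell**2)

rng = np.random.default_rng(0)
for x1, x2 in [(0.0, 0.1), (0.0, 5.0)]:          # close pair, then far pair
    K = np.array([[1.0, rbf(x1, x2)],
                  [rbf(x1, x2), 1.0]])
    samples = rng.multivariate_normal([0.0, 0.0], K, size=20000)
    corr = np.corrcoef(samples.T)[0, 1]
    print(f"x1={x1}, x2={x2}: kernel value {rbf(x1, x2):.3f}, sample correlation {corr:.3f}")
```

For the close pair the sampled values track each other almost perfectly; for the far pair they are essentially independent.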

3.3 Interactive Exploration: The Two-Dimensional Case

To build intuition, start with just two input points $x_1$ and $x_2$. The function values $f(x_1)$ and $f(x_2)$ form a bivariate Gaussian whose correlation is set by $k(x_1, x_2)$. When $x_1$ and $x_2$ are close, the correlation is high: dragging $f(x_1)$ up pulls $f(x_2)$ up with it. When they are far apart, the correlation vanishes and the values move independently.

The interactive visualisation below (from Görtler, Kehlbeck & Deussen, 2019) lets you explore this directly. On the left you see the bivariate Gaussian distribution; on the right, those two values are plotted as function values at their respective $x$ positions. Drag the points and watch how the covariance structure constrains the function.

ℹ️
Spend time with this visualisation. The entire GP framework is a generalisation of what you see here from 2 dimensions to $N$ dimensions. If you understand how the bivariate Gaussian constrains two function values through their covariance, you understand how a GP constrains an entire function through its kernel.

3.4 Observations Restrict the Set of Possible Functions

Before seeing data, the GP prior assigns probability to every smooth function consistent with the kernel. The prior is broad: many different functions are plausible. When you observe $y_1 = f(x_1) + \varepsilon_1$, you are conditioning the joint distribution on the observed value at $x_1$. This eliminates all functions that do not pass near $y_1$ at $x_1$. Each additional observation further restricts the set of plausible functions.

This is the function-space analogue of Bayes’ theorem. The prior over functions (encoded by the kernel) is updated by the likelihood of the data to produce a posterior over functions. The posterior mean is the “best prediction,” and the posterior variance quantifies the remaining uncertainty.

The interactive visualisation below lets you draw your own GP prior samples. Adjust the kernel hyperparameters and watch how the character of the sampled functions changes — each draw is one function from the GP prior.


4. Kernels: Encoding Prior Beliefs

The kernel $k(x, x')$ is the heart of a Gaussian process. It determines the properties of the functions the GP considers plausible: their smoothness, their amplitude, their characteristic length scales, and whether they are periodic.

4.1 The Squared Exponential (RBF) Kernel

The most common kernel is the squared exponential (also called the radial basis function or RBF kernel):

$$k_{\text{SE}}(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$$

It has two hyperparameters:

  • $\sigma_f^2$: the signal variance — controls the overall amplitude of the function. Larger $\sigma_f^2$ means the GP considers functions with larger excursions from the mean.
  • $\ell$: the length scale — controls the “wiggliness.” Nearby points (within a distance $\ell$) are strongly correlated; distant points are nearly independent.

Properties: Functions drawn from a GP with a squared exponential kernel are infinitely differentiable — they are extremely smooth. This is appropriate for many physical processes but can be too smooth for some applications (e.g., rough terrain, turbulence).

4.2 The Effect of the Length Scale

The length scale $\ell$ is arguably the single most important hyperparameter. It controls how quickly the correlation between function values decays with distance:

  • Short $\ell$: functions vary rapidly, the GP can capture fine structure, but predictions far from data revert quickly to the prior mean.
  • Long $\ell$: functions are smooth and slowly varying, the GP interpolates broadly between data points, but cannot capture sharp features.
⚠️
The length scale is measured in the units of $x$. If your inputs are in years, $\ell = 1.0$ means the function is correlated over about one year. If your inputs are in nanometres, the same $\ell = 1.0$ means correlation over one nanometre. Always think about what length scale is physically reasonable for your problem before fitting.
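To make the decay concrete: under the squared exponential kernel, the correlation between function values separated by a distance $r$ is $\exp(-r^2/2\ell^2)$. A small sketch (the values of $\ell$ are illustrative):

```python
import numpy as np

# Correlation between f(x) and f(x + r) under the squared exponential kernel
r = 1.0  # separation, in the units of x
for ell in [0.3, 1.0, 3.0]:
    corr = np.exp(-0.5 * r**2 / ell**2)
    print(f"ell = {ell}: correlation at separation {r} is {corr:.4f}")
# short length scales decorrelate quickly; long ones keep distant points tied together
```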

4.3 The Matern Kernel

The Matern family generalises the squared exponential by introducing a smoothness parameter $\nu$:

$$k_{\text{Matern}}(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu} r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu} r}{\ell}\right)$$

where $r = |x - x’|$ and $K_\nu$ is the modified Bessel function of the second kind. The parameter $\nu$ controls differentiability:

  • $\nu = 1/2$: Matern-1/2 — produces functions that are continuous but not differentiable (the Ornstein–Uhlenbeck process, whose paths look locally like Brownian motion). Equivalent to the exponential kernel $k(r) = \sigma_f^2 \exp(-r/\ell)$.
  • $\nu = 3/2$: Matern-3/2 — once differentiable. A good default for many physical processes.
  • $\nu = 5/2$: Matern-5/2 — twice differentiable. Smooth but not infinitely so.
  • $\nu \to \infty$: recovers the squared exponential.
ℹ️
Matern-3/2 and Matern-5/2 are the workhorses of applied GP modelling. The squared exponential is popular in textbooks but its infinite smoothness often leads to overconfident interpolation — the posterior variance drops too quickly between data points. Matern kernels are more realistic for most physical processes.
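As a consistency check, the general Matern formula with $\nu = 3/2$ should reduce to the closed form $\sigma_f^2 (1 + \sqrt{3}r/\ell)\exp(-\sqrt{3}r/\ell)$. A sketch of the comparison using scipy's modified Bessel function:

```python
import numpy as np
from scipy.special import gamma, kv

def matern_general(r, nu, ell=1.0, sigma_f=1.0):
    """General Matern kernel via the modified Bessel function K_nu (valid for r > 0)."""
    z = np.sqrt(2 * nu) * r / ell
    return sigma_f**2 * 2**(1 - nu) / gamma(nu) * z**nu * kv(nu, z)

def matern32(r, ell=1.0, sigma_f=1.0):
    """Closed-form Matern-3/2: (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell)."""
    z = np.sqrt(3) * r / ell
    return sigma_f**2 * (1 + z) * np.exp(-z)

r = np.linspace(0.01, 5, 200)   # avoid r = 0, where the Bessel form is 0/0
print("max |difference|:", np.max(np.abs(matern_general(r, nu=1.5) - matern32(r))))
# agrees to floating-point precision
```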

4.4 The Periodic Kernel

For data with a known periodicity (stellar rotation, orbital motion, seasonal variation):

$$k_{\text{Per}}(x, x') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi |x - x'| / P)}{\ell^2}\right)$$

where $P$ is the period. This kernel produces functions that are exactly periodic with period $P$, with the length scale $\ell$ controlling the smoothness within each period.
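Exact periodicity means $k(x, x' + nP) = k(x, x')$ for any integer $n$, a property worth verifying numerically (parameter values are illustrative):

```python
import numpy as np

def periodic_kernel(x1, x2, P=2.0, ell=0.5, sigma_f=1.0):
    """Periodic kernel; exactly repeats with period P."""
    return sigma_f**2 * np.exp(-2 * np.sin(np.pi * np.abs(x1 - x2) / P)**2 / ell**2)

x, xp = 0.3, 1.7
k0 = periodic_kernel(x, xp)
k1 = periodic_kernel(x, xp + 2.0)    # shift by one period (P = 2)
k5 = periodic_kernel(x, xp + 10.0)   # shift by five periods
print(k0, k1, k5)                    # equal up to rounding
```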

4.5 Kernel Summary

  • Squared Exponential (RBF): $k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$, with hyperparameters $\sigma_f^2$ (amplitude) and $\ell$ (length scale)
  • Matern-3/2: $k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{3} r}{\ell}\right) \exp\left(-\frac{\sqrt{3} r}{\ell}\right)$, with hyperparameters $\sigma_f^2$ and $\ell$
  • Matern-5/2: $k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5} r}{\ell} + \frac{5 r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5} r}{\ell}\right)$, with hyperparameters $\sigma_f^2$ and $\ell$
  • Periodic: $k(x, x') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi |x - x'| / P)}{\ell^2}\right)$, with hyperparameters $\sigma_f^2$, $\ell$, and $P$ (period)
  • Linear: $k(x, x') = \sigma_f^2 (x - c)(x' - c)$, with hyperparameters $\sigma_f^2$ and $c$ (hinge point)

The interactive figure below lets you compare prior samples from these kernels. Adjust the hyperparameter sliders and watch how the character of the sampled functions changes.

4.6 Stationary vs Non-Stationary Kernels

A kernel is stationary if it depends only on the difference $r = x - x'$, not on the absolute positions $x$ and $x'$. The squared exponential, Matern, and periodic kernels are all stationary; the linear kernel is not, since $(x - c)(x' - c)$ depends on the positions themselves. Stationary kernels assume the function has the same statistical properties everywhere — the same smoothness, the same amplitude, the same correlation length.

Non-stationary kernels — where $k(x, x')$ depends on the positions themselves — allow different behaviour in different regions. These are more flexible but harder to specify and fit.

4.7 Combining Kernels

Individual kernels can be combined to express richer structure:

  • Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$ — the GP models the sum of two independent processes. Use this to capture a smooth long-term trend plus short-term oscillations.

  • Product: $k(x, x') = k_1(x, x') \times k_2(x, x')$ — the correlation must be high in both components simultaneously. Use this to create locally periodic signals that decay over time (a quasi-periodic kernel is $k_{\text{Per}} \times k_{\text{SE}}$).
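A minimal sketch of the quasi-periodic product described above (hyperparameter values are illustrative):

```python
import numpy as np

def quasi_periodic(x1, x2, P=1.0, ell_p=0.5, ell_decay=3.0, sigma_f=1.0):
    """Quasi-periodic kernel: periodic kernel multiplied by an SE envelope."""
    r = np.abs(x1 - x2)
    k_per = np.exp(-2 * np.sin(np.pi * r / P)**2 / ell_p**2)
    k_se = np.exp(-0.5 * r**2 / ell_decay**2)
    return sigma_f**2 * k_per * k_se

# At whole-period separations the periodic factor is 1,
# so the correlation decays with the SE envelope alone
for n in [0, 1, 3, 10]:
    print(f"separation of {n} periods: k = {quasi_periodic(0.0, float(n)):.4f}")
```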

Explore kernel combinations interactively below:


5. Samples from the Prior

Drawing samples from a GP prior is straightforward and provides a powerful check on whether your kernel encodes sensible prior beliefs.

5.1 How to Sample

Given $N$ test inputs $\mathbf{x}_* = (x^*_1, x^*_2, \ldots, x^*_N)$:

  1. Compute the $N \times N$ covariance matrix $\mathbf{K}_{**}$ with entries $[K_{**}]_{ij} = k(x^*_i, x^*_j)$.
  2. Compute the Cholesky decomposition $\mathbf{K}_{**} = \mathbf{L}\mathbf{L}^\top$ (this is the matrix “square root”).
  3. Draw a vector of $N$ independent standard normal samples: $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
  4. Compute $\mathbf{f}_* = m(\mathbf{x}_*) + \mathbf{L}\mathbf{u}$.

The resulting $\mathbf{f}_*$ is a draw from the GP prior. Repeating steps 3–4 gives multiple prior samples.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import cholesky

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared exponential (RBF) kernel."""
    sqdist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'sqeuclidean')
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

# Test inputs
x_star = np.linspace(0, 10, 200)

# Covariance matrix from kernel
K = rbf_kernel(x_star, x_star, length_scale=1.0, variance=1.0)
K += 1e-8 * np.eye(len(x_star))  # numerical stability (jitter)

# Cholesky decomposition
L = cholesky(K, lower=True)

# Draw 5 samples from the GP prior
n_samples = 5
u = np.random.randn(len(x_star), n_samples)
f_samples = L @ u  # each column is one function draw

5.2 Why the Cholesky Decomposition?

The Cholesky decomposition plays the same role here as it does in sampling from any multivariate Gaussian. Recall from lecture 2 that if $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then $\mathbf{L}\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{L}\mathbf{L}^\top) = \mathcal{N}(\mathbf{0}, \mathbf{K})$. The Cholesky factor $\mathbf{L}$ transforms uncorrelated noise into noise with the correct covariance structure.

ℹ️
The jitter term $\epsilon \mathbf{I}$ is essential. In exact arithmetic, $\mathbf{K}$ is positive semi-definite. In floating point, small eigenvalues can go negative, making the Cholesky decomposition fail. Adding a small diagonal term ($10^{-6}$ to $10^{-8}$) fixes this without materially changing the model. Think of it as adding a tiny amount of white noise. Every GP implementation does this.

5.3 Prior Predictive Checks for GPs

Just as we performed prior predictive checks for parametric models in lecture 4, you should always visualise GP prior samples before fitting data. If the prior samples look wildly different from what you expect the data to look like — functions varying on the wrong scale, with the wrong amplitude, or with the wrong smoothness — then your kernel hyperparameters need adjusting before you even touch the data.


6. GP Regression: Conditioning on Data

The central computation in GP regression is conditioning a joint Gaussian distribution on observed values. This is the same operation as Bayesian updating — but applied to the multivariate Gaussian that defines the GP.

6.1 The Setup

Suppose we have:

  • Training data: $N$ observations $(\mathbf{X}, \mathbf{y})$ where $y_i = f(x_i) + \varepsilon_i$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$
  • Test inputs: $N_*$ locations $\mathbf{X}_*$ where we want predictions

The joint distribution of observed and predicted function values is:

$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{bmatrix}, \begin{bmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{K}_* \\ \mathbf{K}_*^\top & \mathbf{K}_{**} \end{bmatrix} \right)$$

where:

  • $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$ is the $N \times N$ kernel matrix between training points
  • $\mathbf{K}_* = k(\mathbf{X}, \mathbf{X}_*)$ is the $N \times N_*$ kernel matrix between training and test points
  • $\mathbf{K}_{**} = k(\mathbf{X}_*, \mathbf{X}_*)$ is the $N_* \times N_*$ kernel matrix between test points
  • $\sigma_n^2 \mathbf{I}$ accounts for observation noise

6.2 The Conditioning Equations

Conditioning a multivariate Gaussian on a subset of its variables is a standard result (you derived the 2D version in lecture 2). The posterior predictive distribution is:

$$\mathbf{f}_* \mid \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$

where:

$$\boxed{\boldsymbol{\mu}_* = \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} (\mathbf{y} - \boldsymbol{\mu})}$$

$$\boxed{\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_*}$$

These two equations are the complete GP prediction. Their interpretation is clean:

  • Posterior mean $\boldsymbol{\mu}_*$: a weighted combination of the observed residuals $(\mathbf{y} - \boldsymbol{\mu})$, where the weights come from the kernel similarity between training and test points, modulated by the inverse of the training covariance.

  • Posterior covariance $\boldsymbol{\Sigma}_*$: starts from the prior covariance $\mathbf{K}_{**}$ and subtracts the information gained from the observations. Far from data, $\mathbf{K}_*$ is small, so the posterior covariance is close to the prior. Near data, $\mathbf{K}_*$ is large, so the posterior covariance is reduced — you are more certain.

ℹ️
The posterior variance does not depend on the observed $y$ values — only on the input locations $\mathbf{X}$ and $\mathbf{X}_*$. This means you can compute where your uncertainties will be large before collecting data. This is the basis for optimal experimental design with GPs.
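A self-contained sketch makes the point structurally: the posterior-variance computation below takes only input locations, so no observed $y$ values ever enter (the function and values are illustrative):

```python
import numpy as np

def posterior_variance(x_train, x_test, ell=1.0, sigma_f=1.0, noise_var=0.04):
    """Posterior variance of a GP with an SE kernel; note that y never appears."""
    k = lambda a, b: sigma_f**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = k(x_train, x_test)
    K_ss = k(x_test, x_test)
    return np.diag(K_ss - K_star.T @ np.linalg.solve(K, K_star))

x_train = np.array([1.0, 3.0, 5.0])
x_test = np.array([1.0, 4.0, 10.0])
print(posterior_variance(x_train, x_test))
# small at a training input, close to the prior variance sigma_f^2 = 1 far away
```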

6.3 Interactive Exploration: Conditioning in Action

The visualisation below shows GP conditioning in action. As data points are added, the GP posterior updates: the mean function is pulled toward the observations, and the uncertainty band narrows near the data while remaining wide in unobserved regions.

6.4 Implementation

import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared exponential kernel."""
    sqdist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'sqeuclidean')
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

def gp_predict(X_train, y_train, X_test, length_scale, variance, noise_var):
    """GP regression: compute posterior mean and variance."""
    K = rbf_kernel(X_train, X_train, length_scale, variance)
    K += noise_var * np.eye(len(X_train))

    K_star = rbf_kernel(X_train, X_test, length_scale, variance)
    K_ss = rbf_kernel(X_test, X_test, length_scale, variance)

    # Cholesky: K = L L^T, then alpha = K^{-1} y and v = L^{-1} K_*
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_star)

    mu_star = K_star.T @ alpha
    var_star = np.diag(K_ss) - np.sum(v**2, axis=0)

    return mu_star, var_star

# Example: 5 training points, predict on a fine grid
X_train = np.array([1.0, 3.0, 5.0, 6.0, 8.0])
y_train = np.sin(X_train) + 0.3 * np.sin(3 * X_train)
y_train += np.random.normal(0, 0.2, len(X_train))  # add noise

X_test = np.linspace(0, 10, 200)
sigma_n = 0.2

mu, var = gp_predict(X_train, y_train, X_test,
                     length_scale=1.0, variance=1.0, noise_var=sigma_n**2)
std = np.sqrt(var)
⚠️
Never compute $\mathbf{K}^{-1}$ explicitly. Use the Cholesky decomposition: factor $\mathbf{K} + \sigma_n^2\mathbf{I} = \mathbf{L}\mathbf{L}^\top$, then solve $\mathbf{L}\boldsymbol{\alpha}_1 = \mathbf{y}$ and $\mathbf{L}^\top\boldsymbol{\alpha} = \boldsymbol{\alpha}_1$ by forward/back substitution. This is numerically stable and costs $\mathcal{O}(N^3/3)$ for the factorisation plus $\mathcal{O}(N^2)$ per prediction. Computing the explicit inverse is both slower and less stable.

7. A Simple Example

7.1 Regression with Five Data Points

Five noisy observations of $f(x) = \sin(x) + 0.3\sin(3x)$ are used to predict the function across the full domain $[0, 10]$. The code in Section 6.4 and the plotting code below produce this example — run it yourself to see the GP posterior mean and uncertainty band.

7.2 Key Observations

Several features are worth noting:

  1. The posterior mean interpolates through (near) the data. The noise term $\sigma_n^2$ allows the mean to not pass exactly through each point — this is regularisation against overfitting to noisy observations.

  2. The uncertainty band widens away from data. The last observation is at $x = 8$, so between $x = 8$ and $x = 10$ the uncertainty grows steadily. Far beyond the data, the GP reverts entirely to the prior: the mean goes to zero and the variance goes to $\sigma_f^2$. This is the correct behaviour — the model honestly reports ignorance.

  3. The uncertainty band narrows near data. At $x = 5$, where there is an observation, the posterior variance is close to zero (up to the noise variance).

  4. The GP captures the shape of the true function despite not being told the functional form. It “discovers” the shape purely from the smoothness encoded in the kernel and the observed data.

7.3 Plotting the Posterior

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

# Posterior mean and credible interval
ax.fill_between(X_test, mu - 2*std, mu + 2*std,
                alpha=0.3, color='steelblue', label='95% credible interval')
ax.plot(X_test, mu, '-', color='steelblue', linewidth=2, label='GP posterior mean')

# True function
x_true = np.linspace(0, 10, 500)
y_true = np.sin(x_true) + 0.3 * np.sin(3 * x_true)
ax.plot(x_true, y_true, '--', color='red', linewidth=2, alpha=0.7, label='True function')

# Observations
ax.errorbar(X_train, y_train, yerr=sigma_n, fmt='ko',
            markersize=8, capsize=4, label='Observations')

ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend(loc='upper left')
ax.set_title('GP Regression: Posterior Mean and Uncertainty')
plt.tight_layout()

8. Connections to What You Already Know

8.1 GPs and Linear Models

The weight-space derivation in Section 2 shows that a GP with a linear kernel $k(x, x') = \sigma_w^2 x x'$ is exactly equivalent to Bayesian linear regression with a Gaussian prior on the slope. Everything from lecture 5 is a special case. The GP simply allows richer kernels that correspond to infinite-dimensional feature spaces.
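This equivalence can be verified directly: the covariance $\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top$ built from a single-column design matrix matches the linear-kernel matrix entry for entry (a sketch with illustrative values):

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 4.0])
sigma_w2 = 2.0                      # prior variance on the slope (illustrative)

# Weight-space view: one basis function phi(x) = x, Gaussian prior on its weight
A = x[:, None]                      # N x 1 design matrix
K_weights = A @ (sigma_w2 * np.eye(1)) @ A.T

# Function-space view: linear kernel k(x, x') = sigma_w^2 * x * x'
K_kernel = sigma_w2 * np.outer(x, x)

print(np.allclose(K_weights, K_kernel))  # True
```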

8.2 GPs and Chi-Squared

The GP log-marginal likelihood (Section 2.5) contains the term $(\mathbf{y} - \boldsymbol{\mu})^\top (\mathbf{C} + \mathbf{K})^{-1} (\mathbf{y} - \boldsymbol{\mu})$. This is the same quadratic form as the chi-squared statistic from lecture 5, but with the covariance matrix $\mathbf{C}$ replaced by $\mathbf{C} + \mathbf{K}$. The kernel $\mathbf{K}$ adds covariance — it says “these data points are correlated, so do not be surprised if they deviate from the mean in the same direction.” Chi-squared treats each residual as independent; the GP does not.

8.3 GPs and Priors

The kernel is the prior. Just as choosing a flat prior vs. a Gaussian prior on a slope parameter changes the posterior in Bayesian regression, choosing a squared exponential kernel vs. a Matern kernel changes the GP posterior. The kernel hyperparameters ($\ell$, $\sigma_f^2$, $P$, $\nu$, etc.) are the analogue of prior hyperparameters — they must be set, optimised, or marginalised over. The next lecture addresses this.


9. Summary

A Gaussian process is a distribution over functions, fully specified by a mean function $m(x)$ and a covariance function (kernel) $k(x, x')$. Any finite collection of function values is jointly Gaussian.

The weight-space view connects GPs to linear models: a GP is the infinite-basis-function limit of a Bayesian linear model, made computationally tractable by the kernel trick. The marginalised log-probability contains a built-in complexity penalty (the Occam factor).

The function-space view provides the geometric intuition: each input point is a dimension of a multivariate Gaussian, and the kernel controls the correlation between dimensions. Observations restrict the set of plausible functions by conditioning.

The kernel encodes prior beliefs about the functions: their smoothness (via $\ell$), their amplitude (via $\sigma_f^2$), their periodicity (via $P$), and their differentiability (via $\nu$ in the Matern family). Kernels can be composed by addition and multiplication.

GP regression conditions the joint Gaussian on observed data. The posterior mean is a weighted combination of observations, and the posterior variance quantifies uncertainty. Far from data, the posterior reverts to the prior — the GP honestly reports ignorance.

The key equations for prediction are: $$\boldsymbol{\mu}_* = \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} (\mathbf{y} - \boldsymbol{\mu})$$ $$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_*$$

The next lecture covers the practical questions: how to set the kernel hyperparameters, what it costs computationally, when GPs fail, and how astronomers use them in the real world.


Further Reading

  • Rasmussen, C. E. & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. [The definitive reference — freely available at gaussianprocess.org/gpml]
  • Roberts, S., Osborne, M., Ebden, M., et al. (2013). Gaussian processes for time-series modelling. Phil. Trans. R. Soc. A, 371, 20110550.
  • Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A Visual Exploration of Gaussian Processes. Distill. [The interactive visualisations embedded in this lecture are from this article, licensed under Apache 2.0]
  • Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D. W., & O’Neil, M. (2015). Fast direct methods for Gaussian processes. IEEE TPAMI, 38, 252.