Gaussian Processes I
From Parametric Models to Infinite-Dimensional Priors
Every model we have built so far has been parametric: you specify a functional form — a line, a polynomial, a mixture of Gaussians — and the data determine the values of a finite set of parameters. The functional form itself is fixed by assumption. Gaussian processes (GPs) remove this constraint. A GP places a prior probability over functions, not over parameters, and lets the data decide which functions are plausible. The result is a flexible, probabilistic model that provides calibrated uncertainty estimates everywhere — including in regions far from any observation.
This lecture builds up the GP framework from first principles. We start from the familiar territory of linear models (lecture 7), show how marginalising over weights leads naturally to the kernel trick and the GP, then develop the function-space view that provides the geometric intuition. We introduce kernels, draw samples from the prior, and derive the conditioning equations that produce predictions given data. The next lecture covers practical implementation: hyperparameter optimisation, computational costs, and real astronomical applications.
1. Why Go Non-Parametric?
1.1 The Limits of Parametric Models
In lectures 5 and 7, we fitted straight lines and polynomial models to data. These are powerful when the true underlying relationship really is (approximately) a polynomial. But consider three situations where they break down:
Unknown functional form. A stellar light curve during a flare has no simple closed-form expression. You could try a sum of exponentials, but how many? With what constraints?
Nuisance signals. Stellar activity imprints quasi-periodic, slowly-varying signals on radial velocity data. You want to model this correlated noise to remove it — but writing down a parametric model for “stuff the star does” is, at best, fragile.
Interpolation with honest uncertainties. You observe a supernova at irregular cadence and need to interpolate the light curve to estimate peak brightness. A polynomial can interpolate, but its uncertainty estimates away from the data are controlled by the polynomial degree, not by the data themselves.
In each case, the problem is the same: a parametric model forces you to commit to a functional form before seeing the data. If that form is wrong, the uncertainties you derive are wrong too — not just in magnitude, but in kind.
1.2 The Non-Parametric Alternative
A Gaussian process sidesteps the functional-form problem entirely. Instead of choosing $f(x) = mx + b$ or $f(x) = \sum_k w_k \phi_k(x)$ and fitting coefficients, a GP defines a probability distribution over functions. The “prior” is not over a parameter vector $\boldsymbol{\theta}$ but over the space of all smooth functions $f: \mathbb{R} \to \mathbb{R}$ (subject to regularity conditions encoded by the kernel).
2. The Weight-Space View
We arrive at Gaussian processes by starting from something you already know: a parametric linear model with a prior on its weights. The path is: linear model → Bayesian linear model → marginalise over weights → kernel trick → Gaussian process.
2.1 From Linear Models to Feature Spaces
In lecture 7 we wrote the linear model as:
$$f(\mathbf{x}) = \mathbf{A}\mathbf{w}$$
where $\mathbf{A}$ is the $N \times M$ design matrix of basis functions and $\mathbf{w}$ is the $M$-dimensional weight vector. The basis functions could be polynomials ($1, x, x^2, \ldots$), Fourier modes, radial basis functions, or anything else.
Place a Gaussian prior on the weights: $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Lambda})$, where $\boldsymbol{\Lambda}$ is the prior covariance. The log-probability of the data is:
$$\log p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}, \mathbf{w}) = -\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu}(\boldsymbol{\theta}) - \mathbf{A}\mathbf{w})^\top \mathbf{C}^{-1} (\mathbf{y} - \boldsymbol{\mu}(\boldsymbol{\theta}) - \mathbf{A}\mathbf{w}) - \frac{1}{2}\log|2\pi\mathbf{C}|$$
where $\mathbf{C}$ is the data covariance and $\boldsymbol{\mu}(\boldsymbol{\theta})$ is a mean model with parameters $\boldsymbol{\theta}$.
2.2 Marginalising Over the Weights
We do not want to infer $\mathbf{w}$ — it is a nuisance parameter. Marginalise it out:
$$p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}) = \int p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}, \mathbf{w}) p(\mathbf{w}) d\mathbf{w}$$
Both the likelihood and the prior are Gaussian, so the integral is analytic (a Gaussian integrated against a Gaussian is a Gaussian — the same trick from lecture 5, Section 9). The result is:
$$p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}) = \mathcal{N}\big(\mathbf{y} \,\big|\, \boldsymbol{\mu}(\boldsymbol{\theta}),\; \mathbf{C} + \mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top\big)$$
The key quantity is the effective covariance:
$$\mathbf{K}_{\text{eff}} = \mathbf{C} + \mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top$$
The matrix $\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top$ is an $N \times N$ matrix whose $(i,j)$ entry is:
$$[\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top]_{ij} = \sum_{m=1}^{M} \sum_{m'=1}^{M} A_{im} \Lambda_{mm'} A_{jm'} = \boldsymbol{\phi}(x_i)^\top \boldsymbol{\Lambda} \boldsymbol{\phi}(x_j)$$
where $\boldsymbol{\phi}(x_i)$ is the vector of basis functions evaluated at $x_i$.
2.3 The Kernel Trick
Here is the insight that turns a finite-dimensional linear model into a Gaussian process. The entry $[\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top]_{ij}$ depends on the inputs $x_i$ and $x_j$ only through the inner product $\boldsymbol{\phi}(x_i)^\top \boldsymbol{\Lambda} \boldsymbol{\phi}(x_j)$. Define:
$$k(x_i, x_j) \equiv \boldsymbol{\phi}(x_i)^\top \boldsymbol{\Lambda} \boldsymbol{\phi}(x_j)$$
This is a kernel function — it takes two inputs and returns a scalar measuring their “similarity” in feature space. Now ask: what happens as $M \to \infty$? As the number of basis functions grows, you cannot explicitly construct $\mathbf{A}$ or $\boldsymbol{\Lambda}$. But you never need to — you only need the kernel $k(x_i, x_j)$, which replaces the inner product of two infinite-dimensional feature vectors with a single function evaluation.
$$[\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top]_{ij} \to k(x_i, x_j)$$
This is the kernel trick: replace an inner product of (potentially infinite-dimensional) feature vectors with a kernel function. The GP is the infinite-basis-function limit of a Bayesian linear model.
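To see this limit concretely, take many Gaussian-bump (RBF) basis functions on a dense grid of centres. As the grid becomes dense, $\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top$ converges to a squared exponential kernel (introduced in Section 4.1) with length scale $\sqrt{2}\lambda$, where $\lambda$ is the width of each basis function. The sketch below uses an illustrative width and grid spacing:

```python
import numpy as np

# Dense grid of RBF basis-function centres, extending well beyond the data
# range so edge effects are negligible.
lam = 1.0                                  # width of each basis function (illustrative)
centers = np.arange(-10.0, 20.0, 0.05)
x = np.linspace(0.0, 10.0, 50)

# Design matrix: A[i, m] = exp(-(x_i - c_m)^2 / (2 lam^2))
A = np.exp(-0.5 * (x[:, None] - centers[None, :])**2 / lam**2)

# Weight-prior variance scaled so the sum converges to a unit-amplitude kernel:
# sum_m phi_m(x) phi_m(x') * dc -> sqrt(pi) * lam * exp(-(x - x')^2 / (4 lam^2))
dc = centers[1] - centers[0]
prior_var = dc / (np.sqrt(np.pi) * lam)
K_features = prior_var * A @ A.T

# The analytic limit: a squared exponential kernel with length scale sqrt(2) * lam
K_se = np.exp(-0.25 * (x[:, None] - x[None, :])**2 / lam**2)

print(np.max(np.abs(K_features - K_se)))   # tiny: the dense-grid sum has converged
```

With 600 basis functions the two matrices already agree to floating-point precision; the kernel lets you skip the 600-column design matrix entirely.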
2.4 A Concrete Example: From Polynomials to Kernels
To make the kernel trick tangible, consider a quadratic basis: $\boldsymbol{\phi}(x) = (1, x, x^2)^\top$ with prior $\boldsymbol{\Lambda} = \text{diag}(\sigma_0^2, \sigma_1^2, \sigma_2^2)$. The kernel is:
$$k(x, x') = \sigma_0^2 + \sigma_1^2 x x' + \sigma_2^2 x^2 x'^2$$
This is just the covariance between $f(x)$ and $f(x')$ that arises from the weight prior. For three basis functions, you could write out $\mathbf{A}$ and compute $\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^\top$ explicitly. For 1000 RBF basis functions, you still could. For infinitely many? You cannot — but the kernel $k(x, x')$ still exists and can be evaluated.
import numpy as np
def polynomial_kernel(x1, x2, sigmas=(1.0, 1.0, 0.5)):
    """Kernel from a quadratic basis with diagonal weight prior."""
    s0, s1, s2 = sigmas
    return s0**2 + s1**2 * np.outer(x1, x2) + s2**2 * np.outer(x1**2, x2**2)
# Verify: compare kernel matrix to A @ Lambda @ A.T
x = np.array([0.5, 1.0, 2.0, 3.0])
A = np.column_stack([np.ones_like(x), x, x**2])
Lambda = np.diag([1.0, 1.0, 0.25])
K_from_basis = A @ Lambda @ A.T
K_from_kernel = polynomial_kernel(x, x)
print("Max difference:", np.max(np.abs(K_from_basis - K_from_kernel)))
# 0.0 — they are identical

2.5 The Marginalised Log-Probability
After the kernel trick, the marginalised probability becomes:
$$\log p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\psi}) = -\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top (\mathbf{C} + \mathbf{K})^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \frac{1}{2}\log|\mathbf{C} + \mathbf{K}| - \frac{N}{2}\log 2\pi$$
where $\mathbf{K}$ is the $N \times N$ kernel matrix with entries $K_{ij} = k(x_i, x_j)$. This expression has three terms:
- Data fit: $(\mathbf{y} - \boldsymbol{\mu})^\top (\mathbf{C} + \mathbf{K})^{-1} (\mathbf{y} - \boldsymbol{\mu})$ — how well the model fits the data, penalised by the total covariance.
- Complexity penalty: $\log|\mathbf{C} + \mathbf{K}|$ — penalises complex models that require a large covariance to fit the data. This is the built-in Occam’s razor.
- Normalisation: $\frac{N}{2}\log 2\pi$ — a constant.
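The three-term expression is easy to sanity-check numerically: evaluate it directly and compare against a library implementation of the Gaussian log-density. The kernel, noise level, and toy data below are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)
N = 20
x = np.sort(rng.uniform(0, 10, N))
y = np.sin(x) + rng.normal(0, 0.3, N)

# RBF kernel matrix K and diagonal (white) noise covariance C
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)
C = 0.09 * np.eye(N)
mu = np.zeros(N)                     # zero mean model for simplicity

cov = C + K
r = y - mu
logp = (-0.5 * r @ np.linalg.solve(cov, r)
        - 0.5 * np.linalg.slogdet(cov)[1]
        - 0.5 * N * np.log(2 * np.pi))

# Cross-check against scipy's multivariate Gaussian log-density
logp_scipy = multivariate_normal(mean=mu, cov=cov).logpdf(y)
print(logp, logp_scipy)              # the two agree to numerical precision
```

Note the use of `slogdet` rather than `np.log(np.linalg.det(...))`: for large covariance matrices the determinant itself overflows or underflows, while its log stays well behaved.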
3. The Function-Space View
The weight-space derivation shows where GPs come from. The function-space view provides the intuition for how they work.
3.1 A GP Is a Distribution Over Functions
The defining statement is:
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
Concretely, suppose we have inputs $x_1, x_2, \ldots, x_N$. A GP says that the function values $(f(x_1), f(x_2), \ldots, f(x_N))$ are jointly Gaussian:
$$\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_N) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_N) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{bmatrix} \right)$$
The GP is fully specified by two things:
- A mean function $m(x) = \mathbb{E}[f(x)]$, which is often set to zero (or to a simple parametric trend subtracted before fitting).
- A covariance function (or kernel) $k(x, x') = \text{Cov}[f(x), f(x')]$, which encodes the prior belief about how function values at different inputs are correlated.
We write this compactly as:
$$f \sim \mathcal{GP}\big(m(x), k(x, x')\big)$$
3.2 Intuition: Each Data Point Is a Dimension
Here is the key mental model. Think of $f(x_1)$ and $f(x_2)$ as two random variables — two axes of a multivariate Gaussian. If $x_1$ and $x_2$ are close together, the kernel $k(x_1, x_2)$ is large, meaning $f(x_1)$ and $f(x_2)$ are strongly correlated: knowing one tightly constrains the other. If $x_1$ and $x_2$ are far apart, the kernel is small, meaning the function values are nearly independent.
This is exactly how a multivariate Gaussian works — but instead of 2 or 10 dimensions, you have as many dimensions as you have input points. And the GP definition guarantees that this works for any finite collection, no matter how large.
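This mental model can be checked directly: draw many samples of the pair $(f(x_1), f(x_2))$ from the bivariate Gaussian defined by an RBF kernel, and compare the sample correlation with the kernel value. The separations below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(x1, x2, ell=1.0):
    return np.exp(-0.5 * (x1 - x2)**2 / ell**2)

results = {}
# One nearby pair of inputs and one distant pair
for x1, x2 in [(0.0, 0.3), (0.0, 5.0)]:
    K = np.array([[rbf(x1, x1), rbf(x1, x2)],
                  [rbf(x2, x1), rbf(x2, x2)]])
    f = rng.multivariate_normal([0.0, 0.0], K, size=50_000)
    sample_corr = np.corrcoef(f[:, 0], f[:, 1])[0, 1]
    results[(x1, x2)] = (K[0, 1], sample_corr)
    print(f"x1={x1}, x2={x2}: kernel={K[0, 1]:.3f}, sample corr={sample_corr:.3f}")
```

For the nearby pair the sample correlation matches the kernel value (about 0.96); for the distant pair both are essentially zero.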
3.3 Interactive Exploration: The Two-Dimensional Case
To build intuition, start with just two input points $x_1$ and $x_2$. The function values $f(x_1)$ and $f(x_2)$ form a bivariate Gaussian whose correlation is set by $k(x_1, x_2)$. When $x_1$ and $x_2$ are close, the correlation is high: dragging $f(x_1)$ up pulls $f(x_2)$ up with it. When they are far apart, the correlation vanishes and the values move independently.
The interactive visualisation below (from Görtler, Kehlbeck & Deussen, 2019) lets you explore this directly. On the left you see the bivariate Gaussian distribution; on the right, those two values are plotted as function values at their respective $x$ positions. Drag the points and watch how the covariance structure constrains the function.
3.4 Observations Restrict the Set of Possible Functions
Before seeing data, the GP prior assigns probability to every smooth function consistent with the kernel. The prior is broad: many different functions are plausible. When you observe $y_1 = f(x_1) + \varepsilon_1$, you are conditioning the joint distribution on the observed value at $x_1$. This eliminates all functions that do not pass near $y_1$ at $x_1$. Each additional observation further restricts the set of plausible functions.
This is the function-space analogue of Bayes’ theorem. The prior over functions (encoded by the kernel) is updated by the likelihood of the data to produce a posterior over functions. The posterior mean is the “best prediction,” and the posterior variance quantifies the remaining uncertainty.
The interactive visualisation below lets you draw your own GP prior samples. Adjust the kernel hyperparameters and watch how the character of the sampled functions changes — each draw is one function from the GP prior.
4. Kernels: Encoding Prior Beliefs
The kernel $k(x, x')$ is the heart of a Gaussian process. It determines the properties of the functions the GP considers plausible: their smoothness, their amplitude, their characteristic length scales, and whether they are periodic.
4.1 The Squared Exponential (RBF) Kernel
The most common kernel is the squared exponential (also called the radial basis function or RBF kernel):
$$k_{\text{SE}}(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$$
It has two hyperparameters:
- $\sigma_f^2$: the signal variance — controls the overall amplitude of the function. Larger $\sigma_f^2$ means the GP considers functions with larger excursions from the mean.
- $\ell$: the length scale — controls the “wiggliness.” Nearby points (within a distance $\ell$) are strongly correlated; distant points are nearly independent.
Properties: Functions drawn from a GP with a squared exponential kernel are infinitely differentiable — they are extremely smooth. This is appropriate for many physical processes but can be too smooth for some applications (e.g., rough terrain, turbulence).
4.2 The Effect of the Length Scale
The length scale $\ell$ is arguably the single most important hyperparameter. It controls how quickly the correlation between function values decays with distance:
- Short $\ell$: functions vary rapidly, the GP can capture fine structure, but predictions far from data revert quickly to the prior mean.
- Long $\ell$: functions are smooth and slowly varying, the GP interpolates broadly between data points, but cannot capture sharp features.
4.3 The Matern Kernel
The Matern family generalises the squared exponential by introducing a smoothness parameter $\nu$:
$$k_{\text{Matern}}(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu} r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu} r}{\ell}\right)$$
where $r = |x - x’|$ and $K_\nu$ is the modified Bessel function of the second kind. The parameter $\nu$ controls differentiability:
- $\nu = 1/2$: Matern-1/2 — produces functions that are continuous but not differentiable (like Brownian motion). Equivalent to the exponential kernel $k(r) = \sigma_f^2 \exp(-r/\ell)$.
- $\nu = 3/2$: Matern-3/2 — once differentiable. A good default for many physical processes.
- $\nu = 5/2$: Matern-5/2 — twice differentiable. Smooth but not infinitely so.
- $\nu \to \infty$: recovers the squared exponential.
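The Bessel-function form looks intimidating but is only a few lines of code. The sketch below, built on `scipy.special.kv`, checks the $\nu = 1/2$ and $\nu = 3/2$ special cases against their closed forms:

```python
import numpy as np
from scipy.special import gamma, kv

def matern_kernel(r, nu, length_scale=1.0, variance=1.0):
    """General Matern kernel via the modified Bessel function K_nu."""
    r = np.atleast_1d(np.abs(np.asarray(r, dtype=float)))
    z = np.sqrt(2 * nu) * r / length_scale
    k = np.full_like(r, variance)        # k(0) = sigma_f^2 (the z -> 0 limit)
    pos = z > 0
    k[pos] = variance * 2**(1 - nu) / gamma(nu) * z[pos]**nu * kv(nu, z[pos])
    return k

r = np.linspace(0.01, 5.0, 100)
# nu = 1/2 should reproduce the exponential kernel exactly
print(np.max(np.abs(matern_kernel(r, 0.5) - np.exp(-r))))
# nu = 3/2 should match the closed form (1 + sqrt(3) r/l) exp(-sqrt(3) r/l)
closed_32 = (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)
print(np.max(np.abs(matern_kernel(r, 1.5) - closed_32)))
```

The explicit $r = 0$ branch matters: `kv` diverges at zero argument, but the kernel's limit there is simply the variance $\sigma_f^2$.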
4.4 The Periodic Kernel
For data with a known periodicity (stellar rotation, orbital motion, seasonal variation):
$$k_{\text{Per}}(x, x') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi |x - x'| / P)}{\ell^2}\right)$$
where $P$ is the period. This kernel produces functions that are exactly periodic with period $P$, with the length scale $\ell$ controlling the smoothness within each period.
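Two properties are worth verifying numerically: the kernel returns exactly $\sigma_f^2$ for inputs an integer number of periods apart (perfect correlation), and it dips lowest for inputs half a period apart. A sketch with illustrative hyperparameters:

```python
import numpy as np

def periodic_kernel(x1, x2, P=2.0, ell=0.7, variance=1.0):
    return variance * np.exp(-2 * np.sin(np.pi * np.abs(x1 - x2) / P)**2 / ell**2)

x = np.linspace(0, 10, 7)
P = 2.0
k_full_period = periodic_kernel(x, x + 3 * P)   # three full periods apart
k_half_period = periodic_kernel(x, x + P / 2)   # half a period apart
print(k_full_period)   # every entry is 1.0 (the variance): perfect correlation
print(k_half_period)   # exp(-2/ell^2): the strongest decorrelation the kernel allows
```

No matter how far apart two inputs are, the correlation never decays below $\sigma_f^2 \exp(-2/\ell^2)$, which is why purely periodic kernels are usually multiplied by a decaying kernel for real signals (Section 4.7).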
4.5 Kernel Summary
| Kernel | Formula | Key hyperparameters |
|---|---|---|
| Squared Exponential (RBF) | $k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$ | $\sigma_f^2$ (amplitude), $\ell$ (length scale) |
| Matern-3/2 | $k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{3} r}{\ell}\right) \exp\left(-\frac{\sqrt{3} r}{\ell}\right)$ | $\sigma_f^2$, $\ell$ |
| Matern-5/2 | $k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5} r}{\ell} + \frac{5 r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5} r}{\ell}\right)$ | $\sigma_f^2$, $\ell$ |
| Periodic | $k(x, x') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi \lvert x - x' \rvert / P)}{\ell^2}\right)$ | $\sigma_f^2$, $\ell$, $P$ (period) |
| Linear | $k(x, x') = \sigma_f^2 (x - c)(x' - c)$ | $\sigma_f^2$, $c$ (hinge point) |
The interactive figure below lets you compare prior samples from these kernels. Adjust the hyperparameter sliders and watch how the character of the sampled functions changes.
4.6 Stationary vs Non-Stationary Kernels
A kernel is stationary if it depends only on the difference $r = x - x'$, not on the absolute positions $x$ and $x'$. The squared exponential, Matern, and periodic kernels above are all stationary. Stationary kernels assume the function has the same statistical properties everywhere — the same smoothness, the same amplitude, the same correlation length.
Non-stationary kernels — where $k(x, x')$ depends on the positions themselves, like the linear kernel in the table above — allow different behaviour in different regions. These are more flexible but harder to specify and fit.
4.7 Combining Kernels
Individual kernels can be combined to express richer structure:
Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$ — the GP models the sum of two independent processes. Use this to capture a smooth long-term trend plus short-term oscillations.
Product: $k(x, x') = k_1(x, x') \times k_2(x, x')$ — the correlation must be high in both components simultaneously. Use this to create locally periodic signals that decay over time (a quasi-periodic kernel is $k_{\text{Per}} \times k_{\text{SE}}$).
Explore kernel combinations interactively below:
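As a concrete sketch of the product rule, the quasi-periodic kernel $k_{\text{Per}} \times k_{\text{SE}}$ can be built and sampled in a few lines. The hyperparameters below are illustrative:

```python
import numpy as np

def quasi_periodic_kernel(x1, x2, P=3.0, ell_per=1.0, ell_decay=5.0, variance=1.0):
    """Periodic kernel times a squared exponential: oscillations whose
    shape drifts on the timescale ell_decay."""
    dx = np.subtract.outer(x1, x2)
    per = np.exp(-2 * np.sin(np.pi * np.abs(dx) / P)**2 / ell_per**2)
    se = np.exp(-0.5 * dx**2 / ell_decay**2)
    return variance * per * se

# Draw prior samples to see the locally periodic behaviour
x = np.linspace(0, 30, 300)
K = quasi_periodic_kernel(x, x) + 1e-6 * np.eye(len(x))  # jitter for stability
L = np.linalg.cholesky(K)
samples = L @ np.random.default_rng(1).standard_normal((len(x), 3))
```

Each column of `samples` oscillates with period $P$, but the oscillation pattern slowly loses memory of itself over the decay length scale, which is the signature of stellar-rotation signals in radial velocity data.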
5. Samples from the Prior
Drawing samples from a GP prior is straightforward and provides a powerful check on whether your kernel encodes sensible prior beliefs.
5.1 How to Sample
Given $N$ test inputs $\mathbf{x}_* = (x_1^*, x_2^*, \ldots, x_N^*)$:

1. Compute the $N \times N$ covariance matrix $\mathbf{K}_{**}$ with entries $[\mathbf{K}_{**}]_{ij} = k(x_i^*, x_j^*)$.
2. Compute the Cholesky decomposition $\mathbf{K}_{**} = \mathbf{L}\mathbf{L}^\top$ (this is the matrix “square root”).
3. Draw a vector of $N$ independent standard normal samples: $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
4. Compute $\mathbf{f}_* = m(\mathbf{x}_*) + \mathbf{L}\mathbf{u}$.
The resulting $\mathbf{f}_*$ is a draw from the GP prior. Repeating steps 3–4 gives multiple prior samples.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import cholesky
def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared exponential (RBF) kernel."""
    sqdist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'sqeuclidean')
    return variance * np.exp(-0.5 * sqdist / length_scale**2)
# Test inputs
x_star = np.linspace(0, 10, 200)
# Covariance matrix from kernel
K = rbf_kernel(x_star, x_star, length_scale=1.0, variance=1.0)
K += 1e-8 * np.eye(len(x_star)) # numerical stability (jitter)
# Cholesky decomposition
L = cholesky(K, lower=True)
# Draw 5 samples from the GP prior
n_samples = 5
u = np.random.randn(len(x_star), n_samples)
f_samples = L @ u  # each column is one function draw

5.2 Why the Cholesky Decomposition?
The Cholesky decomposition plays the same role here as it does in sampling from any multivariate Gaussian. Recall from lecture 2 that if $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then $\mathbf{L}\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{L}\mathbf{L}^\top) = \mathcal{N}(\mathbf{0}, \mathbf{K})$. The Cholesky factor $\mathbf{L}$ transforms uncorrelated noise into noise with the correct covariance structure.
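This claim is cheap to verify: draw many vectors $\mathbf{L}\mathbf{u}$ and check that their empirical covariance converges to $\mathbf{K}$. The grid and sample count below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
K = np.exp(-0.5 * np.subtract.outer(x, x)**2) + 1e-10 * np.eye(len(x))
L = np.linalg.cholesky(K)

# Transform white noise u ~ N(0, I) into correlated draws L @ u and check
# that the empirical covariance of many draws matches K
u = rng.standard_normal((len(x), 200_000))
f = L @ u
K_emp = np.cov(f)
print(np.max(np.abs(K_emp - K)))   # shrinks like 1/sqrt(n_samples)
```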
5.3 Prior Predictive Checks for GPs
Just as we performed prior predictive checks for parametric models in lecture 4, you should always visualise GP prior samples before fitting data. If the prior samples look wildly different from what you expect the data to look like — functions varying on the wrong scale, with the wrong amplitude, or with the wrong smoothness — then your kernel hyperparameters need adjusting before you even touch the data.
6. GP Regression: Conditioning on Data
The central computation in GP regression is conditioning a joint Gaussian distribution on observed values. This is the same operation as Bayesian updating — but applied to the multivariate Gaussian that defines the GP.
6.1 The Setup
Suppose we have:
- Training data: $N$ observations $(\mathbf{X}, \mathbf{y})$ where $y_i = f(x_i) + \varepsilon_i$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$
- Test inputs: $N_*$ locations $\mathbf{X}_*$ where we want predictions
The joint distribution of observed and predicted function values is:
$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{bmatrix}, \begin{bmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{K}_* \\ \mathbf{K}_*^\top & \mathbf{K}_{**} \end{bmatrix} \right)$$
where:
- $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$ is the $N \times N$ kernel matrix between training points
- $\mathbf{K}_* = k(\mathbf{X}, \mathbf{X}_*)$ is the $N \times N_*$ kernel matrix between training and test points
- $\mathbf{K}_{**} = k(\mathbf{X}_*, \mathbf{X}_*)$ is the $N_* \times N_*$ kernel matrix between test points
- $\sigma_n^2 \mathbf{I}$ accounts for observation noise
6.2 The Conditioning Equations
Conditioning a multivariate Gaussian on a subset of its variables is a standard result (you derived the 2D version in lecture 2). The posterior predictive distribution is:
$$\mathbf{f}_* \mid \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$
where:
$$\boxed{\boldsymbol{\mu}_* = \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} (\mathbf{y} - \boldsymbol{\mu})}$$
$$\boxed{\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_*}$$
These two equations are the complete GP prediction. Their interpretation is clean:
Posterior mean $\boldsymbol{\mu}_*$: a weighted combination of the observed residuals $(\mathbf{y} - \boldsymbol{\mu})$, where the weights come from the kernel similarity between training and test points, modulated by the inverse of the training covariance.
Posterior covariance $\boldsymbol{\Sigma}_*$: starts from the prior covariance $\mathbf{K}_{**}$ and subtracts the information gained from the observations. Far from data, $\mathbf{K}_*$ is small, so the posterior covariance is close to the prior. Near data, $\mathbf{K}_*$ is large, so the posterior covariance is reduced — you are more certain.
6.3 Interactive Exploration: Conditioning in Action
The visualisation below shows GP conditioning in action. As data points are added, the GP posterior updates: the mean function is pulled toward the observations, and the uncertainty band narrows near the data while remaining wide in unobserved regions.
6.4 Implementation
import numpy as np
from scipy.spatial.distance import cdist
def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared exponential kernel."""
    sqdist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'sqeuclidean')
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

def gp_predict(X_train, y_train, X_test, length_scale, variance, noise_var):
    """GP regression: compute posterior mean and variance."""
    K = rbf_kernel(X_train, X_train, length_scale, variance)
    K += noise_var * np.eye(len(X_train))
    K_star = rbf_kernel(X_train, X_test, length_scale, variance)
    K_ss = rbf_kernel(X_test, X_test, length_scale, variance)
    # Solve K^{-1} y and K^{-1} K_* using Cholesky
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_star)
    mu_star = K_star.T @ alpha
    var_star = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mu_star, var_star
# Example: 5 training points, predict on a fine grid
X_train = np.array([1.0, 3.0, 5.0, 6.0, 8.0])
y_train = np.sin(X_train) + 0.3 * np.sin(3 * X_train)
y_train += np.random.normal(0, 0.2, len(X_train)) # add noise
X_test = np.linspace(0, 10, 200)
sigma_n = 0.2
mu, var = gp_predict(X_train, y_train, X_test,
                     length_scale=1.0, variance=1.0, noise_var=sigma_n**2)
std = np.sqrt(var)

7. A Simple Example
7.1 Regression with Five Data Points
Five noisy observations of $f(x) = \sin(x) + 0.3\sin(3x)$ are used to predict the function across the full domain $[0, 10]$. The code in Section 6.4 and the plotting code below produce this example — run it yourself to see the GP posterior mean and uncertainty band.
7.2 Key Observations
Several features are worth noting:
The posterior mean interpolates through (near) the data. The noise term $\sigma_n^2$ allows the mean to not pass exactly through each point — this is regularisation against overfitting to noisy observations.
The uncertainty band widens away from data. Beyond the last observation at $x = 8$, the uncertainty grows steadily, and far past the data the GP reverts entirely to the prior: the mean goes to zero and the variance goes to $\sigma_f^2$. This is the correct behaviour — the model honestly reports ignorance.
The uncertainty band narrows near data. At $x = 5$, where there is an observation, the posterior variance is close to zero (up to the noise variance).
The GP captures the shape of the true function despite not being told the functional form. It “discovers” the shape purely from the smoothness encoded in the kernel and the observed data.
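The reversion to the prior can be checked directly with the conditioning equations. The condensed sketch below repeats the Section 6.4 prediction for two test points, one sitting on a training point and one far beyond the data (the far point $x = 20$ is chosen for illustration):

```python
import numpy as np

def rbf(x1, x2, ell=1.0, var=1.0):
    return var * np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / ell**2)

# Same setup as Section 6.4, condensed (hyperparameters are illustrative)
X_train = np.array([1.0, 3.0, 5.0, 6.0, 8.0])
y_train = np.sin(X_train) + 0.3 * np.sin(3 * X_train)
noise_var = 0.04

K = rbf(X_train, X_train) + noise_var * np.eye(len(X_train))
X_test = np.array([5.0, 20.0])            # one point on the data, one far beyond it
K_star = rbf(X_train, X_test)
K_ss = rbf(X_test, X_test)

mu = K_star.T @ np.linalg.solve(K, y_train)
Sigma = K_ss - K_star.T @ np.linalg.solve(K, K_star)
print("var at x=5 :", Sigma[0, 0])   # small: nearby data pin the function down
print("var at x=20:", Sigma[1, 1])   # ~ sigma_f^2 = 1: prior variance recovered
print("mean at x=20:", mu[1])        # ~ 0: prior mean recovered
```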
7.3 Plotting the Posterior
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

# Posterior mean and credible interval
ax.fill_between(X_test, mu - 2*std, mu + 2*std,
                alpha=0.3, color='steelblue', label='95% credible interval')
ax.plot(X_test, mu, '-', color='steelblue', linewidth=2, label='GP posterior mean')

# True function
x_true = np.linspace(0, 10, 500)
y_true = np.sin(x_true) + 0.3 * np.sin(3 * x_true)
ax.plot(x_true, y_true, '--', color='red', linewidth=2, alpha=0.7, label='True function')

# Observations
ax.errorbar(X_train, y_train, yerr=sigma_n, fmt='ko',
            markersize=8, capsize=4, label='Observations')

ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend(loc='upper left')
ax.set_title('GP Regression: Posterior Mean and Uncertainty')
plt.tight_layout()

8. Connections to What You Already Know
8.1 GPs and Linear Models
The weight-space derivation in Section 2 shows that a GP with a linear kernel $k(x, x') = \sigma_w^2 x x'$ is exactly equivalent to Bayesian linear regression with a Gaussian prior on the slope. Everything from lecture 5 is a special case. The GP simply allows richer kernels that correspond to infinite-dimensional feature spaces.
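This equivalence is easy to demonstrate: fit the same data once with the Bayesian posterior-mean slope and once with a GP using the linear kernel, and compare the predictions. The prior and noise variances below are illustrative, and the model is a pure slope with no intercept:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, 15)
y = 1.7 * x + rng.normal(0, 0.3, 15)
sigma_w2, sigma_n2 = 4.0, 0.09        # slope-prior and noise variances (illustrative)

# Bayesian linear regression: posterior-mean slope, then predict
w_bar = (x @ y) / (x @ x + sigma_n2 / sigma_w2)
x_star = np.linspace(-3, 3, 11)
pred_linear = w_bar * x_star

# GP with the linear kernel k(x, x') = sigma_w^2 x x'
K = sigma_w2 * np.outer(x, x) + sigma_n2 * np.eye(len(x))
K_star = sigma_w2 * np.outer(x, x_star)
pred_gp = K_star.T @ np.linalg.solve(K, y)

print(np.max(np.abs(pred_linear - pred_gp)))   # agree to numerical precision
```

The agreement is a consequence of the matrix inversion (Woodbury) lemma: the $M$-dimensional weight-space solve and the $N$-dimensional function-space solve are algebraically identical.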
8.2 GPs and Chi-Squared
The GP log-marginal likelihood (Section 2.5) contains the term $(\mathbf{y} - \boldsymbol{\mu})^\top (\mathbf{C} + \mathbf{K})^{-1} (\mathbf{y} - \boldsymbol{\mu})$. This is the same quadratic form as the chi-squared statistic from lecture 5, but with the covariance matrix $\mathbf{C}$ replaced by $\mathbf{C} + \mathbf{K}$. The kernel $\mathbf{K}$ adds covariance — it says “these data points are correlated, so do not be surprised if they deviate from the mean in the same direction.” Chi-squared treats each residual as independent; the GP does not.
8.3 GPs and Priors
The kernel is the prior. Just as choosing a flat prior vs. a Gaussian prior on a slope parameter changes the posterior in Bayesian regression, choosing a squared exponential kernel vs. a Matern kernel changes the GP posterior. The kernel hyperparameters ($\ell$, $\sigma_f^2$, $P$, $\nu$, etc.) are the analogue of prior hyperparameters — they must be set, optimised, or marginalised over. The next lecture addresses this.
9. Summary
A Gaussian process is a distribution over functions, fully specified by a mean function $m(x)$ and a covariance function (kernel) $k(x, x')$. Any finite collection of function values is jointly Gaussian.
The weight-space view connects GPs to linear models: a GP is the infinite-basis-function limit of a Bayesian linear model, made computationally tractable by the kernel trick. The marginalised log-probability contains a built-in complexity penalty (the Occam factor).
The function-space view provides the geometric intuition: each input point is a dimension of a multivariate Gaussian, and the kernel controls the correlation between dimensions. Observations restrict the set of plausible functions by conditioning.
The kernel encodes prior beliefs about the functions: their smoothness (via $\ell$), their amplitude (via $\sigma_f^2$), their periodicity (via $P$), and their differentiability (via $\nu$ in the Matern family). Kernels can be composed by addition and multiplication.
GP regression conditions the joint Gaussian on observed data. The posterior mean is a weighted combination of observations, and the posterior variance quantifies uncertainty. Far from data, the posterior reverts to the prior — the GP honestly reports ignorance.
The key equations for prediction are: $$\boldsymbol{\mu}_* = \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} (\mathbf{y} - \boldsymbol{\mu})$$ $$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_*$$
The next lecture covers the practical questions: how to set the kernel hyperparameters, what it costs computationally, when GPs fail, and how astronomers use them in the real world.
Further Reading
- Rasmussen, C. E. & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. [The definitive reference — freely available at gaussianprocess.org/gpml]
- Roberts, S., Osborne, M., Ebden, M., et al. (2013). Gaussian processes for time-series modelling. Phil. Trans. R. Soc. A, 371, 20110550.
- Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A Visual Exploration of Gaussian Processes. Distill. [The interactive visualisations embedded in this lecture are from this article, licensed under Apache 2.0]
- Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D. W., & O’Neil, M. (2015). Fast direct methods for Gaussian processes. IEEE TPAMI, 38, 252.