Gaussian Processes II

Practical GP Regression and Applications

The previous lecture built the mathematical framework: a GP is a distribution over functions, the kernel encodes prior beliefs, and conditioning on data gives predictions with calibrated uncertainties. This lecture turns that framework into a practical tool. We address the three questions that arise the moment you try to use a GP on real data: how to set the kernel hyperparameters (Section 2), what the computation costs (Section 5), and what the failure modes are (Section 6). We then work through two extended astronomical examples — atmospheric CO$_2$ fitting with a composite kernel (Section 3) and stellar activity modelling for exoplanet detection (Section 4) — that demonstrate the full workflow from kernel design through hyperparameter optimisation to posterior prediction.


1. The Practical GP Workflow

A complete GP analysis follows a workflow that mirrors the Bayesian workflow from lecture 4, adapted for the function-space setting:

  1. Choose a kernel that encodes your prior beliefs about the data: smooth trend? periodic component? multiple scales? Start simple.
  2. Set initial hyperparameters from physical reasoning (what length scale and amplitude are plausible?) and check with prior samples.
  3. Optimise hyperparameters by maximising the log-marginal likelihood (Type-II maximum likelihood) — or sample them with MCMC for full uncertainty propagation.
  4. Predict at test locations using the conditioning equations from the previous lecture.
  5. Check the results: do posterior samples look like the data? Are the residuals consistent with the noise model? Does the uncertainty make physical sense?

The rest of this lecture works through each step in detail.
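As a quick orientation, the whole loop fits in a few lines with scikit-learn's GaussianProcessRegressor. This is a sketch on toy data; the kernel choice and noise levels are illustrative, and later sections unpack each step properly:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

rng = np.random.default_rng(42)

# Steps 1-2: kernel encoding prior beliefs, with plausible initial values
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.05)

# Toy data
X = np.sort(rng.uniform(0, 10, 30)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.2 * rng.normal(size=30)

# Step 3: fit() maximises the log-marginal likelihood internally
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5, random_state=0)
gp.fit(X, y)

# Step 4: predict at test locations
X_star = np.linspace(0, 10, 200).reshape(-1, 1)
mu, std = gp.predict(X_star, return_std=True)

# Step 5: check -- residuals should be consistent with the noise model
resid = y - gp.predict(X)
print(gp.kernel_)     # optimised hyperparameters
print(resid.std())    # should be comparable to the 0.2 noise level
```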


2. Hyperparameter Optimisation

2.1 The Log-Marginal Likelihood

The kernel hyperparameters — length scale $\ell$, signal variance $\sigma_f^2$, noise variance $\sigma_n^2$, period $P$, etc. — control the GP’s behaviour. We need a principled way to set them. The answer is the log-marginal likelihood (also called the log-evidence), which we derived in the previous lecture:

$$\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi}) = -\frac{1}{2}\mathbf{y}^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} - \frac{1}{2}\log|\mathbf{K} + \sigma_n^2 \mathbf{I}| - \frac{N}{2}\log 2\pi$$

where $\boldsymbol{\psi} = (\ell, \sigma_f^2, \sigma_n^2, \ldots)$ are the hyperparameters. This expression balances three terms:

| Term | Role | What it rewards |
|---|---|---|
| $-\frac{1}{2}\mathbf{y}^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$ | Data fit | Models that explain the observations |
| $-\frac{1}{2}\log\lvert\mathbf{K} + \sigma_n^2 \mathbf{I}\rvert$ | Complexity penalty | Simpler models (smaller effective covariance) |
| $-\frac{N}{2}\log 2\pi$ | Normalisation | Constant, independent of the hyperparameters |

The data-fit term favours flexible models that can fit the data closely. The complexity penalty term favours simple models with small covariance. The balance between them is an automatic Occam’s razor — exactly the same mechanism we discussed for Bayesian model comparison in lecture 9, but applied continuously to the hyperparameter space rather than discretely to a set of models.
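You can watch this trade-off numerically. The sketch below (toy data, RBF kernel, illustrative hyperparameter values) evaluates the data-fit and complexity terms for three length scales; their sum typically peaks at an intermediate value:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 40)
y = np.sin(X) + 0.1 * rng.normal(size=40)

def lml_terms(length_scale, variance=1.0, noise=0.01):
    """Data-fit and complexity terms of the log-marginal likelihood."""
    K = variance * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / length_scale**2)
    L = np.linalg.cholesky(K + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha, -np.sum(np.log(np.diag(L)))

totals = {}
for ell in [0.1, 1.0, 10.0]:
    fit, complexity = lml_terms(ell)
    totals[ell] = fit + complexity
    print(f"ell={ell:5.1f}  data fit={fit:9.2f}  complexity={complexity:9.2f}")
```

A very short length scale treats the observations as nearly independent (poor data fit, tiny penalty); a very long one cannot follow the sinusoid at all; the intermediate value wins the balance.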

ℹ️
The marginal likelihood is “marginal” because the function values $\mathbf{f}$ have been marginalised out. You are not optimising a likelihood over function values — you are optimising the probability of the data given the hyperparameters, with the infinite-dimensional function integrated away. This is why the GP can be flexible without overfitting: the complexity is controlled by the hyperparameters, not by the (effectively infinite) number of function values.

2.2 Gradients of the Marginal Likelihood

The marginal likelihood is a smooth function of the hyperparameters, and its gradient can be computed analytically. For a hyperparameter $\psi_j$:

$$\frac{\partial}{\partial \psi_j} \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi}) = \frac{1}{2}\text{tr}\left[(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - (\mathbf{K} + \sigma_n^2\mathbf{I})^{-1})\frac{\partial (\mathbf{K} + \sigma_n^2\mathbf{I})}{\partial \psi_j}\right]$$

where $\boldsymbol{\alpha} = (\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y}$. This enables gradient-based optimisation methods like L-BFGS-B, which are much more efficient than gradient-free methods for this problem.
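A standard sanity check on the gradient expression is to compare it against central finite differences. A minimal sketch for the RBF length scale (a dense inverse is used purely for clarity, not efficiency):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 10, 25))
y = np.sin(X) + 0.1 * rng.normal(size=25)
noise = 0.05

def K_y(ell):
    """RBF kernel matrix plus noise variance on the diagonal."""
    return np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2) + noise * np.eye(len(X))

def log_ml(ell):
    L = np.linalg.cholesky(K_y(ell))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

def grad_log_ml(ell):
    Kinv = np.linalg.inv(K_y(ell))
    alpha = Kinv @ y
    r2 = (X[:, None] - X[None, :])**2
    dK = np.exp(-0.5 * r2 / ell**2) * r2 / ell**3   # dK/d(ell) for the RBF kernel
    return 0.5 * np.trace((np.outer(alpha, alpha) - Kinv) @ dK)

ell, eps = 1.3, 1e-6
g = grad_log_ml(ell)
fd = (log_ml(ell + eps) - log_ml(ell - eps)) / (2 * eps)
print(g, fd)   # analytic and finite-difference gradients agree
```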

2.3 Type-II ML vs Full Bayesian Inference

Optimising the marginal likelihood gives point estimates of the hyperparameters — the single set of values that maximises $\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi})$. This is called Type-II maximum likelihood (or empirical Bayes). It is fast and works well when the data strongly constrain the hyperparameters.

The alternative is full Bayesian inference: place priors on the hyperparameters and use MCMC to sample from the posterior $p(\boldsymbol{\psi} \mid \mathbf{y}, \mathbf{X})$. Predictions are then averaged over hyperparameter samples, which properly propagates hyperparameter uncertainty into the predictions.

Figure 6: Left: Type-II ML gives a single point estimate of the length scale, ignoring the width of the marginal likelihood peak. Right: full Bayesian inference via MCMC samples the posterior over the length scale, capturing the full uncertainty. The difference matters most when the data are sparse or when the marginal likelihood has multiple modes.

When should you use each approach?

| Situation | Recommendation |
|---|---|
| Large dataset, well-constrained hyperparameters | Type-II ML is sufficient |
| Small dataset, uncertain hyperparameters | Full Bayesian (MCMC) |
| Marginal likelihood has multiple modes | Full Bayesian (MCMC) |
| Computational budget is tight | Type-II ML |
| Downstream decisions depend on uncertainty | Full Bayesian (MCMC) |

2.4 Practical Optimisation

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def neg_log_marginal_likelihood(log_params, X, y):
    """Negative log marginal likelihood for RBF kernel.

    Parameters are in log-space to enforce positivity.
    """
    length_scale = np.exp(log_params[0])
    variance = np.exp(log_params[1])
    noise_var = np.exp(log_params[2])

    N = len(X)
    sqdist = cdist(X.reshape(-1, 1), X.reshape(-1, 1), 'sqeuclidean')
    K = variance * np.exp(-0.5 * sqdist / length_scale**2)
    K_y = K + noise_var * np.eye(N)

    # Cholesky decomposition
    try:
        L = np.linalg.cholesky(K_y)
    except np.linalg.LinAlgError:
        return 1e10  # return large value if not positive definite

    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    # Log marginal likelihood
    log_ml = -0.5 * y @ alpha
    log_ml -= np.sum(np.log(np.diag(L)))
    log_ml -= 0.5 * N * np.log(2 * np.pi)

    return -log_ml  # negative because we minimise

# Optimise
X_train = np.array([1.0, 3.0, 5.0, 6.0, 8.0])
y_train = np.sin(X_train) + np.random.normal(0, 0.2, len(X_train))

# Initial guess: log(length_scale), log(variance), log(noise_var)
log_p0 = np.array([0.0, 0.0, -2.0])

result = minimize(neg_log_marginal_likelihood, log_p0,
                  args=(X_train, y_train), method='L-BFGS-B')

ell_opt, var_opt, noise_opt = np.exp(result.x)
print(f"Optimised: ell={ell_opt:.3f}, var={var_opt:.3f}, noise={noise_opt:.3f}")
⚠️
Always optimise in log-space. The hyperparameters $\ell$, $\sigma_f^2$, and $\sigma_n^2$ must be positive. Optimising their logarithms enforces this constraint without needing bounded optimisation. This also improves the conditioning of the optimisation problem, since the marginal likelihood often varies over many orders of magnitude in the hyperparameters.
⚠️
The marginal likelihood can have local optima. Particularly for composite kernels with many hyperparameters, L-BFGS-B may find a local maximum that is not the global one. Run the optimiser from multiple random initialisations and compare results. If the best optima have similar marginal likelihood values but very different hyperparameters, this is a sign that full Bayesian inference over hyperparameters is needed.
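A multi-start loop is only a few lines. This sketch (toy data, RBF kernel in log-space) collects every optimum so their marginal likelihood values can be compared:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, 30))
y = np.sin(X) + 0.2 * rng.normal(size=30)

def nll(log_params):
    """Negative log-marginal likelihood for an RBF kernel (log-space params)."""
    ell, var, noise = np.exp(log_params)
    K = var * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2) \
        + noise * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return 1e10
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) \
        + 0.5 * len(X) * np.log(2 * np.pi)

# Restart from random initialisations; keep every optimum for comparison
results = [minimize(nll, rng.normal(0, 1, 3), method='L-BFGS-B') for _ in range(8)]
best = min(results, key=lambda r: r.fun)

print("best NLL:", round(best.fun, 2), "hyperparameters:", np.exp(best.x))
print("all optima:", sorted(round(r.fun, 2) for r in results))
```

If several of the printed optima have similar NLL but `np.exp(r.x)` differs wildly between them, that is the multimodality warning sign described above.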

3. Extended Example: Atmospheric CO$_2$

This example, adapted from the george documentation (Foreman-Mackey 2015), demonstrates the full GP workflow on a dataset with rich structure: atmospheric CO$_2$ measurements showing a long-term trend, seasonal oscillations, and irregular short-term variations.

3.1 The Data

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.datasets import co2

data = co2.load_pandas().data
t = 2000 + (np.array(data.index.to_julian_date()) - 2451545.0) / 365.25
y = np.array(data.co2)

# Clean and subsample
mask = np.isfinite(t) & np.isfinite(y) & (t < 1996)
t, y = t[mask][::4], y[mask][::4]

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(t, y, '.k', markersize=3)
ax.set_xlabel('Year')
ax.set_ylabel(r'CO$_2$ (ppm)')
ax.set_title('Mauna Loa CO$_2$ Record')
plt.tight_layout()

The data show three distinct features: (1) a smooth, accelerating upward trend, (2) a seasonal oscillation with period 1 year, and (3) irregular short-term variability. A single kernel cannot capture all three — we need a composite kernel.

3.2 Designing the Kernel

We construct a kernel with four components, each targeting a different feature of the data:

$$k(x, x') = k_1(x, x') + k_2(x, x') + k_3(x, x') + k_4(x, x')$$

Component 1 — Long-term trend (squared exponential with long length scale):

$$k_1(r) = \theta_1^2 \exp\left(-\frac{r^2}{2\theta_2^2}\right)$$

This captures the smooth, multi-decade rise in CO$_2$. The length scale $\theta_2$ should be many years.

Component 2 — Seasonal variation (squared exponential $\times$ periodic):

$$k_2(r) = \theta_3^2 \exp\left(-\frac{r^2}{2\theta_4^2} - \theta_5 \sin^2(\pi r / \theta_6)\right)$$

This is a quasi-periodic kernel: periodic with period $\theta_6 \approx 1$ year, modulated by an envelope of length scale $\theta_4$. The parameter $\theta_5$ controls the “sharpness” of the periodicity.

Component 3 — Medium-term irregularities (rational quadratic):

$$k_3(r) = \theta_7^2 \left(1 + \frac{r^2}{2\theta_8 \theta_9^2}\right)^{-\theta_8}$$

The rational quadratic kernel is equivalent to an infinite mixture of squared exponential kernels with different length scales. It captures structure at multiple scales.

Component 4 — Short-term noise (squared exponential with short length scale + white noise):

$$k_4(r) = \theta_{10}^2 \exp\left(-\frac{r^2}{2\theta_{11}^2}\right) + \theta_{12}^2 \delta_{ij}$$

This handles short-term correlated noise and uncorrelated measurement noise.

ℹ️
This kernel has 12 hyperparameters. That sounds like a lot — but each one has a clear physical interpretation, and the marginal likelihood provides a principled way to set them all simultaneously. The alternative — choosing a parametric model with 12 parameters that captures a nonlinear trend, seasonal oscillations, multi-scale variability, and noise — would be much harder to justify physically.

3.3 Implementation with george

import george
from george import kernels

# Component 1: long-term smooth trend
k1 = 66.0**2 * kernels.ExpSquaredKernel(metric=67.0**2)

# Component 2: seasonal (quasi-periodic)
k2 = 2.4**2 * kernels.ExpSquaredKernel(90.0**2) \
     * kernels.ExpSine2Kernel(gamma=2.0 / 1.3**2, log_period=0.0)

# Component 3: medium-term irregularities (rational quadratic)
k3 = 0.66**2 * kernels.RationalQuadraticKernel(
     log_alpha=np.log(0.78), metric=1.2**2)

# Component 4: short-term noise
k4 = 0.18**2 * kernels.ExpSquaredKernel(1.6**2)

kernel = k1 + k2 + k3 + k4

gp = george.GP(kernel, mean=np.mean(y), fit_mean=True,
               white_noise=np.log(0.19**2), fit_white_noise=True)
gp.compute(t)
print(f"Initial log-likelihood: {gp.log_likelihood(y):.2f}")

3.4 Optimising the Hyperparameters

import scipy.optimize as op

def nll(p):
    gp.set_parameter_vector(p)
    ll = gp.log_likelihood(y, quiet=True)
    return -ll if np.isfinite(ll) else 1e25

def grad_nll(p):
    gp.set_parameter_vector(p)
    return -gp.grad_log_likelihood(y, quiet=True)

p0 = gp.get_parameter_vector()
result = op.minimize(nll, p0, jac=grad_nll, method="L-BFGS-B")
gp.set_parameter_vector(result.x)

print(f"Optimised log-likelihood: {gp.log_likelihood(y):.2f}")

3.5 Predictions from Optimised Hyperparameters

With the hyperparameters fixed at their optimised values, predictions are straightforward:

x_pred = np.linspace(max(t), 2025, 2000)
mu, var = gp.predict(y, x_pred, return_var=True)
std = np.sqrt(var)

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(t, y, s=1, c='k', label='Observations')
ax.fill_between(x_pred, mu - std, mu + std, color='g', alpha=0.5,
                label=r'$\pm 1\sigma$')
ax.plot(x_pred, mu, 'g-', linewidth=1)
ax.set_xlim(t.min(), 2025)
ax.set_xlabel('Year')
ax.set_ylabel(r'CO$_2$ (ppm)')
ax.legend()
plt.tight_layout()

The GP correctly extrapolates the trend and seasonal pattern into the future, with uncertainty that grows with distance from the last observation. However, the uncertainty bands from Type-II ML are narrower than they should be, because they condition on a single set of hyperparameters.

3.6 Full Bayesian Inference with MCMC

To properly account for hyperparameter uncertainty, we sample the posterior over hyperparameters with emcee:

import emcee

def lnprob(p):
    # Broad uniform prior on log-hyperparameters
    if np.any((-100 > p[1:]) + (p[1:] > 100)):
        return -np.inf
    gp.set_parameter_vector(p)
    return gp.log_likelihood(y, quiet=True)

gp.compute(t)
ndim = len(gp)
nwalkers = 36

p0 = gp.get_parameter_vector() + 1e-4 * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob)
state = sampler.run_mcmc(p0, 200)  # burn-in
sampler.reset()                    # discard the burn-in samples
sampler.run_mcmc(state, 200)       # production

# Predict from posterior samples
fig, ax = plt.subplots(figsize=(10, 5))
x_pred = np.linspace(max(t), 2025, 250)

for i in range(50):
    w = np.random.randint(sampler.chain.shape[0])
    n = np.random.randint(sampler.chain.shape[1])
    gp.set_parameter_vector(sampler.chain[w, n])
    ax.plot(x_pred, gp.sample_conditional(y, x_pred), 'g', alpha=0.1)

ax.scatter(t, y, c='k', s=1)
ax.set_xlim(t.min(), 2025)
ax.set_xlabel('Year')
ax.set_ylabel(r'CO$_2$ (ppm)')
ax.set_title('Posterior Predictions (MCMC over hyperparameters)')
plt.tight_layout()
ℹ️
The MCMC predictions show wider uncertainty than the point-estimate predictions. This is because each draw uses a different set of hyperparameters — some with longer length scales (producing smoother extrapolations), some with shorter ones (producing more variable extrapolations). The spread of green lines is the honest uncertainty, accounting for the fact that we do not know the kernel hyperparameters exactly.

4. Astronomical Application: Stellar Activity and Exoplanet Detection

4.1 The Problem

Radial velocity (RV) measurements are corrupted by signals from the host star’s surface activity — spots, faculae, and convective motions imprint quasi-periodic variations with amplitudes comparable to or larger than planetary signals. The stellar rotation period sets the characteristic timescale, but the signal is not perfectly periodic: active regions evolve, appear, and disappear on timescales of weeks to months.

This is a textbook application for GPs. The stellar activity signal is:

  • Quasi-periodic: it has a characteristic period (the stellar rotation) but is not exactly repeating.
  • Correlated: consecutive observations are not independent.
  • Nuisance: we want to model it to remove it, not to study it.

4.2 The Quasi-Periodic Kernel for Stellar Activity

The standard kernel for stellar activity in RV work is a quasi-periodic kernel:

$$k_{\text{QP}}(t, t') = \eta_1^2 \exp\left(-\frac{(t - t')^2}{2\eta_2^2} - \frac{2\sin^2(\pi |t - t'| / \eta_3)}{\eta_4^2}\right)$$

The four hyperparameters are:

  • $\eta_1$: amplitude of the activity signal (m/s)
  • $\eta_2$: evolution timescale — how quickly the active regions change (days)
  • $\eta_3$: stellar rotation period (days)
  • $\eta_4$: smoothness of the periodic component — controls how sinusoidal the rotation signal is

The total model for the RV time series is:

$$\text{RV}(t) = v_0 + K\cos\left(\frac{2\pi t}{P_{\text{planet}}} + \phi\right) + f_{\text{GP}}(t) + \varepsilon(t)$$

where $f_{\text{GP}}(t)$ is the GP component with the quasi-periodic kernel. The planet parameters $(K, P_{\text{planet}}, \phi)$ are inferred simultaneously with the GP hyperparameters.
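A sketch of these model components in NumPy (function names and hyperparameter values are illustrative, not taken from any particular package): the quasi-periodic kernel, the deterministic circular-orbit term, and a simulated observation built as their sum:

```python
import numpy as np

def qp_kernel(t1, t2, eta1, eta2, eta3, eta4):
    """Quasi-periodic stellar-activity kernel k_QP from above."""
    tau = np.abs(t1[:, None] - t2[None, :])
    return eta1**2 * np.exp(-tau**2 / (2 * eta2**2)
                            - 2 * np.sin(np.pi * tau / eta3)**2 / eta4**2)

def rv_model(t, v0, K_amp, P_planet, phi):
    """Deterministic part of the RV model: offset plus circular Keplerian."""
    return v0 + K_amp * np.cos(2 * np.pi * t / P_planet + phi)

# Simulate: a 3 m/s planet on a 12 d orbit around a 25 d rotator whose
# activity signal (5 m/s amplitude, 30 d spot lifetime) dominates the planet
t = np.linspace(0, 100, 200)
mean = rv_model(t, v0=0.0, K_amp=3.0, P_planet=12.0, phi=0.3)
K_act = qp_kernel(t, t, eta1=5.0, eta2=30.0, eta3=25.0, eta4=0.5)

rng = np.random.default_rng(3)
activity = rng.multivariate_normal(np.zeros(len(t)), K_act + 1e-8 * np.eye(len(t)))
rv_obs = mean + activity + rng.normal(0, 0.5, len(t))   # 0.5 m/s white noise
```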

Figure 7: Top: simulated RV data containing both a planetary signal (red) and stellar activity (orange). The observed data (black points) show the combined signal, making the planet hard to detect by eye. Bottom: after subtracting the GP model of stellar activity, the planetary signal is recovered cleanly. The GP captures the quasi-periodic stellar signal without absorbing the Keplerian orbit.

4.3 Why Not Just Filter?

You might wonder: why not just apply a high-pass filter or subtract a running mean? The answer is that such approaches do not properly propagate uncertainty. A GP:

  1. Models the correlation structure — it knows that a data point 5 days from now is more predictable than one 50 days from now.
  2. Propagates uncertainty — the detrended RV inherits uncertainty from the imperfect knowledge of the activity model.
  3. Jointly fits the planet and activity — rather than removing the activity first and fitting the planet second (which biases both), the GP fits everything simultaneously.
⚠️
GPs can absorb real signals. If the planet period is close to the stellar rotation period (or a harmonic), the GP may attribute the planetary signal to stellar activity. This is a fundamental limitation, not a bug — when two signals are indistinguishable in the data, no method can separate them. Always check whether the GP is absorbing your signal of interest by comparing fits with and without the planet component.

5. Computational Considerations

5.1 The $\mathcal{O}(N^3)$ Bottleneck

The dominant cost in GP regression is the Cholesky decomposition of the $N \times N$ matrix $\mathbf{K} + \sigma_n^2 \mathbf{I}$, which scales as $\mathcal{O}(N^3)$. Storage is $\mathcal{O}(N^2)$. For $N = 1{,}000$, this is fine (milliseconds). For $N = 10{,}000$, it takes seconds. For $N = 100{,}000$, it is prohibitive on a single machine.

| $N$ | Cholesky time | Storage |
|---|---|---|
| $10^2$ | ~ms | ~80 KB |
| $10^3$ | ~100 ms | ~8 MB |
| $10^4$ | ~10 s | ~800 MB |
| $10^5$ | ~3 hours | ~80 GB |

5.2 Scalable Approximations

Several approaches reduce the cost:

Sparse/inducing-point methods. Choose $M \ll N$ “inducing points” $\mathbf{Z}$ and approximate the full GP by conditioning on these points. Cost: $\mathcal{O}(NM^2)$. Examples: FITC, VFE (variational free energy). Implemented in GPy, GPflow, and GPyTorch.

Structured kernel interpolation (SKI/KISS-GP). Exploit Kronecker and Toeplitz structure in the kernel matrix when data are on a grid or near-grid. Cost: $\mathcal{O}(N \log N)$ or even $\mathcal{O}(N)$. Implemented in GPyTorch.

Celerite. For one-dimensional inputs (time series!) and kernels that can be written as a mixture of exponentials, exact GP computations can be done in $\mathcal{O}(N)$ time and $\mathcal{O}(N)$ storage. This is the method of choice for astronomical time-series analysis (light curves, radial velocities, power spectra). The celerite2 package implements this.

$$k(\tau) = \sum_j e^{-c_j \tau}\left[a_j \cos(d_j \tau) + b_j \sin(d_j \tau)\right], \qquad \tau = |t_n - t_m|$$

Any kernel that can be written in this form — including the quasi-periodic kernel commonly used for stellar activity — admits $\mathcal{O}(N)$ exact inference.
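To see why such kernels permit linear-time inference, consider the simplest member of the family: a single term with no oscillation, i.e. the exponential (Ornstein–Uhlenbeck) kernel $k(\tau) = \sigma^2 e^{-\tau/\ell}$. The corresponding process is Markov, so its exact log-likelihood factorises over consecutive points. The sketch below (noise-free case for brevity; celerite handles the general noisy case) verifies the $\mathcal{O}(N)$ factorisation against the $\mathcal{O}(N^3)$ dense computation:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
t = np.sort(rng.uniform(0, 50, 300))
sigma2, ell = 1.5, 4.0

# Draw a noise-free function from the OU kernel k(tau) = sigma2 * exp(-tau / ell)
K = sigma2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)
f = rng.multivariate_normal(np.zeros(len(t)), K)

# O(N^3): dense multivariate-normal log-density
direct = multivariate_normal.logpdf(f, mean=np.zeros(len(t)), cov=K)

# O(N): the OU process is Markov (an AR(1) process in disguise), so the
# joint density factorises over consecutive points
logp = -0.5 * (np.log(2 * np.pi * sigma2) + f[0]**2 / sigma2)
for i in range(1, len(t)):
    phi = np.exp(-(t[i] - t[i - 1]) / ell)   # correlation with the previous point
    var = sigma2 * (1 - phi**2)              # conditional variance
    resid = f[i] - phi * f[i - 1]
    logp += -0.5 * (np.log(2 * np.pi * var) + resid**2 / var)

print(direct, logp)   # the two agree
```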

ℹ️
For most astronomical time-series problems, use celerite2. It provides exact GP inference (no approximations) in linear time. The limitation is that it only works for 1D inputs and a restricted (but large) family of kernels. For multi-dimensional inputs or exotic kernels, use sparse approximations via GPyTorch or tinygp.

5.3 Numerical Stability

Even when $N$ is manageable, numerical issues can arise:

  1. Ill-conditioned kernel matrices. When data points are very close together relative to the length scale, the kernel matrix has near-duplicate rows, leading to small eigenvalues and numerical instability. Fix: add a jitter term ($10^{-6}$ to $10^{-8}$ on the diagonal).

  2. Overflow/underflow in the determinant. The determinant $|\mathbf{K}|$ itself can overflow or underflow floating point even for moderate $N$. Always compute the log-determinant as $2\sum_i \log L_{ii}$ from the Cholesky factor, never via np.linalg.det.

  3. Gradient instability. Near local optima, the gradients of the marginal likelihood can have large condition numbers. Use double precision and consider transforming hyperparameters to log-space.
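Points 1 and 2 can be demonstrated in a few lines (synthetic near-duplicate inputs; the jitter value is illustrative):

```python
import numpy as np

# Near-duplicate inputs relative to the length scale make K ill-conditioned
x = np.concatenate([np.linspace(0, 10, 50), np.linspace(0, 10, 50) + 1e-9])
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)   # RBF, length scale 1

# Point 1: without jitter the Cholesky factorisation typically fails,
# because near-duplicate rows push eigenvalues below machine precision
try:
    np.linalg.cholesky(K)
    ok = True
except np.linalg.LinAlgError:
    ok = False
print("Cholesky without jitter succeeded?", ok)

# A small jitter term on the diagonal restores positive definiteness
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))

# Point 2: get the log-determinant from the Cholesky factor; the raw
# determinant underflows towards zero long before the log-determinant misbehaves
logdet = 2 * np.sum(np.log(np.diag(L)))
print("log|K + jitter| =", logdet)
print("np.linalg.det   =", np.linalg.det(K + 1e-6 * np.eye(len(x))))
```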


6. Limitations and When Not to Use GPs

6.1 GPs Are Expensive

Even with celerite2, GPs are more expensive than parametric models. If you know the functional form (e.g., a Keplerian orbit, a blackbody spectrum), use it. GPs are most valuable when the functional form is unknown or when there is a nuisance signal whose parametric form you cannot write down.

6.2 GPs Assume Stationarity (Usually)

Most commonly used kernels are stationary — they assume the function has the same statistical properties everywhere. If the noise level changes with time, or the signal amplitude grows, a stationary GP will either overfit the high-variability regions or underfit the low-variability ones. Solutions include input warping, non-stationary kernels, or splitting the data into segments.

6.3 GPs Extrapolate Poorly

A GP with a stationary kernel reverts to the prior mean with prior variance as you move away from the data. This is honest — but it means the GP cannot extrapolate trends. The CO$_2$ example works because the long-length-scale kernel component captures the trend within the training range and projects it forward. But this projection is entirely driven by the kernel — it has no physical basis. If the true trend changes character, the GP will not predict it.

6.4 Kernel Misspecification

The kernel is the prior. If you use a squared exponential kernel but the true function is rough (e.g., a step function), the GP will produce poor predictions with overconfident uncertainty bands. Conversely, if you use a Matern-1/2 kernel for a very smooth function, you waste information. There is no free lunch — you must think about what kernel is appropriate for your data.

6.5 The Curse of Dimensionality

GPs work well in 1D (time series) and reasonably well in 2D or 3D (spatial data). In higher dimensions, the kernel matrix becomes increasingly sparse (most points are “far” from each other), and the data requirements grow exponentially. For high-dimensional problems, neural network-based approaches or structured kernels are usually necessary.


7. Connections to Other Methods

7.1 GPs and Splines

Cubic smoothing splines are a special case of GP regression: the spline fit is the posterior mean of a GP whose prior is an integrated Wiener process. The “smoothing parameter” in spline fitting corresponds to the ratio of noise variance to signal variance in the GP. The GP generalisation provides uncertainty quantification that splines do not.

7.2 GPs and Kriging

In geostatistics, GP regression is called kriging. The semivariogram used in kriging is directly related to the kernel: $\gamma(r) = k(0) - k(r)$. If you encounter kriging in the spatial statistics literature, it is the same thing with different notation.

7.3 GPs and Neural Networks

In the limit of infinitely wide layers, Bayesian neural networks converge to Gaussian processes (Neal 1996). This connection has driven much recent research on “neural tangent kernels” and infinite-width limits. In practice, GPs and neural networks occupy different niches: GPs for small-to-medium datasets where calibrated uncertainty matters, neural networks for large datasets where raw predictive performance matters.

7.4 GPs and Fourier Methods

For stationary kernels, the kernel function and the power spectral density are Fourier transform pairs (Bochner’s theorem). A squared exponential kernel corresponds to a Gaussian power spectrum. A Matern-3/2 kernel corresponds to a power spectrum that falls off as $\omega^{-4}$. This connection allows you to design kernels from knowledge of the power spectrum of your signal.
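This transform pair is easy to verify numerically. The sketch below Fourier-transforms a squared exponential kernel by direct quadrature and compares against the analytic Gaussian spectrum $S(\omega) = \sigma_f^2 \ell \sqrt{2\pi}\, e^{-\ell^2 \omega^2 / 2}$:

```python
import numpy as np

ell, sigma_f2 = 1.5, 2.0

# Squared exponential kernel sampled on a grid wide enough that it has
# decayed to zero at the endpoints
tau = np.linspace(-40, 40, 40001)
dtau = tau[1] - tau[0]
k = sigma_f2 * np.exp(-0.5 * tau**2 / ell**2)

# Numerical Fourier transform (the kernel is even, so cosines suffice)
omega = np.linspace(0, 3, 50)
S_num = np.array([np.sum(k * np.cos(w * tau)) * dtau for w in omega])

# Analytic power spectral density: a Gaussian in omega
S_analytic = sigma_f2 * ell * np.sqrt(2 * np.pi) * np.exp(-0.5 * ell**2 * omega**2)

print(np.max(np.abs(S_num - S_analytic)))   # agreement to numerical precision
```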


8. Another Astronomical Application: Light Curve Interpolation

Supernova cosmology requires estimating the peak brightness of each supernova from irregularly sampled photometry, often with gaps during bad weather or when the object is behind the Sun. GPs provide a principled way to interpolate across these gaps with honest uncertainty.

Figure 8: GP interpolation of a supernova light curve with an observing gap (red shaded region) around peak brightness. The GP (blue line) interpolates smoothly across the gap, with the uncertainty band (shaded blue) widening where data are absent. The estimated peak time and magnitude are marked. The GP does not assume a parametric light curve template — the shape is learned from the data and the kernel.

The key advantage of GPs for light curve interpolation is that the uncertainty in the interpolated peak brightness includes the uncertainty from the gap. A template-fitting approach would give you a peak brightness estimate, but its uncertainty would depend on the assumed template shape. The GP uncertainty depends only on the data and the kernel — it automatically accounts for the fact that less data means more uncertainty.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import kv, gamma as gamma_func

def matern_kernel(X1, X2, nu=2.5, length_scale=5.0, variance=0.5):
    """Matern kernel for supernova light curve smoothness."""
    dist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'euclidean')
    dist = np.maximum(dist, 1e-8)
    const = 2**(1 - nu) / gamma_func(nu)
    scaled = np.sqrt(2 * nu) * dist / length_scale
    return variance * const * scaled**nu * kv(nu, scaled)

# Observations (irregular cadence with gap around peak)
t_obs = np.sort(np.concatenate([
    np.random.uniform(0, 15, 8),
    np.random.uniform(25, 50, 12)
]))
# ... (simulate photometry, fit GP, predict through gap)

9. GP Software Ecosystem

A brief guide to the major GP libraries relevant for astronomical work:

| Library | Language | Strengths | Best for |
|---|---|---|---|
| george | Python (C++) | Simple API, fast for moderate $N$ | Learning, moderate datasets |
| celerite2 | Python/C++/JAX | $\mathcal{O}(N)$ for 1D time series | Astronomical time series |
| GPyTorch | Python (PyTorch) | Scalable, GPU support, variational | Large datasets, GPU |
| tinygp | Python (JAX) | JAX-native, composable, fast | JAX-based workflows |
| scikit-learn | Python | Easy to use, well-documented | Quick prototyping |
| GPy | Python | Feature-rich, many kernels | Research, exploration |
| GPflow | Python (TensorFlow) | Variational inference, scalable | Production, large data |

For this course, george or tinygp are recommended for learning, and celerite2 for any serious time-series analysis.


10. Summary

Hyperparameter optimisation is performed by maximising the log-marginal likelihood, which automatically balances data fit against model complexity. The gradient is available analytically, enabling efficient optimisation with L-BFGS-B. For full uncertainty propagation, use MCMC over the hyperparameters.

The CO$_2$ example demonstrates composite kernel design: a sum of four kernels, each targeting a different feature of the data (trend, seasonality, multi-scale irregularities, noise). The marginal likelihood optimisation sets 12+ hyperparameters simultaneously, and MCMC provides honest uncertainty in predictions.

Stellar activity modelling uses a quasi-periodic kernel to separate correlated stellar signals from planetary signals in radial velocity data. The GP fits the activity and planet simultaneously, properly propagating uncertainty from the imperfect activity model into the planet parameter estimates.

Computational cost scales as $\mathcal{O}(N^3)$ for general kernels. For 1D time series, celerite2 provides exact $\mathcal{O}(N)$ inference for a broad class of kernels. Sparse approximations extend GPs to larger datasets at the cost of approximation error.

Limitations: GPs are expensive, assume stationarity (usually), extrapolate poorly, and are sensitive to kernel choice. They are most valuable when the functional form is unknown, when calibrated uncertainties matter, and when the data are one- or low-dimensional.

The GP is not magic — it is a principled way to encode and update beliefs about smooth functions. The kernel is the prior, the conditioning equations are Bayes’ theorem, and the marginal likelihood is the evidence. Everything connects back to the Bayesian framework you have been building all course.


Further Reading

  • Rasmussen, C. E. & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. [Chapters 2 and 5 cover regression and model selection — freely available at gaussianprocess.org/gpml]
  • Foreman-Mackey, D., Agol, E., Ambikasaran, S., & Angus, R. (2017). Fast and scalable Gaussian process modeling with applications to astronomical time series. AJ, 154, 220. [The celerite paper — essential reading for astronomical applications]
  • Rajpaul, V., Aigrain, S., Osborne, M. A., Reece, S., & Roberts, S. (2015). A Gaussian process framework for modelling stellar activity signals in radial velocity data. MNRAS, 452, 2269.
  • Aigrain, S. & Foreman-Mackey, D. (2023). Gaussian process regression for astronomical time series. ARAA, 61, 329. [A comprehensive review of GPs in astronomy]
  • Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A Visual Exploration of Gaussian Processes. Distill. [Interactive visualisations — work through them]