Comparing And Selecting Models
All models are wrong. Some are useful. But which ones?
Throughout this course you have built models of increasing sophistication: straight lines (lecture 5), polynomial and basis-function expansions (lecture 7), Gaussian processes (lectures 10–11), hierarchical models (lecture 12), and mixture models (lecture 13). Each time, the scientific question was about the parameters within a model. But there is a prior question that we have been deferring: given two or more competing models, which one should you prefer?
This is the model comparison problem, and it is unavoidable. Is the radial velocity signal from a planet or from stellar activity? Does a colour-magnitude diagram need two Gaussian components or three? Is a quadratic fit justified over a straight line, or is the improvement just noise fitting? Every time you add a parameter, change a likelihood, or swap one functional form for another, you are implicitly comparing models. This lecture makes that comparison explicit and principled.
The core tension is simple: complex models fit better, but simple models generalise better. Every framework for model comparison—Bayesian evidence, information criteria, cross-validation—is a different way of navigating that tension. Understanding when they agree, when they disagree, and why is the point of this lecture.
1. Why Model Comparison Is Hard
1.1 More Parameters Always Improve the Fit
Consider fitting a polynomial of degree $d$ to $N$ data points. As $d$ increases, the residual sum of squares decreases monotonically—a degree-$(N-1)$ polynomial passes through every data point exactly, giving $\chi^2 = 0$. By the metric of fit quality alone, the most complex model always wins.
This is not a useful criterion. A degree-19 polynomial fit to 20 noisy points has memorised the noise, not learned the signal. Its predictions at new $x$ values will be catastrophically wrong.
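A quick sketch (synthetic data, assumed setup) makes the point: even when the true signal is a straight line, the residual sum of squares can only decrease as the polynomial degree grows.

```python
import numpy as np

# Sketch: data generated from a straight line, fit with increasing degree.
# The residual sum of squares decreases monotonically regardless of the truth.
rng = np.random.default_rng(0)
N = 20
x = np.linspace(-1, 1, N)
y = 2.0 + 1.5 * x + rng.normal(0, 0.3, N)  # true model is linear

rss_values = []
for d in [1, 3, 5, 9]:
    coeffs = np.polyfit(x, y, d)
    rss = np.sum((y - np.polyval(coeffs, x))**2)
    rss_values.append(rss)
    print(f"degree {d}: RSS = {rss:.4f}")
```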
1.2 The Maximum Likelihood Is Not Enough
You might hope to compare models by their maximum log-likelihood:
$$\log \mathcal{L}_\text{max}(\mathcal{M}_1) \quad \text{vs.} \quad \log \mathcal{L}_\text{max}(\mathcal{M}_2)$$
But the model with more parameters will always achieve a higher (or equal) maximum likelihood, because it has more freedom to fit the data. The maximum likelihood does not penalise complexity—it rewards it.
1.3 What We Need
A model comparison framework must balance two competing demands:
- Goodness of fit: the model should explain the observed data
- Parsimony: the model should not use more structure than the data require
Every method in this lecture implements this balance differently. Bayesian evidence integrates over the prior volume. Information criteria add an explicit penalty term. Cross-validation tests predictive performance on held-out data. They are not interchangeable—each encodes different assumptions—but for well-specified problems they usually agree.
2. Bayesian Model Comparison
2.1 The Posterior Over Models
Bayes’ theorem applies to models just as it applies to parameters. If $\mathcal{M}_k$ is a model and $\mathcal{D}$ is the observed data:
$$p(\mathcal{M}_k \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{M}_k)p(\mathcal{M}_k)}{p(\mathcal{D})}$$
where:
- $p(\mathcal{M}_k \mid \mathcal{D})$ is the posterior probability of the model
- $p(\mathcal{D} \mid \mathcal{M}_k)$ is the marginal likelihood (or evidence) for model $\mathcal{M}_k$
- $p(\mathcal{M}_k)$ is the prior probability of the model
- $p(\mathcal{D}) = \sum_k p(\mathcal{D} \mid \mathcal{M}_k) p(\mathcal{M}_k)$ is a normalising constant
The evidence is the quantity that does all the work. It is the probability of the observed data, averaged over the prior on the parameters:
$$p(\mathcal{D} \mid \mathcal{M}_k) = \int p(\mathcal{D} \mid \boldsymbol{\theta}_k, \mathcal{M}_k)p(\boldsymbol{\theta}_k \mid \mathcal{M}_k)\mathrm{d}\boldsymbol{\theta}_k$$
This is the same integral that appears as the normalising constant in Bayes’ theorem for parameters (lecture 4, Section 1.2). We ignored it then because it does not depend on $\boldsymbol{\theta}$. Now it is the central object of interest, because it does depend on the model.
2.2 The Bayes Factor
To compare two models, take the ratio of their posterior probabilities:
$$\frac{p(\mathcal{M}_2 \mid \mathcal{D})}{p(\mathcal{M}_1 \mid \mathcal{D})} = \frac{p(\mathcal{D} \mid \mathcal{M}_2)}{p(\mathcal{D} \mid \mathcal{M}_1)} \cdot \frac{p(\mathcal{M}_2)}{p(\mathcal{M}_1)}$$
The first factor on the right is the Bayes factor:
$$B_{21} = \frac{p(\mathcal{D} \mid \mathcal{M}_2)}{p(\mathcal{D} \mid \mathcal{M}_1)}$$
If we assign equal prior probabilities to both models ($p(\mathcal{M}_1) = p(\mathcal{M}_2)$), the posterior odds ratio equals the Bayes factor. In practice, equal model priors are the default unless you have strong reason to prefer one model a priori.
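Converting a Bayes factor and prior odds into a posterior model probability is one line of arithmetic; a small helper (my own naming, for illustration) makes it explicit:

```python
def posterior_probability(bayes_factor_21, prior_odds_21=1.0):
    """Posterior probability of M2, given B_21 and prior odds p(M2)/p(M1)."""
    posterior_odds = bayes_factor_21 * prior_odds_21
    return posterior_odds / (1.0 + posterior_odds)

# With equal model priors, B_21 = 10 gives p(M2 | D) ~ 0.91
print(posterior_probability(10.0))
# A sceptical 1:10 prior against M2 pulls the same data back to even odds
print(posterior_probability(10.0, prior_odds_21=0.1))
```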
2.3 Interpreting Bayes Factors: The Jeffreys Scale
Harold Jeffreys proposed a scale for interpreting Bayes factors that remains widely used:
| $\log_{10} B_{21}$ | $B_{21}$ | Strength of evidence for $\mathcal{M}_2$ |
|---|---|---|
| $< 0$ | $< 1$ | Favours $\mathcal{M}_1$ |
| $0$ to $0.5$ | $1$ to $3$ | Barely worth mentioning |
| $0.5$ to $1$ | $3$ to $10$ | Substantial |
| $1$ to $2$ | $10$ to $100$ | Strong |
| $> 2$ | $> 100$ | Decisive |
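The table translates directly into a small helper (my own convenience function, not a standard API):

```python
def jeffreys_category(log10_B21):
    """Map log10(B_21) onto the Jeffreys scale of Section 2.3."""
    if log10_B21 < 0:
        return "favours M1"
    if log10_B21 < 0.5:
        return "barely worth mentioning"
    if log10_B21 < 1:
        return "substantial"
    if log10_B21 < 2:
        return "strong"
    return "decisive"

print(jeffreys_category(1.3))   # strong
```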
2.4 A Worked Example: Linear vs. Quadratic
To make this concrete, consider fitting $N = 20$ data points generated from a true quadratic $y = 3 + 0.5x + 0.02x^2 + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, 1.75^2)$.
Model 1 (linear): $y = \theta_0 + \theta_1 x + \varepsilon$, with priors $\theta_0, \theta_1 \sim \mathcal{N}(0, 100^2)$.
Model 2 (quadratic): $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \varepsilon$, with priors $\theta_0, \theta_1, \theta_2 \sim \mathcal{N}(0, 100^2)$.
import numpy as np
from scipy.integrate import dblquad, tplquad
from scipy.stats import norm
# Generate data from a quadratic
np.random.seed(42)
N = 20
sigma_y = 1.75
x = np.sort(np.random.uniform(-5, 5, N))
y_true = 3.0 + 0.5 * x + 0.02 * x**2
y = y_true + np.random.normal(0, sigma_y, N)
def log_likelihood(y, y_pred, sigma):
return np.sum(norm.logpdf(y, loc=y_pred, scale=sigma))
# --- Evidence for the linear model (2D integral) ---
def integrand_linear(theta0, theta1):
y_pred = theta0 + theta1 * x
ll = log_likelihood(y, y_pred, sigma_y)
lp = norm.logpdf(theta0, 0, 100) + norm.logpdf(theta1, 0, 100)
return np.exp(ll + lp)
evidence_linear, err_linear = dblquad(
integrand_linear,
-10, 10, # theta1 limits (guided by data)
-10, 20, # theta0 limits
)
# --- Evidence for the quadratic model (3D integral) ---
def integrand_quadratic(theta0, theta1, theta2):
y_pred = theta0 + theta1 * x + theta2 * x**2
ll = log_likelihood(y, y_pred, sigma_y)
lp = (norm.logpdf(theta0, 0, 100) + norm.logpdf(theta1, 0, 100)
+ norm.logpdf(theta2, 0, 100))
return np.exp(ll + lp)
evidence_quadratic, err_quadratic = tplquad(
integrand_quadratic,
-5, 5, # theta2 limits
-10, 10, # theta1 limits
-10, 20, # theta0 limits
)
B_21 = evidence_quadratic / evidence_linear
print(f"log Evidence (linear): {np.log(evidence_linear):.2f}")
print(f"log Evidence (quadratic): {np.log(evidence_quadratic):.2f}")
print(f"Bayes factor B_21: {B_21:.2f}")
print(f"log10(B_21): {np.log10(B_21):.2f}")

3. Computing the Evidence
The evidence integral
$$p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, \mathcal{M})p(\boldsymbol{\theta} \mid \mathcal{M})\mathrm{d}\boldsymbol{\theta}$$
is the central computational challenge in Bayesian model comparison. Unlike parameter estimation, where MCMC can avoid computing the normalising constant, model comparison requires the normalising constant itself. And this integral is hard—often harder than the parameter estimation problem that motivated it.
3.1 Why It Is Hard
The difficulty is dimensional. The prior typically occupies a large volume in parameter space, but the likelihood is sharply peaked in a much smaller region. The evidence is the average of the likelihood over the prior, which means most of the prior volume contributes negligibly to the integral. Naive Monte Carlo sampling from the prior is therefore hopelessly inefficient for all but the simplest models: you would need to draw an astronomically large number of prior samples before any of them landed in the likelihood peak.
3.2 The Harmonic Mean Estimator (and Why It Fails)
Given posterior samples $\{\boldsymbol{\theta}_s\}_{s=1}^{S}$ from MCMC, a tempting identity gives:
$$\frac{1}{p(\mathcal{D} \mid \mathcal{M})} = \int \frac{p(\boldsymbol{\theta} \mid \mathcal{M})}{p(\mathcal{D} \mid \boldsymbol{\theta}, \mathcal{M})p(\boldsymbol{\theta} \mid \mathcal{M})}p(\boldsymbol{\theta} \mid \mathcal{D}, \mathcal{M})\mathrm{d}\boldsymbol{\theta} = \mathbb{E}_{\text{post}}\left[\frac{1}{p(\mathcal{D} \mid \boldsymbol{\theta}, \mathcal{M})}\right]$$
This suggests the estimator:
$$\hat{p}(\mathcal{D} \mid \mathcal{M}) = \left[\frac{1}{S}\sum_{s=1}^{S} \frac{1}{p(\mathcal{D} \mid \boldsymbol{\theta}_s, \mathcal{M})}\right]^{-1}$$
which is the harmonic mean of the likelihoods evaluated at posterior samples. It is seductively convenient—it needs nothing beyond the posterior samples you already have—but it fails in practice. The sum is dominated by the rare samples with the lowest likelihood, and the estimator's variance is typically infinite: the estimate never stabilises as $S$ grows, and it systematically overestimates the evidence because the extreme low-likelihood samples needed to pull it down almost never appear in a finite run. Do not use it.
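A toy check makes the failure visible. Take a single datum $y_0 \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, \tau^2)$, so the evidence and the posterior are both known in closed form (a sketch with assumed numbers; we can draw exact posterior samples and watch the harmonic mean miss):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Toy model: one datum y0 ~ N(theta, 1), prior theta ~ N(0, tau^2).
# The evidence is analytic: marginally, y0 ~ N(0, 1 + tau^2).
rng = np.random.default_rng(1)
y0, tau = 1.0, 10.0
log_Z_true = norm.logpdf(y0, 0, np.sqrt(1 + tau**2))

# The posterior is also Gaussian, so we can draw exact posterior samples.
post_mean = tau**2 * y0 / (1 + tau**2)
post_sd = np.sqrt(tau**2 / (1 + tau**2))

estimates = []
for trial in range(5):
    theta = rng.normal(post_mean, post_sd, 5000)
    log_L = norm.logpdf(y0, theta, 1.0)
    # Harmonic mean in log space: log S - logsumexp(-log L)
    estimates.append(np.log(len(theta)) - logsumexp(-log_L))

print("harmonic mean estimates:", np.round(estimates, 2))
print(f"analytic log-evidence:  {log_Z_true:.2f}")
```

The estimates sit well above the analytic value and scatter from run to run, as expected from an estimator with (here) infinite variance.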
3.3 Nested Sampling
Nested sampling, introduced by John Skilling (2004), is the most widely used method for evidence computation in astronomy. The key insight is to transform the multi-dimensional evidence integral into a one-dimensional integral over the prior volume.
Define the prior volume enclosed within a likelihood contour $\mathcal{L}^* = L$:
$$X(L) = \int_{\mathcal{L}(\boldsymbol{\theta}) > L} p(\boldsymbol{\theta} \mid \mathcal{M})\mathrm{d}\boldsymbol{\theta}$$
As $L$ increases from 0 to $\mathcal{L}_\text{max}$, $X$ decreases from 1 to 0. The evidence is then:
$$p(\mathcal{D} \mid \mathcal{M}) = \int_0^1 L(X)\mathrm{d}X$$
This is a one-dimensional integral that can be approximated numerically. The algorithm works by maintaining a set of $n_\text{live}$ “live points” drawn from the prior, iteratively replacing the lowest-likelihood point with a new point drawn from the prior but constrained to have higher likelihood. At each iteration, the prior volume shrinks by a factor of approximately $n_\text{live} / (n_\text{live} + 1)$.
# Conceptual nested sampling (simplified)
# In practice, use dynesty, UltraNest, or MultiNest
def nested_sampling_simple(log_likelihood_fn, prior_sample_fn,
n_live=400, max_iter=10000):
"""
Simplified nested sampling for pedagogical purposes.
Parameters
----------
log_likelihood_fn : callable
Returns log-likelihood given parameter vector
prior_sample_fn : callable
Returns a sample drawn uniformly from the prior
n_live : int
Number of live points
max_iter : int
Maximum iterations
"""
# Initialise live points from the prior
live_points = np.array([prior_sample_fn() for _ in range(n_live)])
live_logl = np.array([log_likelihood_fn(p) for p in live_points])
log_evidence = -np.inf
log_X = 0.0 # log of remaining prior volume
dead_points = []
dead_logl = []
for i in range(max_iter):
# Find worst live point
worst = np.argmin(live_logl)
L_threshold = live_logl[worst]
# Shrink the prior volume
log_shrink = -1.0 / n_live
log_width = log_X + np.log(1 - np.exp(log_shrink))
log_X += log_shrink
# Accumulate evidence
log_evidence = np.logaddexp(log_evidence,
L_threshold + log_width)
dead_points.append(live_points[worst].copy())
dead_logl.append(L_threshold)
# Replace worst point with a new sample above threshold
# (In practice, this is the hard part — use constrained
# prior sampling, e.g., ellipsoidal decomposition)
while True:
new_point = prior_sample_fn()
new_logl = log_likelihood_fn(new_point)
if new_logl > L_threshold:
live_points[worst] = new_point
live_logl[worst] = new_logl
break
# Convergence check
max_remaining = np.max(live_logl) + log_X
if max_remaining < log_evidence - 5: # negligible remaining
break
    return log_evidence, np.array(dead_points), np.array(dead_logl)

dynesty, UltraNest, and MultiNest/PyMultiNest implement sophisticated strategies (ellipsoidal decomposition, slice sampling, neural network proposals) to do this efficiently in high dimensions. The conceptual algorithm above illustrates the logic but would fail for non-trivial problems, because the rejection-sampling step becomes prohibitively expensive as the likelihood threshold rises.

3.4 Thermodynamic Integration
Thermodynamic integration (also called power posteriors or path sampling) connects the prior to the posterior via a continuous “temperature” parameter $\beta \in [0, 1]$:
$$p_\beta(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})^{\beta}\, p(\boldsymbol{\theta})$$
At $\beta = 0$, this is the prior. At $\beta = 1$, this is the posterior. The log-evidence is:
$$\log p(\mathcal{D} \mid \mathcal{M}) = \int_0^1 \mathbb{E}_{\beta}\left[\log p(\mathcal{D} \mid \boldsymbol{\theta})\right] \mathrm{d}\beta$$
where $\mathbb{E}_\beta[\cdot]$ denotes the expectation under $p_\beta$. In practice, you run MCMC chains at a sequence of temperatures $0 = \beta_0 < \beta_1 < \cdots < \beta_T = 1$, compute $\langle \log \mathcal{L} \rangle_{\beta_t}$ at each temperature, and numerically integrate.
def thermodynamic_integration(log_likelihood_fn, log_prior_fn,
sampler_fn, betas, n_samples=2000):
"""
Estimate log-evidence via thermodynamic integration.
Parameters
----------
log_likelihood_fn : callable
Log-likelihood function
log_prior_fn : callable
Log-prior function
sampler_fn : callable
sampler_fn(log_target, n_samples) returns samples from
a distribution proportional to exp(log_target)
betas : array
Temperature schedule from 0 to 1
n_samples : int
Samples per temperature
"""
mean_logl = np.zeros(len(betas))
for i, beta in enumerate(betas):
# Define the tempered log-posterior
def log_target(theta):
return beta * log_likelihood_fn(theta) + log_prior_fn(theta)
# Sample from the tempered distribution
samples = sampler_fn(log_target, n_samples)
# Compute mean log-likelihood at this temperature
logl_values = np.array([log_likelihood_fn(s) for s in samples])
mean_logl[i] = np.mean(logl_values)
# Numerical integration (trapezoidal rule)
log_evidence = np.trapz(mean_logl, betas)
    return log_evidence

The advantage of thermodynamic integration is that it reuses MCMC machinery you already have. The disadvantage is that it requires many MCMC runs (one per temperature), so the total computational cost can be high.
3.5 The Laplace Approximation
For models where the posterior is approximately Gaussian (which it often is for well-constrained linear or mildly non-linear models), the evidence can be approximated analytically. Expand the log-posterior around its maximum $\hat{\boldsymbol{\theta}}$:
$$\log p(\mathcal{D} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}) \approx \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) p(\hat{\boldsymbol{\theta}}) - \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^\top \mathbf{H} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$$
where $\mathbf{H}$ is the Hessian of the negative log-posterior at the maximum. The Gaussian integral then gives:
$$p(\mathcal{D} \mid \mathcal{M}) \approx p(\mathcal{D} \mid \hat{\boldsymbol{\theta}})p(\hat{\boldsymbol{\theta}})(2\pi)^{d/2}|\mathbf{H}|^{-1/2}$$
where $d$ is the number of parameters. Taking the logarithm:
$$\log p(\mathcal{D} \mid \mathcal{M}) \approx \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|$$
This is fast (it requires only the MAP estimate and the Hessian) and often surprisingly accurate. It fails when the posterior is multimodal, highly skewed, or has heavy tails. When the posterior really is Gaussian, the approximation is exact—this is why the GP marginal likelihood (lecture 11) has a closed form: the function values enter the model Gaussianly and can be marginalised out analytically.
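As a sketch, here is the Laplace approximation applied to the straight-line model of Section 2.4, with similar synthetic data and a Hessian computed by central finite differences (an assumption of this sketch; for this linear-Gaussian model the Laplace result is in fact exact):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Regenerate data similar to the Section 2.4 setup (assumed here).
rng = np.random.default_rng(42)
N, sigma_y = 20, 1.75
x = np.sort(rng.uniform(-5, 5, N))
y = 3.0 + 0.5 * x + 0.02 * x**2 + rng.normal(0, sigma_y, N)

def neg_log_post(theta):
    """Negative log-posterior for the straight-line model."""
    y_pred = theta[0] + theta[1] * x
    ll = np.sum(norm.logpdf(y, y_pred, sigma_y))
    lp = np.sum(norm.logpdf(theta, 0, 100))
    return -(ll + lp)

# MAP estimate
opt = minimize(neg_log_post, x0=np.zeros(2))

# Hessian of the negative log-posterior by central finite differences
d, eps = len(opt.x), 1e-4
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
        H[i, j] = (neg_log_post(opt.x + e_i + e_j)
                   - neg_log_post(opt.x + e_i - e_j)
                   - neg_log_post(opt.x - e_i + e_j)
                   + neg_log_post(opt.x - e_i - e_j)) / (4 * eps**2)

# Laplace evidence: log p(D|MAP) + log p(MAP) + (d/2) log 2pi - 0.5 log|H|
sign, logdetH = np.linalg.slogdet(H)
log_evidence = -opt.fun + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdetH
print(f"Laplace log-evidence (linear model): {log_evidence:.2f}")
```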
4. Information Criteria
When computing the full evidence is too expensive, information criteria provide quick approximations. They all share the same structure: a goodness-of-fit term (the maximum log-likelihood) minus a complexity penalty. The criteria differ in how they define the penalty.
4.1 The Akaike Information Criterion (AIC)
The AIC is derived from an asymptotic approximation to the expected Kullback-Leibler divergence between the fitted model and the true data-generating distribution:
$$\text{AIC} = -2 \log \mathcal{L}_\text{max} + 2p$$
where $p$ is the number of free parameters and $\mathcal{L}_\text{max} = p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_\text{MLE})$ is the maximum likelihood.
Key properties:
- The penalty is $2p$, independent of sample size $N$.
- It estimates out-of-sample predictive accuracy, not the probability that the model is correct.
- It is not consistent: as $N \to \infty$, if the true model is among the candidates, AIC does not converge to selecting it with probability 1 (it tends to select overly complex models).
- It is efficient: among the candidate models, AIC asymptotically selects the one that minimises the prediction error, even if none of the candidates is correct.
For small samples, the corrected version is preferred:
$$\text{AIC}_c = \text{AIC} + \frac{2p(p+1)}{N - p - 1}$$
which adds a stronger penalty when $N$ is not much larger than $p$.
4.2 The Bayesian Information Criterion (BIC)
The BIC arises from the Laplace approximation to the log-evidence:
$$\text{BIC} = -2 \log \mathcal{L}_\text{max} + p \log N$$
Key properties:
- The penalty grows with $N$ (as $p \log N$), so BIC penalises complexity more strongly than AIC for $N \geq 8$.
- It approximates $-2 \log p(\mathcal{D} \mid \mathcal{M})$ up to a constant that does not depend on the model.
- It is consistent: as $N \to \infty$, BIC selects the true model with probability 1 (if it is in the candidate set).
- It is not efficient: it can select a model with worse predictive performance than the one AIC picks.
def compute_aic_bic(log_likelihood_max, n_params, n_data):
"""Compute AIC, AICc, and BIC."""
aic = -2 * log_likelihood_max + 2 * n_params
aicc = aic + 2 * n_params * (n_params + 1) / (n_data - n_params - 1)
bic = -2 * log_likelihood_max + n_params * np.log(n_data)
return aic, aicc, bic
# Example: compare polynomial fits of degree 1, 2, 3
for degree in [1, 2, 3]:
coeffs = np.polyfit(x, y, degree, w=1.0/np.full(N, sigma_y))
y_pred = np.polyval(coeffs, x)
log_L = np.sum(norm.logpdf(y, loc=y_pred, scale=sigma_y))
p = degree + 1 # number of coefficients
aic, aicc, bic = compute_aic_bic(log_L, p, N)
print(f"Degree {degree}: log L = {log_L:.1f}, "
          f"AIC = {aic:.1f}, AICc = {aicc:.1f}, BIC = {bic:.1f}")

4.3 The Deviance Information Criterion (DIC)
The DIC was introduced for hierarchical models where the “number of parameters” is not well defined (e.g., in random-effects models or models with constraints). It uses the posterior mean $\bar{\boldsymbol{\theta}} = \mathbb{E}_\text{post}[\boldsymbol{\theta}]$ and defines an effective number of parameters:
$$p_D = 2\left[\log p(\mathcal{D} \mid \bar{\boldsymbol{\theta}}) - \mathbb{E}_\text{post}[\log p(\mathcal{D} \mid \boldsymbol{\theta})]\right]$$
The DIC is then:
$$\text{DIC} = -2 \log p(\mathcal{D} \mid \bar{\boldsymbol{\theta}}) + 2 p_D$$
The effective number of parameters $p_D$ is computed from posterior samples: it is twice the difference between the log-likelihood at the posterior mean and the posterior mean of the log-likelihood.
def compute_dic(log_likelihood_fn, posterior_samples):
"""
Compute DIC from posterior samples.
Parameters
----------
log_likelihood_fn : callable
Takes a parameter vector, returns log-likelihood
posterior_samples : array, shape (n_samples, n_params)
Samples from the posterior
"""
# Log-likelihood at the posterior mean
theta_bar = np.mean(posterior_samples, axis=0)
ll_at_mean = log_likelihood_fn(theta_bar)
# Posterior mean of the log-likelihood
ll_samples = np.array([log_likelihood_fn(s) for s in posterior_samples])
mean_ll = np.mean(ll_samples)
# Effective number of parameters
p_D = 2 * (ll_at_mean - mean_ll)
# DIC
dic = -2 * ll_at_mean + 2 * p_D
    return dic, p_D

4.4 The Widely Applicable Information Criterion (WAIC)
The WAIC (Watanabe, 2010) is the most theoretically principled information criterion for Bayesian models. It uses the full posterior predictive distribution, not just a point estimate:
$$\text{WAIC} = -2 \sum_{i=1}^{N} \log \left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \boldsymbol{\theta}_s)\right) + 2 p_\text{WAIC}$$
The first term is the log pointwise predictive density (lppd)—the log of the posterior-averaged likelihood for each data point, summed over data points. The second term is the effective number of parameters:
$$p_\text{WAIC} = \sum_{i=1}^{N} \text{Var}_\text{post}[\log p(y_i \mid \boldsymbol{\theta})]$$
This is the sum of the posterior variance of the log-likelihood contributions, computed pointwise.
def compute_waic(log_likelihood_pointwise):
"""
Compute WAIC from pointwise log-likelihoods.
Parameters
----------
log_likelihood_pointwise : array, shape (n_samples, n_data)
log_likelihood_pointwise[s, i] = log p(y_i | theta_s)
"""
S, N = log_likelihood_pointwise.shape
# lppd: log pointwise predictive density
# Use logsumexp for numerical stability
from scipy.special import logsumexp
lppd_i = logsumexp(log_likelihood_pointwise, axis=0) - np.log(S)
lppd = np.sum(lppd_i)
# Effective number of parameters
p_waic = np.sum(np.var(log_likelihood_pointwise, axis=0))
# WAIC
waic = -2 * (lppd - p_waic)
    return waic, p_waic, lppd

Why WAIC is preferred over DIC:
- It uses the full posterior, not just the posterior mean.
- It is invariant under reparameterisation.
- The effective number of parameters $p_\text{WAIC}$ is always positive and well defined.
- It asymptotically equals the leave-one-out cross-validation error (Section 6), giving it a clear predictive interpretation.
4.5 Summary of Information Criteria
| Criterion | Penalty | Requires posterior? | Assumptions |
|---|---|---|---|
| AIC | $2p$ | No (MLE only) | Large $N$, model near-Gaussian |
| AIC$_c$ | $2p + \frac{2p(p+1)}{N-p-1}$ | No | Finite-sample correction |
| BIC | $p \log N$ | No (MLE only) | Large $N$, Laplace approximation |
| DIC | $2p_D$ | Yes | Posterior mean is representative |
| WAIC | $2p_\text{WAIC}$ | Yes | Pointwise likelihoods available |
For problems where you have posterior samples (which you should, if you are doing Bayesian inference), WAIC is the default recommendation. If you only have a point estimate, use BIC for model identification or AIC for prediction.
5. The Occam Factor
5.1 Why Does Bayesian Inference Prefer Simpler Models?
The Bayesian evidence has a built-in preference for parsimony that emerges naturally from the mathematics, without any ad hoc penalty term. To see why, consider two models for the same data:
- Model 1 has 1 parameter $\theta_1$ with a broad prior $p(\theta_1) = 1/\Delta\theta_1$
- Model 2 has 2 parameters $(\theta_1, \theta_2)$ with broad priors $p(\theta_1, \theta_2) = 1/(\Delta\theta_1 \cdot \Delta\theta_2)$
Under the Laplace approximation, the evidence for each model is roughly:
$$p(\mathcal{D} \mid \mathcal{M}) \approx p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) \times \frac{\delta\theta_1 \cdot \delta\theta_2 \cdots \delta\theta_d}{\Delta\theta_1 \cdot \Delta\theta_2 \cdots \Delta\theta_d}$$
where $\delta\theta_j$ is the posterior width (the range of parameter values consistent with the data) and $\Delta\theta_j$ is the prior width. The ratio $\delta\theta_j / \Delta\theta_j$ is the Occam factor for parameter $j$: the fraction of the prior volume that is consistent with the data.
5.2 The Geometry of the Occam Factor
The evidence is the product of the best-fit likelihood and the Occam factor:
$$p(\mathcal{D} \mid \mathcal{M}) \approx \mathcal{L}_\text{max} \times \underbrace{\prod_{j=1}^{d} \frac{\delta\theta_j}{\Delta\theta_j}}_{\text{Occam factor}}$$
A complex model achieves a higher $\mathcal{L}_\text{max}$ (it fits the data better), but it pays a cost through the Occam factor: each extra parameter contributes a factor $\delta\theta_j / \Delta\theta_j < 1$. If the extra parameter does not substantially improve the fit, the Occam penalty outweighs the improvement, and the simpler model wins.
This is the automatic Occam’s razor of Bayesian inference. It does not need to be imposed—it is a consequence of probability theory. The GP marginal likelihood from lecture 11 is a continuous example of the same mechanism: the complexity-penalty term $-\frac{1}{2}\log|\mathbf{K} + \sigma_n^2 \mathbf{I}|$ is the Occam factor for the infinite-dimensional function space.
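Plugging in illustrative numbers shows the size of the effect (the widths below are assumed for the sake of the sketch, not computed from data):

```python
import numpy as np

# Illustrative (assumed) numbers for one extra parameter, theta_2:
# the data pin it down to a width ~0.05, the prior N(0, 100^2) allows ~100.
delta_post = 0.05       # posterior width for theta_2 (assumed)
delta_prior = 100.0     # prior width for theta_2 (assumed)
occam = delta_post / delta_prior

print(f"Occam factor: {occam:.1e}")              # 5.0e-04
print(f"log10 penalty: {np.log10(occam):.1f}")   # -3.3
# To win, the extra parameter must raise log10(L_max) by more than ~3.3.
```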
5.3 The Role of Priors in Model Comparison
This analysis reveals something important: the evidence depends on the prior widths, not just the prior shape. Widening the prior on $\theta_2$ from $\mathcal{N}(0, 100^2)$ to $\mathcal{N}(0, 1000^2)$ makes the Occam penalty ten times worse for the quadratic model, even though the posterior is essentially unchanged.
For parameter estimation, vague priors are harmless—the posterior is dominated by the likelihood. For model comparison, vague priors are penalised, because they waste prior volume on parameter values that the data exclude. This means:
- Priors must be chosen carefully for model comparison—they are part of the model.
- Improper priors (e.g., flat over $(-\infty, +\infty)$) are defined only up to an arbitrary multiplicative constant, so the Bayes factors they produce are arbitrary too; they cannot be used for model comparison.
- The sensitivity of the Bayes factor to the prior choice should be assessed (a prior sensitivity analysis).
6. Cross-Validation
Cross-validation takes a fundamentally different approach: instead of computing the probability of the data under the model, it directly measures the model’s predictive performance on held-out data.
6.1 Leave-One-Out Cross-Validation (LOO-CV)
For each data point $y_i$, fit the model to all data except $y_i$, and compute the predictive probability of the held-out point:
$$p(y_i \mid \mathcal{D}_{-i}) = \int p(y_i \mid \boldsymbol{\theta})p(\boldsymbol{\theta} \mid \mathcal{D}_{-i})\mathrm{d}\boldsymbol{\theta}$$
where $\mathcal{D}_{-i}$ is the dataset with point $i$ removed. The LOO-CV score is:
$$\text{LOO-CV} = \sum_{i=1}^{N} \log p(y_i \mid \mathcal{D}_{-i})$$
Higher is better (more probable predictions). In principle, this requires refitting the model $N$ times—once for each held-out point.
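The brute-force version is worth seeing once. This sketch (assumed synthetic data) uses a plug-in maximum-likelihood predictive for each refit rather than the full posterior integral above—a common, cruder simplification:

```python
import numpy as np
from scipy.stats import norm

# Brute-force LOO sketch: refit a straight line N times, each time
# scoring the held-out point with a plug-in (ML) predictive density.
rng = np.random.default_rng(0)
N, sigma_y = 20, 1.0
x = np.linspace(-5, 5, N)
y = 3.0 + 0.5 * x + rng.normal(0, sigma_y, N)

loo = 0.0
for i in range(N):
    mask = np.arange(N) != i
    coeffs = np.polyfit(x[mask], y[mask], 1)     # refit without point i
    loo += norm.logpdf(y[i], np.polyval(coeffs, x[i]), sigma_y)
print(f"plug-in LOO score: {loo:.2f}")
```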
6.2 Pareto-Smoothed Importance Sampling (PSIS-LOO)
Refitting $N$ times is expensive. Vehtari, Gelman & Gabry (2017) showed that LOO-CV can be approximated from a single posterior fit using importance sampling:
$$p(y_i \mid \mathcal{D}_{-i}) \approx \left[\frac{1}{S}\sum_{s=1}^{S} \frac{1}{p(y_i \mid \boldsymbol{\theta}_s)}\right]^{-1}$$
The weights $w_s^{(i)} \propto 1/p(y_i \mid \boldsymbol{\theta}_s)$ are importance weights that reweight the full-data posterior to approximate the leave-one-out posterior. Raw importance sampling is unstable (this is essentially the harmonic mean estimator in disguise for each data point), but Pareto smoothing of the weight tails stabilises it dramatically.
The arviz library implements this:
import arviz as az
# Assuming you have an InferenceData object from PyMC:
# loo_result = az.loo(inference_data, pointwise=True)
# print(loo_result)
# Manual computation from pointwise log-likelihoods:
def psis_loo_manual(log_lik_pointwise):
"""
Simplified PSIS-LOO.
Parameters
----------
log_lik_pointwise : array, shape (n_samples, n_data)
"""
    from scipy.special import logsumexp

    S, N = log_lik_pointwise.shape
    loo_i = np.zeros(N)
    for i in range(N):
        # Importance weights (in log space); stabilise by subtracting the max
        log_weights = -log_lik_pointwise[:, i]
        log_weights -= np.max(log_weights)
        weights = np.exp(log_weights)
        weights /= np.sum(weights)
        # LOO predictive density (importance-weighted)
        loo_i[i] = logsumexp(log_lik_pointwise[:, i], b=weights)
    return np.sum(loo_i), loo_i

6.3 K-Fold Cross-Validation
When PSIS-LOO is unreliable (too many points whose Pareto shape diagnostic $\hat{k}$ exceeds the conventional 0.7 threshold), $K$-fold cross-validation is the fallback. Partition the data into $K$ disjoint subsets; for each fold, fit the model to the other $K-1$ subsets and evaluate on the held-out subset. The cross-validation score is the sum of the held-out log-likelihoods across folds.
$K = 10$ is conventional. $K = N$ recovers LOO-CV exactly.
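A minimal sketch (again with a plug-in maximum-likelihood refit per fold, on assumed synthetic data and an assumed known noise level):

```python
import numpy as np
from scipy.stats import norm

def kfold_cv(x, y, degree, sigma, K=10, seed=0):
    """K-fold CV score for a polynomial fit (plug-in ML refit per fold).
    A sketch assuming the noise level sigma is known."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    score = 0.0
    for fold in np.array_split(idx, K):
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        score += np.sum(norm.logpdf(y[fold], np.polyval(coeffs, x[fold]), sigma))
    return score

# Synthetic straight-line data; degree 1 will typically score best
rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 40)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 40)
for d in [1, 2, 5]:
    print(f"degree {d}: 10-fold CV score = {kfold_cv(x, y, d, sigma=1.0):.2f}")
```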
6.4 Connection Between LOO-CV and WAIC
LOO-CV and WAIC are asymptotically equivalent: as $N \to \infty$, they converge to the same quantity. For finite samples, LOO-CV (via PSIS) is generally more reliable because it does not rely on the asymptotic approximation. In practice, for well-behaved models, the numerical differences between WAIC and PSIS-LOO are small.
This means you have two routes to the same destination. WAIC is computed directly from pointwise log-likelihoods and requires no importance sampling. PSIS-LOO is more robust but requires the importance weight machinery. Use whichever is easier to compute for your problem, and cross-check with the other if the result is borderline.
7. Posterior Predictive Model Checking
Model comparison tells you which model is relatively better. It does not tell you whether any model is absolutely adequate. A model can win a comparison and still be a poor description of the data. Posterior predictive checks (PPCs) fill this gap.
You encountered PPCs in lecture 4 (the Bayesian workflow) and used them implicitly in lecture 5 (residual diagnostics). The idea is simple: if the model is correct, data simulated from the fitted model should look like the observed data.
7.1 The Posterior Predictive Distribution
The posterior predictive distribution for new data $\tilde{y}$ is:
$$p(\tilde{y} \mid \mathcal{D}) = \int p(\tilde{y} \mid \boldsymbol{\theta})p(\boldsymbol{\theta} \mid \mathcal{D})\mathrm{d}\boldsymbol{\theta}$$
In practice, you generate replicated datasets by:
- Drawing $\boldsymbol{\theta}_s$ from the posterior samples
- Simulating $\tilde{y}_s \sim p(\tilde{y} \mid \boldsymbol{\theta}_s)$ from the likelihood
Each replicated dataset $\tilde{y}_s$ is one possible realisation of new data under the fitted model.
7.2 Test Statistics
A PPC compares a test statistic $T(y)$ computed on the real data to the distribution $T(\tilde{y}_s)$ computed on replicated data. If $T(y)$ lies in the tails of the replicated distribution, the model fails to capture whatever feature $T$ measures.
Common test statistics for regression problems:
- Mean and variance of residuals
- Maximum absolute residual (sensitivity to outliers)
- Autocorrelation of residuals at lag 1 (sensitivity to correlated structure)
- Kolmogorov-Smirnov statistic comparing residual distribution to $\mathcal{N}(0,1)$
def posterior_predictive_check(y_obs, y_rep_samples, test_statistic):
"""
Perform a posterior predictive check.
Parameters
----------
y_obs : array, shape (N,)
Observed data
y_rep_samples : array, shape (S, N)
Replicated datasets from the posterior predictive
test_statistic : callable
Function that takes a data vector and returns a scalar
Returns
-------
T_obs : float
Test statistic on observed data
T_rep : array
Test statistics on replicated datasets
p_value : float
Bayesian p-value (fraction of T_rep >= T_obs)
"""
T_obs = test_statistic(y_obs)
T_rep = np.array([test_statistic(yr) for yr in y_rep_samples])
p_value = np.mean(T_rep >= T_obs)
return T_obs, T_rep, p_value
# Example: check whether the model captures the variance
import matplotlib.pyplot as plt
# Simulate replicated data from the plug-in best-fit line
# (a simplification: a full PPC would draw parameters from posterior samples)
y_rep = np.array([
np.polyval(np.polyfit(x, y, 1, w=1/np.full(N, sigma_y)), x)
+ np.random.normal(0, sigma_y, N)
for _ in range(1000)
])
T_obs, T_rep, pval = posterior_predictive_check(
y, y_rep, test_statistic=np.var
)
print(f"Observed variance: {T_obs:.2f}")
print(f"Replicated mean: {np.mean(T_rep):.2f}")
print(f"Bayesian p-value: {pval:.3f}")

7.3 PPCs and Model Comparison Together
The correct workflow uses both tools:
- PPC first: check that each candidate model adequately describes the data. If a model fails basic PPCs, eliminate it before comparing.
- Compare survivors: use Bayes factors, information criteria, or cross-validation to rank the models that pass the PPCs.
- Iterate: if no model passes, build a better one using the PPC failures as guidance (this is the workflow loop from lecture 4).
8. Common Pitfalls
8.1 The Lindley Paradox
The Lindley paradox arises when a frequentist test rejects the null hypothesis (small $p$-value) but the Bayesian analysis strongly favours the null (large Bayes factor for $\mathcal{M}_0$). This happens when:
- The data show a statistically significant but scientifically small effect.
- The prior on the alternative model’s parameter is very broad.
Under these conditions, the Occam factor destroys the alternative model. The broad prior wastes so much volume on parameter values far from the data that the evidence is low, even though the maximum likelihood is higher.
The resolution is not that one framework is right and the other wrong. They are answering different questions. The frequentist test asks: “Is the effect distinguishable from zero?” The Bayesian comparison asks: “Given my prior beliefs about plausible effect sizes, does adding this parameter improve the model enough to justify the added complexity?”
In practice, if you encounter the Lindley paradox, the first thing to check is your prior. A prior of $\mathcal{N}(0, 10^6)$ on an effect that could plausibly be at most $\mathcal{O}(1)$ is not “uninformative”—it is misinformative for model comparison.
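A toy numerical illustration of the paradox (hypothetical numbers): data $y_i \sim \mathcal{N}(\mu, 1)$ with a "3-sigma" sample mean, compared under $\mathcal{M}_0$: $\mu = 0$ versus $\mathcal{M}_1$: $\mu \sim \mathcal{N}(0, \tau^2)$ with a deliberately broad prior. The marginal distribution of $\bar{y}$ is $\mathcal{N}(0, 1/N)$ under $\mathcal{M}_0$ and $\mathcal{N}(0, 1/N + \tau^2)$ under $\mathcal{M}_1$, so the Bayes factor is a ratio of two Gaussian densities:

```python
import numpy as np
from scipy import stats

N = 10_000
ybar = 3.0 / np.sqrt(N)   # a "3-sigma" sample mean
sem = 1.0 / np.sqrt(N)    # standard error of the mean

# Frequentist two-sided p-value: rejects the null at 3 sigma
p_value = 2 * stats.norm.sf(abs(ybar) / sem)

# Bayes factor B01 with a deliberately broad prior (tau = 10)
tau = 10.0
B01 = (stats.norm.pdf(ybar, 0, sem)
       / stats.norm.pdf(ybar, 0, np.sqrt(sem**2 + tau**2)))

print(f"p-value = {p_value:.4f}")  # small: frequentist rejects M0
print(f"B01     = {B01:.1f}")      # > 1: Bayes factor favours M0
```

Shrinking `tau` toward the plausible effect size makes the disagreement disappear, which is exactly the point about prior choice made above.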
8.2 Sensitivity to Prior Choice
Because the evidence depends on the prior volume (Section 5.3), Bayes factors can be sensitive to the prior specification in ways that do not affect parameter estimation. Best practices:
- Use informative priors motivated by physics. “What range of values is plausible for this parameter?” is a question you can usually answer.
- Report results for multiple prior choices. If the Bayes factor changes sign when you double the prior width, the data are not informative enough for model comparison.
- Consider the Savage-Dickey density ratio for nested models, which expresses the Bayes factor in terms of the posterior and prior density at a single point (the restriction defining the simpler model).
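The Savage-Dickey ratio is easy to estimate from posterior samples. A minimal sketch with stand-in numbers: the models differ only in whether a parameter $\beta$ is fixed at zero, the prior on $\beta$ under the larger model is assumed $\mathcal{N}(0, 1)$, and the posterior samples are simulated rather than coming from a real fit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
posterior_beta = rng.normal(0.3, 0.15, 5000)  # stand-in posterior samples
prior = stats.norm(0, 1)                      # proper prior on beta under M1

# B01 = p(beta = 0 | D) / p(beta = 0): posterior and prior density
# evaluated at the restriction point, posterior estimated by KDE
posterior_at_0 = stats.gaussian_kde(posterior_beta)(0.0)[0]
B01 = posterior_at_0 / prior.pdf(0.0)  # > 1 favours the restricted model

print(f"Savage-Dickey B01 = {B01:.2f}")
```

No evidence integral is computed anywhere, which is why this trick is so useful for nested models.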
8.3 Improper Priors
As noted in Section 5.3, improper priors produce undefined evidence values. If you are using flat (uniform over $\mathbb{R}$) priors for convenience in parameter estimation, you must replace them with proper priors before computing Bayes factors. A common strategy is to use the same proper prior in both models for shared parameters, so that the Occam penalty applies only to the parameters that differ.
8.4 Comparing Non-Nested Models
AIC and BIC make no assumption about whether models are nested (i.e., whether one is a special case of the other). Neither does the Bayes factor. You can compare a GP model against a polynomial, or a mixture model against a single-component model, or a Student-$t$ likelihood against a Gaussian likelihood. The only requirement is that all models define a proper likelihood for the same observed data $\mathcal{D}$.
However, be cautious when comparing models that parameterise the data differently (e.g., one model fits magnitudes and another fits fluxes). The likelihoods must be on the same scale—which means including the appropriate Jacobian if you transform the data.
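A sketch of the Jacobian correction for the magnitude/flux case (all numbers hypothetical): with $m = -2.5 \log_{10} f$, the density transforms with $|\mathrm{d}m/\mathrm{d}f| = 2.5/(f \ln 10)$, so the log-likelihoods are related by a sum of log-Jacobian terms:

```python
import numpy as np

f = np.array([120.0, 95.0, 210.0])  # hypothetical fluxes
log_like_mag = -12.3                # hypothetical log p(m | model)

# log p(f | model) = log p(m | model) + sum_i log |dm/df|_i
log_jac = np.sum(np.log(2.5 / (f * np.log(10))))
log_like_flux_equiv = log_like_mag + log_jac

print(f"log-likelihood on the flux scale: {log_like_flux_equiv:.2f}")
```

Only after this correction can the magnitude-fitting model be compared against one whose likelihood is defined directly on fluxes.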
8.5 The Number-of-Components Problem
In mixture models (lecture 13), choosing $K$ is a model comparison problem with a special complication: the models are not nested in the usual sense, because the labelling of components is arbitrary (the “label switching” problem). BIC works reasonably well despite this, as we saw in lecture 13. For a fully Bayesian treatment, non-parametric mixture models (e.g., Dirichlet process mixtures) let the data determine $K$, but they are computationally expensive and outside the scope of this course.
9. A Practical Decision Framework
Given the profusion of methods, here is a practical guide for deciding which to use:
9.1 The Quick Check
For a quick comparison between a small number of models:
- Compute AIC and BIC for each model. These require only the maximum log-likelihood and the parameter count. If they agree, you probably have your answer.
- If they disagree, the data are in a regime where the penalty matters. Proceed to a more careful analysis.
9.2 The Bayesian Analysis
If you have posterior samples (which you should for non-trivial problems):
- Compute WAIC or PSIS-LOO. These are the most principled information criteria and can be computed directly from posterior samples. Use arviz for convenience.
- Run posterior predictive checks. Ensure all candidate models pass basic adequacy tests before comparing them.
- If you need the actual evidence (e.g., for reporting a Bayes factor), use nested sampling with dynesty or UltraNest.
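To see what such tools compute under the hood, WAIC can be sketched directly from an $(S, N)$ matrix of pointwise log-likelihoods (one row per posterior sample). A minimal implementation of the standard formula, demonstrated on a simulated conjugate-normal example of my own construction:

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (S, N) matrix of pointwise log-likelihoods."""
    S = log_lik.shape[0]
    # log pointwise predictive density, via log-sum-exp for stability
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
    # effective number of parameters: sum of per-point posterior variances
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2 * (lppd - p_waic)  # deviance scale: lower is better

# Demo: posterior samples for the mean of N(mu, 1) data, flat prior
rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, 50)
mu_samples = rng.normal(y.mean(), 1 / np.sqrt(len(y)), 2000)
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_samples[:, None])**2

print(f"WAIC = {waic(log_lik):.1f}")
```

In practice, prefer the library implementations, which also report standard errors and Pareto-$k$ diagnostics.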
9.3 The Full Treatment
For high-stakes model comparisons (e.g., claiming a detection):
- Nested sampling for the evidence, with error estimates.
- Prior sensitivity analysis: how does the Bayes factor change with the prior width?
- Multiple comparison metrics: do AIC/BIC, WAIC, LOO-CV, and the Bayes factor agree?
- PPCs: does the winning model actually describe the data well?
# Complete example: comparing a linear vs quadratic model with multiple criteria
import numpy as np
from scipy.stats import norm

np.random.seed(42)
N = 30
sigma_y = 2.0
x = np.sort(np.random.uniform(-5, 5, N))
y_true = 2.0 + 0.8 * x + 0.05 * x**2
y = y_true + np.random.normal(0, sigma_y, N)
results = {}
for name, degree in [("Linear", 1), ("Quadratic", 2), ("Cubic", 3)]:
    coeffs = np.polyfit(x, y, degree, w=1.0 / np.full(N, sigma_y))
    y_pred = np.polyval(coeffs, x)
    p = degree + 1
    # Maximum log-likelihood
    log_L = np.sum(norm.logpdf(y, loc=y_pred, scale=sigma_y))
    # Information criteria
    aic = -2 * log_L + 2 * p
    aicc = aic + 2 * p * (p + 1) / (N - p - 1)
    bic = -2 * log_L + p * np.log(N)
    # Reduced chi-squared
    chi2_red = np.sum(((y - y_pred) / sigma_y)**2) / (N - p)
    results[name] = {
        "p": p, "log_L": log_L,
        "AIC": aic, "AICc": aicc, "BIC": bic,
        "chi2_red": chi2_red
    }
# Print comparison table
print(f"{'Model':<12} {'p':>3} {'log L':>8} {'AIC':>8} "
f"{'AICc':>8} {'BIC':>8} {'chi2/dof':>8}")
print("-" * 60)
for name, r in results.items():
    print(f"{name:<12} {r['p']:>3} {r['log_L']:>8.1f} {r['AIC']:>8.1f} "
          f"{r['AICc']:>8.1f} {r['BIC']:>8.1f} {r['chi2_red']:>8.2f}")
10. Astronomical Applications
10.1 Exoplanet Detection: Signal or Noise?
The canonical model comparison problem in exoplanet science is: does the radial velocity data contain a planetary signal, or is a flat line (plus noise) sufficient?
- Model 0: $v(t) = v_0 + \varepsilon(t)$, a constant velocity plus white noise
- Model 1: $v(t) = v_0 + K \sin(2\pi t / P + \phi) + \varepsilon(t)$, a Keplerian signal
Model 1 has three extra parameters ($K$, $P$, $\phi$). The maximum likelihood of Model 1 is always at least as high as Model 0. The question is whether the improvement justifies the extra parameters.
In practice, the noise is often correlated (stellar activity), so $\varepsilon(t)$ is modelled with a GP (lecture 11). The model comparison then becomes: does adding a Keplerian to the GP improve the evidence? This is where nested sampling earns its keep—the evidence integrals are high-dimensional and multimodal (the period $P$ creates sharp likelihood peaks), and MCMC-based evidence estimates are unreliable.
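As a toy illustration of the comparison (white noise only, phase fixed at zero, all numbers my own), the two evidences can be brute-forced by quadrature over coarse uniform-prior grids. This is only feasible because the toy parameter space is tiny; the real GP-plus-Keplerian problem is exactly where nested sampling is needed:

```python
import numpy as np

# Simulated radial velocities with an injected sinusoid (K=3, P=17, v0=0)
rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 100, 40))
sigma = 2.0
v = 3.0 * np.sin(2 * np.pi * t / 17.0) + rng.normal(0, sigma, t.size)

def log_like(model):
    return (-0.5 * np.sum(((v - model) / sigma)**2)
            - t.size * np.log(sigma * np.sqrt(2 * np.pi)))

def log_mean_exp(logs):
    """Stable log of the mean likelihood over a uniform-prior grid."""
    logs = np.asarray(logs)
    m = logs.max()
    return m + np.log(np.mean(np.exp(logs - m)))

# Model 0: constant velocity, prior v0 ~ U(-10, 10)
log_Z0 = log_mean_exp([log_like(np.full(t.size, v0))
                       for v0 in np.linspace(-10, 10, 201)])

# Model 1: v0 + K sin(2 pi t / P); priors U(-10,10), U(0,10), U(5,50)
L1 = [log_like(v0 + K * np.sin(2 * np.pi * t / P))
      for v0 in np.linspace(-10, 10, 21)
      for K in np.linspace(0, 10, 21)
      for P in np.linspace(5, 50, 91)]
log_Z1 = log_mean_exp(L1)

print(f"log Bayes factor (planet vs flat): {log_Z1 - log_Z0:.1f}")
```

Note that the grid over $P$ must be fine enough to resolve the sharp likelihood peak; this is the same multimodality that defeats naive MCMC evidence estimates.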
10.2 How Many Components?
In lecture 13 we used BIC to determine the number of Gaussian components in a mixture model. The Bayesian approach uses the evidence:
$$p(\mathcal{D} \mid K) = \int p(\mathcal{D} \mid \boldsymbol{\theta}_K, K)p(\boldsymbol{\theta}_K \mid K)\mathrm{d}\boldsymbol{\theta}_K$$
For mixture models, this integral is particularly challenging because of label switching (the likelihood is invariant under permutations of the component labels). Nested sampling handles this gracefully by exploring the full prior volume, including all label-switched modes.
The practical workflow:
- Fit the mixture model for $K = 1, 2, \ldots, K_\text{max}$ using nested sampling.
- Compare the evidences (or equivalently, the Bayes factors $B_{K,K-1}$).
- Select the $K$ with the highest evidence, subject to a PPC confirming that the model describes the data.
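The shape of this loop can be sketched cheaply with BIC standing in for the evidence (assuming scikit-learn is available; simulated 1-D data with two well-separated components):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated 1-D Gaussian components
rng = np.random.default_rng(7)
X = np.concatenate([rng.normal(-2, 0.5, 300),
                    rng.normal(3, 1.0, 200)]).reshape(-1, 1)

# Fit K = 1..5 and record the BIC for each
bics = {}
for K in range(1, 6):
    gm = GaussianMixture(n_components=K, n_init=5, random_state=0).fit(X)
    bics[K] = gm.bic(X)

best_K = min(bics, key=bics.get)  # lowest BIC wins
print(f"BIC-preferred number of components: {best_K}")
```

The full evidence-based workflow replaces `gm.bic(X)` with a nested-sampling run per $K$, at far greater cost.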
10.3 Model Selection for Spectral Fitting
When fitting an astronomical spectrum, you might consider:
- Is the absorption line a single Gaussian or a double (blended) feature?
- Does the continuum require a polynomial of degree 2 or 3?
- Is there a broad emission component underlying the narrow absorption?
Each question is a model comparison. The BIC is fast and often sufficient for these problems, because the models are simple (few parameters), the data are abundant (many spectral pixels), and the likelihoods are well approximated by Gaussians.
# Example: single vs double Gaussian absorption line
from scipy.optimize import minimize
def single_gaussian(x, theta):
    """Single Gaussian absorption line on a flat continuum."""
    A, mu, sigma, cont = theta
    return cont - A * np.exp(-0.5 * ((x - mu) / sigma)**2)

def double_gaussian(x, theta):
    """Double Gaussian absorption (blended line)."""
    A1, mu1, sig1, A2, mu2, sig2, cont = theta
    g1 = A1 * np.exp(-0.5 * ((x - mu1) / sig1)**2)
    g2 = A2 * np.exp(-0.5 * ((x - mu2) / sig2)**2)
    return cont - g1 - g2
# Simulated spectral data
np.random.seed(123)
wavelength = np.linspace(5000, 5020, 200)
noise_level = 0.02
# True model: two blended lines
true_flux = (1.0
- 0.3 * np.exp(-0.5 * ((wavelength - 5008) / 0.8)**2)
- 0.15 * np.exp(-0.5 * ((wavelength - 5010.5) / 0.6)**2))
flux = true_flux + np.random.normal(0, noise_level, len(wavelength))
# Fit single Gaussian
def neg_ll_single(theta):
    model = single_gaussian(wavelength, theta)
    return -np.sum(norm.logpdf(flux, loc=model, scale=noise_level))
res1 = minimize(neg_ll_single, x0=[0.3, 5009, 1.5, 1.0], method="Nelder-Mead")
ll_single = -res1.fun
# Fit double Gaussian
def neg_ll_double(theta):
    model = double_gaussian(wavelength, theta)
    return -np.sum(norm.logpdf(flux, loc=model, scale=noise_level))
res2 = minimize(neg_ll_double,
x0=[0.3, 5008, 0.8, 0.15, 5010.5, 0.6, 1.0],
method="Nelder-Mead")
ll_double = -res2.fun
# Compare with BIC
bic_single = -2 * ll_single + 4 * np.log(len(wavelength))
bic_double = -2 * ll_double + 7 * np.log(len(wavelength))
delta_bic = bic_single - bic_double # positive favours double
print(f"Single Gaussian: log L = {ll_single:.1f}, BIC = {bic_single:.1f}")
print(f"Double Gaussian: log L = {ll_double:.1f}, BIC = {bic_double:.1f}")
print(f"Delta BIC = {delta_bic:.1f} (positive favours double)")
11. Connections to Earlier Lectures
Model comparison is not a standalone topic. It connects to nearly everything in this course:
The Bayesian workflow (lecture 4) includes model checking as step 6. This lecture formalises the comparison step and adds quantitative tools (evidence, information criteria) to the qualitative checks (PPCs) we have been using throughout.
Fitting a line (lectures 5–6): the reduced $\chi^2$ diagnostic ($\chi^2_\nu \approx 1$) is a rudimentary model check. Formally, it answers the question “is this model consistent with the data?” but not “is this model better than that one?”
The GP marginal likelihood (lectures 10–11) is exactly the Bayesian evidence for a GP model. When you optimise GP hyperparameters by maximising $\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi})$, you are doing Type-II maximum likelihood, a continuous form of model comparison: selecting the kernel hyperparameters that best explain the data, with the Occam factor automatically penalising overly flexible kernels.
Hierarchical models (lecture 12): model comparison applies at the hyperparameter level. Does the population need a log-normal distribution or a mixture? The evidence for each choice integrates over all group-level parameters and the hyperparameters.
Mixture models (lecture 13): choosing $K$ is model comparison. The BIC worked well in Section 9 of that lecture. Now you understand why: BIC approximates $-2 \log p(\mathcal{D} \mid K)$, and the $p \log N$ penalty approximates the Occam factor for each additional component.
12. Summary
The model comparison problem asks: given competing models, which should you prefer? The core tension is between fit quality (complex models always fit better) and parsimony (simple models generalise better).
Bayesian model comparison uses the marginal likelihood (evidence) $p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}) \mathrm{d}\boldsymbol{\theta}$, which averages the likelihood over the prior. The ratio of evidences is the Bayes factor, interpreted via the Jeffreys scale. The evidence naturally penalises complexity through the Occam factor: the fraction of the prior volume consistent with the data.
Computing the evidence is the hard part. The harmonic mean estimator is useless. Nested sampling (use dynesty or UltraNest) is the standard in astronomy. Thermodynamic integration and the Laplace approximation are alternatives for specific problems.
Information criteria approximate model comparison without computing the full evidence. AIC estimates predictive accuracy; BIC approximates the evidence; DIC and WAIC extend these to Bayesian models with effective parameter counts. WAIC is the default recommendation when posterior samples are available.
Cross-validation directly measures predictive performance. LOO-CV, efficiently approximated via PSIS-LOO, is asymptotically equivalent to WAIC and provides per-point diagnostics that identify influential observations.
Posterior predictive checks test whether a model is absolutely adequate, not just relatively better. Always PPC before comparing. A model that fails basic PPCs should not be in the comparison set.
Common pitfalls: the Lindley paradox (broad priors destroy the evidence for the alternative), sensitivity to prior width (priors matter for model comparison even when they do not matter for parameter estimation), improper priors (make the evidence undefined), and the temptation to use the harmonic mean estimator (do not).
The practical workflow: compute AIC/BIC for a quick check; use WAIC or PSIS-LOO for a careful Bayesian analysis; run nested sampling for a rigorous evidence computation; and always run PPCs to verify adequacy.
Further Reading
- Trotta, R. (2008). Bayes in the sky: Bayesian inference and model selection in cosmology. Contemporary Physics, 49, 71. [Excellent review of Bayesian model comparison in astrophysics]
- Kass, R. E. & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773. [The foundational reference on Bayes factors and their interpretation]
- Skilling, J. (2004). Nested sampling. AIP Conference Proceedings, 735, 395. [The original nested sampling paper]
- Speagle, J. S. (2020). dynesty: a dynamic nested sampling package for estimating Bayesian posteriors and evidences. MNRAS, 493, 3132. [The dynesty paper — the most widely used nested sampler in astronomy]
- Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413. [The PSIS-LOO paper — essential for modern Bayesian model comparison]
- Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997. [Clear exposition of AIC, BIC, DIC, and WAIC]
- Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular statistical models. JMLR, 11, 3571. [The original WAIC paper]
- Jeffreys, H. (1961). Theory of Probability, 3rd edition. Oxford University Press. [The original Jeffreys scale for Bayes factor interpretation]
- Nelson, B. E. et al. (2020). Quantifying the Bayesian evidence for a planet in radial velocity data. AJ, 159, 73. [Practical application of nested sampling for exoplanet detection]