Comparing And Selecting Models

All models are wrong. Some are useful. But which ones?

Throughout this course you have built models of increasing sophistication: straight lines (lecture 5), polynomial and basis-function expansions (lecture 7), Gaussian processes (lectures 10–11), hierarchical models (lecture 12), and mixture models (lecture 13). Each time, the scientific question was about the parameters within a model. But there is a prior question that we have been deferring: given two or more competing models, which one should you prefer?

This is the model comparison problem, and it is unavoidable. Is the radial velocity signal from a planet or from stellar activity? Does a colour-magnitude diagram need two Gaussian components or three? Is a quadratic fit justified over a straight line, or is the improvement just noise fitting? Every time you add a parameter, change a likelihood, or swap one functional form for another, you are implicitly comparing models. This lecture makes that comparison explicit and principled.

The core tension is simple: complex models fit better, but simple models generalise better. Every framework for model comparison—Bayesian evidence, information criteria, cross-validation—is a different way of navigating that tension. Understanding when they agree, when they disagree, and why is the point of this lecture.


1. Why Model Comparison Is Hard

1.1 More Parameters Always Improve the Fit

Consider fitting a polynomial of degree $d$ to $N$ data points. As $d$ increases, the residual sum of squares decreases monotonically—a degree-$(N-1)$ polynomial passes through every data point exactly, giving $\chi^2 = 0$. By the metric of fit quality alone, the most complex model always wins.

This is not a useful criterion. A degree-19 polynomial fit to 20 noisy points has memorised the noise, not learned the signal. Its predictions at new $x$ values will be catastrophically wrong.
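To see this concretely, here is a small sketch (with made-up data: a straight line plus noise) showing the residual sum of squares falling monotonically as the polynomial degree grows, reaching zero at degree $N-1$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = np.linspace(-1, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, N)  # true model: a straight line

chi2 = []
for d in range(1, N):                      # degree 1 up to N-1
    coeffs = np.polyfit(x, y, d)
    resid = y - np.polyval(coeffs, x)
    chi2.append(np.sum(resid**2))
    print(f"degree {d}: chi^2 = {chi2[-1]:.6f}")
# chi^2 never increases with degree; degree N-1 interpolates every point
```

The degree-$(N-1)$ fit "wins" by this metric while being useless for prediction.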

1.2 The Maximum Likelihood Is Not Enough

You might hope to compare models by their maximum log-likelihood:

$$\log \mathcal{L}_\text{max}(\mathcal{M}_1) \quad \text{vs.} \quad \log \mathcal{L}_\text{max}(\mathcal{M}_2)$$

But the model with more parameters will always achieve a higher (or equal) maximum likelihood, because it has more freedom to fit the data. The maximum likelihood does not penalise complexity—it rewards it.

1.3 What We Need

A model comparison framework must balance two competing demands:

  1. Goodness of fit: the model should explain the observed data
  2. Parsimony: the model should not use more structure than the data require

Every method in this lecture implements this balance differently. Bayesian evidence integrates over the prior volume. Information criteria add an explicit penalty term. Cross-validation tests predictive performance on held-out data. They are not interchangeable—each encodes different assumptions—but for well-specified problems they usually agree.


2. Bayesian Model Comparison

2.1 The Posterior Over Models

Bayes’ theorem applies to models just as it applies to parameters. If $\mathcal{M}_k$ is a model and $\mathcal{D}$ is the observed data:

$$p(\mathcal{M}_k \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{M}_k)p(\mathcal{M}_k)}{p(\mathcal{D})}$$

where:

  • $p(\mathcal{M}_k \mid \mathcal{D})$ is the posterior probability of the model
  • $p(\mathcal{D} \mid \mathcal{M}_k)$ is the marginal likelihood (or evidence) for model $\mathcal{M}_k$
  • $p(\mathcal{M}_k)$ is the prior probability of the model
  • $p(\mathcal{D}) = \sum_k p(\mathcal{D} \mid \mathcal{M}_k) p(\mathcal{M}_k)$ is a normalising constant

The evidence is the quantity that does all the work. It is the probability of the observed data, averaged over the prior on the parameters:

$$p(\mathcal{D} \mid \mathcal{M}_k) = \int p(\mathcal{D} \mid \boldsymbol{\theta}_k, \mathcal{M}_k)p(\boldsymbol{\theta}_k \mid \mathcal{M}_k)\mathrm{d}\boldsymbol{\theta}_k$$

This is the same integral that appears as the normalising constant in Bayes’ theorem for parameters (lecture 4, Section 1.2). We ignored it then because it does not depend on $\boldsymbol{\theta}$. Now it is the central object of interest, because it does depend on the model.

2.2 The Bayes Factor

To compare two models, take the ratio of their posterior probabilities:

$$\frac{p(\mathcal{M}_2 \mid \mathcal{D})}{p(\mathcal{M}_1 \mid \mathcal{D})} = \frac{p(\mathcal{D} \mid \mathcal{M}_2)}{p(\mathcal{D} \mid \mathcal{M}_1)} \cdot \frac{p(\mathcal{M}_2)}{p(\mathcal{M}_1)}$$

The first factor on the right is the Bayes factor:

$$B_{21} = \frac{p(\mathcal{D} \mid \mathcal{M}_2)}{p(\mathcal{D} \mid \mathcal{M}_1)}$$

If we assign equal prior probabilities to both models ($p(\mathcal{M}_1) = p(\mathcal{M}_2)$), the posterior odds ratio equals the Bayes factor. In practice, equal model priors are the default unless you have strong reason to prefer one model a priori.

2.3 Interpreting Bayes Factors: The Jeffreys Scale

Harold Jeffreys proposed a scale for interpreting Bayes factors that remains widely used:

| $\log_{10} B_{21}$ | $B_{21}$ | Strength of evidence for $\mathcal{M}_2$ |
|---|---|---|
| $< 0$ | $< 1$ | Favours $\mathcal{M}_1$ |
| $0$ to $0.5$ | $1$ to $3$ | Barely worth mentioning |
| $0.5$ to $1$ | $3$ to $10$ | Substantial |
| $1$ to $2$ | $10$ to $100$ | Strong |
| $> 2$ | $> 100$ | Decisive |
⚠️
⚠️
The Jeffreys scale is a rough guide, not a decision boundary. A Bayes factor of 3.1 is not qualitatively different from 2.9. The categories are conventional labels for communication, not thresholds that trigger action. Always report the actual number alongside any verbal summary.
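For reporting, a tiny helper (purely illustrative; the function name and hard thresholds are our own, not a standard API) that converts the table into verbal labels:

```python
import numpy as np

def jeffreys_label(log10_B):
    """Verbal Jeffreys-scale label for log10 of the Bayes factor B_21."""
    if log10_B < 0:
        return "favours M1"
    if log10_B < 0.5:
        return "barely worth mentioning"
    if log10_B < 1:
        return "substantial"
    if log10_B < 2:
        return "strong"
    return "decisive"

for B in [0.5, 2.0, 5.0, 30.0, 300.0]:
    print(f"B_21 = {B:5.1f} -> {jeffreys_label(np.log10(B))}")
```

Report the numerical Bayes factor alongside whatever label this returns.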

2.4 A Worked Example: Linear vs. Quadratic

To make this concrete, consider fitting $N = 20$ data points generated from a true quadratic $y = 3 + 0.5x + 0.02x^2 + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, 1.75^2)$.

Model 1 (linear): $y = \theta_0 + \theta_1 x + \varepsilon$, with priors $\theta_0, \theta_1 \sim \mathcal{N}(0, 100^2)$.

Model 2 (quadratic): $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \varepsilon$, with priors $\theta_0, \theta_1, \theta_2 \sim \mathcal{N}(0, 100^2)$.

import numpy as np
from scipy.integrate import dblquad, tplquad
from scipy.stats import norm

# Generate data from a quadratic
np.random.seed(42)
N = 20
sigma_y = 1.75
x = np.sort(np.random.uniform(-5, 5, N))
y_true = 3.0 + 0.5 * x + 0.02 * x**2
y = y_true + np.random.normal(0, sigma_y, N)

def log_likelihood(y, y_pred, sigma):
    return np.sum(norm.logpdf(y, loc=y_pred, scale=sigma))

# --- Evidence for the linear model (2D integral) ---
def integrand_linear(theta0, theta1):
    y_pred = theta0 + theta1 * x
    ll = log_likelihood(y, y_pred, sigma_y)
    lp = norm.logpdf(theta0, 0, 100) + norm.logpdf(theta1, 0, 100)
    return np.exp(ll + lp)

evidence_linear, err_linear = dblquad(
    integrand_linear,
    -10, 10,    # theta1 limits (guided by data)
    -10, 20,    # theta0 limits
)

# --- Evidence for the quadratic model (3D integral) ---
def integrand_quadratic(theta0, theta1, theta2):
    y_pred = theta0 + theta1 * x + theta2 * x**2
    ll = log_likelihood(y, y_pred, sigma_y)
    lp = (norm.logpdf(theta0, 0, 100) + norm.logpdf(theta1, 0, 100)
           + norm.logpdf(theta2, 0, 100))
    return np.exp(ll + lp)

evidence_quadratic, err_quadratic = tplquad(
    integrand_quadratic,
    -5, 5,       # theta2 limits
    -10, 10,     # theta1 limits
    -10, 20,     # theta0 limits
)

B_21 = evidence_quadratic / evidence_linear
print(f"log Evidence (linear):    {np.log(evidence_linear):.2f}")
print(f"log Evidence (quadratic): {np.log(evidence_quadratic):.2f}")
print(f"Bayes factor B_21:        {B_21:.2f}")
print(f"log10(B_21):              {np.log10(B_21):.2f}")
ℹ️
Don’t be surprised if the simpler model wins. Even when the data were generated from the quadratic model, the Bayes factor can favour the linear model—especially with broad priors and modest signal-to-noise. This is Occam’s razor at work (Section 5): the linear model makes sharper predictions over a smaller parameter volume, and if the curvature is small compared to the noise, the data cannot distinguish the models. This is not a failure of the method. It is telling you that, given these data, the extra parameter is not justified.

3. Computing the Evidence

The evidence integral

$$p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, \mathcal{M})p(\boldsymbol{\theta} \mid \mathcal{M})\mathrm{d}\boldsymbol{\theta}$$

is the central computational challenge in Bayesian model comparison. Unlike parameter estimation, where MCMC can avoid computing the normalising constant, model comparison requires the normalising constant itself. And this integral is hard—often harder than the parameter estimation problem that motivated it.

3.1 Why It Is Hard

The difficulty is dimensional. The prior typically occupies a large volume in parameter space, but the likelihood is sharply peaked in a much smaller region. The evidence is the average of the likelihood over the prior, which means most of the prior volume contributes negligibly to the integral. Naive Monte Carlo sampling from the prior is therefore hopelessly inefficient for all but the simplest models: you would need to draw an astronomically large number of prior samples before any of them landed in the likelihood peak.
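A minimal one-parameter sketch makes this concrete. Using a toy conjugate model (Gaussian data with known noise, broad Gaussian prior on the mean—all values made up), we can compare the naive prior-sampling estimate against the exact evidence and count how few prior draws actually contribute:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, sigma, tau = 50, 1.0, 50.0
y = rng.normal(2.0, sigma, n)            # data: unknown mean, known sigma

# Naive Monte Carlo: average the likelihood over draws from the broad prior
S = 20_000
theta = rng.normal(0.0, tau, S)          # prior N(0, 50^2)
logL = np.sum(norm.logpdf(y[None, :], theta[:, None], sigma), axis=1)
logZ_naive = np.log(np.mean(np.exp(logL - logL.max()))) + logL.max()

# Exact evidence for this conjugate Gaussian-Gaussian model
a = n / sigma**2 + 1.0 / tau**2          # posterior precision
b = np.sum(y) / sigma**2
c = np.sum(y**2) / sigma**2
logZ_exact = (-0.5 * n * np.log(2 * np.pi * sigma**2)
              - 0.5 * np.log(tau**2 * a)
              - 0.5 * (c - b**2 / a))

# Only a tiny fraction of prior draws lands near the likelihood peak
frac_useful = np.mean(logL > logL.max() - 10)
print(f"log Z naive = {logZ_naive:.2f}, exact = {logZ_exact:.2f}")
print(f"useful fraction of prior draws: {frac_useful:.4f}")
```

Even in one dimension, only about one prior draw in a hundred contributes; in ten dimensions the useful fraction would be that number raised to the tenth power.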

3.2 The Harmonic Mean Estimator (and Why It Fails)

Given posterior samples $\{\boldsymbol{\theta}_s\}_{s=1}^{S}$ from MCMC, a tempting identity gives:

$$\frac{1}{p(\mathcal{D} \mid \mathcal{M})} = \int \frac{p(\boldsymbol{\theta} \mid \mathcal{M})}{p(\mathcal{D} \mid \boldsymbol{\theta}, \mathcal{M})p(\boldsymbol{\theta} \mid \mathcal{M})}p(\boldsymbol{\theta} \mid \mathcal{D}, \mathcal{M})\mathrm{d}\boldsymbol{\theta} = \mathbb{E}_{\text{post}}\left[\frac{1}{p(\mathcal{D} \mid \boldsymbol{\theta}, \mathcal{M})}\right]$$

This suggests the estimator:

$$\hat{p}(\mathcal{D} \mid \mathcal{M}) = \left[\frac{1}{S}\sum_{s=1}^{S} \frac{1}{p(\mathcal{D} \mid \boldsymbol{\theta}_s, \mathcal{M})}\right]^{-1}$$

which is the harmonic mean of the likelihoods evaluated at posterior samples. Its reciprocal is an unbiased estimator of $1/p(\mathcal{D} \mid \mathcal{M})$, and it can be computed from any set of posterior samples.

⚠️
The harmonic mean estimator is the worst Monte Carlo estimator ever invented. This is not hyperbole—it is a direct quote from Radford Neal. The estimator has infinite variance whenever the prior has heavier tails than the likelihood, which is almost always the case. In practice, a single posterior sample with an unusually low likelihood dominates the sum, producing estimates that fluctuate wildly between runs. It can appear to converge while being orders of magnitude wrong. Do not use it.

3.3 Nested Sampling

Nested sampling, introduced by John Skilling (2004), is the most widely used method for evidence computation in astronomy. The key insight is to transform the multi-dimensional evidence integral into a one-dimensional integral over the prior volume.

Define the prior volume enclosed within a likelihood contour $\mathcal{L}^* = L$:

$$X(L) = \int_{\mathcal{L}(\boldsymbol{\theta}) > L} p(\boldsymbol{\theta} \mid \mathcal{M})\mathrm{d}\boldsymbol{\theta}$$

As $L$ increases from 0 to $\mathcal{L}_\text{max}$, $X$ decreases from 1 to 0. The evidence is then:

$$p(\mathcal{D} \mid \mathcal{M}) = \int_0^1 L(X)\mathrm{d}X$$

This is a one-dimensional integral that can be approximated numerically. The algorithm works by maintaining a set of $n_\text{live}$ “live points” drawn from the prior, iteratively replacing the lowest-likelihood point with a new point drawn from the prior but constrained to have higher likelihood. At each iteration, the prior volume shrinks by a factor of approximately $n_\text{live} / (n_\text{live} + 1)$.

# Conceptual nested sampling (simplified)
# In practice, use dynesty, UltraNest, or MultiNest

def nested_sampling_simple(log_likelihood_fn, prior_sample_fn,
                           n_live=400, max_iter=10000):
    """
    Simplified nested sampling for pedagogical purposes.

    Parameters
    ----------
    log_likelihood_fn : callable
        Returns log-likelihood given parameter vector
    prior_sample_fn : callable
        Returns a sample drawn uniformly from the prior
    n_live : int
        Number of live points
    max_iter : int
        Maximum iterations
    """
    # Initialise live points from the prior
    live_points = np.array([prior_sample_fn() for _ in range(n_live)])
    live_logl = np.array([log_likelihood_fn(p) for p in live_points])

    log_evidence = -np.inf
    log_X = 0.0  # log of remaining prior volume

    dead_points = []
    dead_logl = []

    for i in range(max_iter):
        # Find worst live point
        worst = np.argmin(live_logl)
        L_threshold = live_logl[worst]

        # Shrink the prior volume
        log_shrink = -1.0 / n_live
        log_width = log_X + np.log(1 - np.exp(log_shrink))
        log_X += log_shrink

        # Accumulate evidence
        log_evidence = np.logaddexp(log_evidence,
                                     L_threshold + log_width)

        dead_points.append(live_points[worst].copy())
        dead_logl.append(L_threshold)

        # Replace worst point with a new sample above threshold
        # (In practice, this is the hard part — use constrained
        #  prior sampling, e.g., ellipsoidal decomposition)
        while True:
            new_point = prior_sample_fn()
            new_logl = log_likelihood_fn(new_point)
            if new_logl > L_threshold:
                live_points[worst] = new_point
                live_logl[worst] = new_logl
                break

        # Convergence check
        max_remaining = np.max(live_logl) + log_X
        if max_remaining < log_evidence - 5:  # negligible remaining
            break

    return log_evidence, np.array(dead_points), np.array(dead_logl)
ℹ️
In practice, use a dedicated nested sampling package. The hard part of nested sampling is the constrained prior sampling step: drawing new points uniformly from the prior within the likelihood contour. Packages like dynesty, UltraNest, and MultiNest/PyMultiNest implement sophisticated strategies (ellipsoidal decomposition, slice sampling, neural network proposals) to do this efficiently in high dimensions. The conceptual algorithm above illustrates the logic but would fail for non-trivial problems due to the rejection sampling step becoming prohibitively expensive.

3.4 Thermodynamic Integration

Thermodynamic integration (also called power posteriors or path sampling) connects the prior to the posterior via a continuous “temperature” parameter $\beta \in [0, 1]$:

$$p_\beta(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})^{\beta}\, p(\boldsymbol{\theta})$$

At $\beta = 0$, this is the prior. At $\beta = 1$, this is the posterior. The log-evidence is:

$$\log p(\mathcal{D} \mid \mathcal{M}) = \int_0^1 \mathbb{E}_{\beta}\left[\log p(\mathcal{D} \mid \boldsymbol{\theta})\right] \mathrm{d}\beta$$

where $\mathbb{E}_{\beta}[\cdot]$ denotes the expectation under $p_\beta$. In practice, you run MCMC chains at a sequence of temperatures $0 = \beta_0 < \beta_1 < \cdots < \beta_T = 1$, compute $\langle \log \mathcal{L} \rangle_{\beta_t}$ at each temperature, and numerically integrate.

def thermodynamic_integration(log_likelihood_fn, log_prior_fn,
                               sampler_fn, betas, n_samples=2000):
    """
    Estimate log-evidence via thermodynamic integration.

    Parameters
    ----------
    log_likelihood_fn : callable
        Log-likelihood function
    log_prior_fn : callable
        Log-prior function
    sampler_fn : callable
        sampler_fn(log_target, n_samples) returns samples from
        a distribution proportional to exp(log_target)
    betas : array
        Temperature schedule from 0 to 1
    n_samples : int
        Samples per temperature
    """
    mean_logl = np.zeros(len(betas))

    for i, beta in enumerate(betas):
        # Define the tempered log-posterior
        def log_target(theta):
            return beta * log_likelihood_fn(theta) + log_prior_fn(theta)

        # Sample from the tempered distribution
        samples = sampler_fn(log_target, n_samples)

        # Compute mean log-likelihood at this temperature
        logl_values = np.array([log_likelihood_fn(s) for s in samples])
        mean_logl[i] = np.mean(logl_values)

    # Numerical integration (trapezoidal rule)
    log_evidence = np.trapz(mean_logl, betas)
    return log_evidence

The advantage of thermodynamic integration is that it reuses MCMC machinery you already have. The disadvantage is that it requires many MCMC runs (one per temperature), so the total computational cost can be high.

3.5 The Laplace Approximation

For models where the posterior is approximately Gaussian (which it often is for well-constrained linear or mildly non-linear models), the evidence can be approximated analytically. Expand the log-posterior around its maximum $\hat{\boldsymbol{\theta}}$:

$$\log p(\mathcal{D} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}) \approx \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) p(\hat{\boldsymbol{\theta}}) - \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^\top \mathbf{H} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$$

where $\mathbf{H}$ is the Hessian of the negative log-posterior at the maximum. The Gaussian integral then gives:

$$p(\mathcal{D} \mid \mathcal{M}) \approx p(\mathcal{D} \mid \hat{\boldsymbol{\theta}})p(\hat{\boldsymbol{\theta}})(2\pi)^{d/2}|\mathbf{H}|^{-1/2}$$

where $d$ is the number of parameters. Taking the logarithm:

$$\log p(\mathcal{D} \mid \mathcal{M}) \approx \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|$$

This is fast (requires only the MAP estimate and the Hessian) and often surprisingly accurate. It fails when the posterior is multimodal, highly skewed, or has heavy tails. The GP marginal likelihood from lecture 11 has exactly this Gaussian-integral form—but there the integral is exact rather than approximate, because the function values enter the model linearly and are marginalised out analytically.
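As a sketch, we can apply the formula to a linear model like the one in Section 2.4 (regenerating similar data here so the block is self-contained). Because a linear model with Gaussian noise and Gaussian priors has an exactly Gaussian posterior, the Laplace result can be checked against the closed-form evidence:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(42)
N, sigma_y, prior_sd = 20, 1.75, 100.0
x = np.sort(rng.uniform(-5, 5, N))
y = 3.0 + 0.5 * x + rng.normal(0, sigma_y, N)

A = np.column_stack([np.ones(N), x])    # design matrix of the linear model

def neg_log_post(theta):
    ll = np.sum(norm.logpdf(y - A @ theta, 0, sigma_y))
    lp = np.sum(norm.logpdf(theta, 0, prior_sd))
    return -(ll + lp)

res = minimize(neg_log_post, np.zeros(2))
# Hessian of the negative log-posterior (exact for this linear-Gaussian model)
H = A.T @ A / sigma_y**2 + np.eye(2) / prior_sd**2

d = 2
logZ_laplace = (-res.fun + 0.5 * d * np.log(2 * np.pi)
                - 0.5 * np.linalg.slogdet(H)[1])

# Closed-form check: marginally, y ~ N(0, prior_sd^2 A A^T + sigma_y^2 I)
cov = prior_sd**2 * (A @ A.T) + sigma_y**2 * np.eye(N)
logZ_exact = multivariate_normal(np.zeros(N), cov).logpdf(y)
print(f"Laplace: {logZ_laplace:.3f}   exact: {logZ_exact:.3f}")
```

For genuinely non-Gaussian posteriors the two numbers would differ; the gap is a useful diagnostic of how trustworthy the approximation is.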

ℹ️
The BIC (Section 4.2) is a further approximation to the Laplace approximation. The Laplace approximation keeps the full Hessian; the BIC replaces the Hessian with an asymptotic scaling proportional to $N$. This makes the BIC cheaper to compute but less accurate for small samples.

4. Information Criteria

When computing the full evidence is too expensive, information criteria provide quick approximations. They all share the same structure: a goodness-of-fit term (the maximum log-likelihood) minus a complexity penalty. The criteria differ in how they define the penalty.

4.1 The Akaike Information Criterion (AIC)

The AIC is derived from an asymptotic approximation to the expected Kullback-Leibler divergence between the fitted model and the true data-generating distribution:

$$\text{AIC} = -2 \log \mathcal{L}_\text{max} + 2p$$

where $p$ is the number of free parameters and $\mathcal{L}_\text{max} = p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_\text{MLE})$ is the maximum likelihood.

Key properties:

  • The penalty is $2p$, independent of sample size $N$.
  • It estimates out-of-sample predictive accuracy, not the probability that the model is correct.
  • It is not consistent: as $N \to \infty$, if the true model is among the candidates, AIC does not converge to selecting it with probability 1 (it tends to select overly complex models).
  • It is efficient: among the candidate models, AIC asymptotically selects the one that minimises the prediction error, even if none of the candidates is correct.

For small samples, the corrected version is preferred:

$$\text{AIC}_c = \text{AIC} + \frac{2p(p+1)}{N - p - 1}$$

which adds a stronger penalty when $N$ is not much larger than $p$.

4.2 The Bayesian Information Criterion (BIC)

The BIC arises from the Laplace approximation to the log-evidence:

$$\text{BIC} = -2 \log \mathcal{L}_\text{max} + p \log N$$

Key properties:

  • The penalty grows with $N$ (as $p \log N$), so BIC penalises complexity more strongly than AIC for $N \geq 8$.
  • It approximates $-2 \log p(\mathcal{D} \mid \mathcal{M})$ up to a constant that does not depend on the model.
  • It is consistent: as $N \to \infty$, BIC selects the true model with probability 1 (if it is in the candidate set).
  • It is not efficient: it can select a model with worse predictive performance than the one AIC picks.

def compute_aic_bic(log_likelihood_max, n_params, n_data):
    """Compute AIC, AICc, and BIC."""
    aic = -2 * log_likelihood_max + 2 * n_params
    aicc = aic + 2 * n_params * (n_params + 1) / (n_data - n_params - 1)
    bic = -2 * log_likelihood_max + n_params * np.log(n_data)
    return aic, aicc, bic

# Example: compare polynomial fits of degree 1, 2, 3
for degree in [1, 2, 3]:
    coeffs = np.polyfit(x, y, degree, w=1.0/np.full(N, sigma_y))
    y_pred = np.polyval(coeffs, x)
    log_L = np.sum(norm.logpdf(y, loc=y_pred, scale=sigma_y))
    p = degree + 1  # number of coefficients
    aic, aicc, bic = compute_aic_bic(log_L, p, N)
    print(f"Degree {degree}: log L = {log_L:.1f}, "
          f"AIC = {aic:.1f}, AICc = {aicc:.1f}, BIC = {bic:.1f}")
ℹ️
AIC and BIC answer different questions. AIC asks: which model will make the best predictions on new data? BIC asks: which model most likely generated the data? These are not the same question, and the answers can differ—especially when the true model is simple and the sample is large (BIC wins) or when no candidate model is exactly correct and you care about prediction (AIC wins). Neither is universally better.

4.3 The Deviance Information Criterion (DIC)

The DIC was introduced for hierarchical models where the “number of parameters” is not well defined (e.g., in random-effects models or models with constraints). It uses the posterior mean $\bar{\boldsymbol{\theta}} = \mathbb{E}_\text{post}[\boldsymbol{\theta}]$ and defines an effective number of parameters:

$$p_D = 2\left[\log p(\mathcal{D} \mid \bar{\boldsymbol{\theta}}) - \mathbb{E}_\text{post}[\log p(\mathcal{D} \mid \boldsymbol{\theta})]\right]$$

The DIC is then:

$$\text{DIC} = -2 \log p(\mathcal{D} \mid \bar{\boldsymbol{\theta}}) + 2 p_D$$

The effective number of parameters $p_D$ is computed from posterior samples: it is twice the difference between the log-likelihood at the posterior mean and the posterior mean of the log-likelihood.

def compute_dic(log_likelihood_fn, posterior_samples):
    """
    Compute DIC from posterior samples.

    Parameters
    ----------
    log_likelihood_fn : callable
        Takes a parameter vector, returns log-likelihood
    posterior_samples : array, shape (n_samples, n_params)
        Samples from the posterior
    """
    # Log-likelihood at the posterior mean
    theta_bar = np.mean(posterior_samples, axis=0)
    ll_at_mean = log_likelihood_fn(theta_bar)

    # Posterior mean of the log-likelihood
    ll_samples = np.array([log_likelihood_fn(s) for s in posterior_samples])
    mean_ll = np.mean(ll_samples)

    # Effective number of parameters
    p_D = 2 * (ll_at_mean - mean_ll)

    # DIC
    dic = -2 * ll_at_mean + 2 * p_D
    return dic, p_D
⚠️
DIC has known problems. It is not invariant under reparameterisation, can give negative $p_D$ for non-Gaussian posteriors, and is not well defined for mixture models (where the posterior mean $\bar{\boldsymbol{\theta}}$ may correspond to a parameter combination that no single component uses). The WAIC (next section) fixes most of these issues and is generally preferred for Bayesian models.

4.4 The Widely Applicable Information Criterion (WAIC)

The WAIC (Watanabe, 2010) is the most theoretically principled information criterion for Bayesian models. It uses the full posterior predictive distribution, not just a point estimate:

$$\text{WAIC} = -2 \sum_{i=1}^{N} \log \left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \boldsymbol{\theta}_s)\right) + 2 p_\text{WAIC}$$

The first term is the log pointwise predictive density (lppd)—the log of the posterior-averaged likelihood for each data point, summed over data points. The second term is the effective number of parameters:

$$p_\text{WAIC} = \sum_{i=1}^{N} \text{Var}_\text{post}[\log p(y_i \mid \boldsymbol{\theta})]$$

This is the sum of the posterior variance of the log-likelihood contributions, computed pointwise.

def compute_waic(log_likelihood_pointwise):
    """
    Compute WAIC from pointwise log-likelihoods.

    Parameters
    ----------
    log_likelihood_pointwise : array, shape (n_samples, n_data)
        log_likelihood_pointwise[s, i] = log p(y_i | theta_s)
    """
    S, N = log_likelihood_pointwise.shape

    # lppd: log pointwise predictive density
    # Use logsumexp for numerical stability
    from scipy.special import logsumexp
    lppd_i = logsumexp(log_likelihood_pointwise, axis=0) - np.log(S)
    lppd = np.sum(lppd_i)

    # Effective number of parameters
    p_waic = np.sum(np.var(log_likelihood_pointwise, axis=0))

    # WAIC
    waic = -2 * (lppd - p_waic)
    return waic, p_waic, lppd

Why WAIC is preferred over DIC:

  1. It uses the full posterior, not just the posterior mean.
  2. It is invariant under reparameterisation.
  3. The effective number of parameters $p_\text{WAIC}$ is always positive and well defined.
  4. It asymptotically equals the leave-one-out cross-validation error (Section 6), giving it a clear predictive interpretation.

4.5 Summary of Information Criteria

| Criterion | Penalty | Requires posterior? | Assumptions |
|---|---|---|---|
| AIC | $2p$ | No (MLE only) | Large $N$, model near-Gaussian |
| AIC$_c$ | $2p + \frac{2p(p+1)}{N-p-1}$ | No | Finite-sample correction |
| BIC | $p \log N$ | No (MLE only) | Large $N$, Laplace approximation |
| DIC | $2p_D$ | Yes | Posterior mean is representative |
| WAIC | $2p_\text{WAIC}$ | Yes | Pointwise likelihoods available |

For problems where you have posterior samples (which you should, if you are doing Bayesian inference), WAIC is the default recommendation. If you only have a point estimate, use BIC for model identification or AIC for prediction.


5. The Occam Factor

5.1 Why Does Bayesian Inference Prefer Simpler Models?

The Bayesian evidence has a built-in preference for parsimony that emerges naturally from the mathematics, without any ad hoc penalty term. To see why, consider two models for the same data:

  • Model 1 has 1 parameter $\theta_1$ with a broad prior $p(\theta_1) = 1/\Delta\theta_1$
  • Model 2 has 2 parameters $(\theta_1, \theta_2)$ with broad priors $p(\theta_1, \theta_2) = 1/(\Delta\theta_1 \cdot \Delta\theta_2)$

Under the Laplace approximation, the evidence for each model is roughly:

$$p(\mathcal{D} \mid \mathcal{M}) \approx p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) \times \frac{\delta\theta_1 \cdot \delta\theta_2 \cdots \delta\theta_d}{\Delta\theta_1 \cdot \Delta\theta_2 \cdots \Delta\theta_d}$$

where $\delta\theta_j$ is the posterior width (the range of parameter values consistent with the data) and $\Delta\theta_j$ is the prior width. The ratio $\delta\theta_j / \Delta\theta_j$ is the Occam factor for parameter $j$: the fraction of the prior volume that is consistent with the data.

5.2 The Geometry of the Occam Factor

The evidence is the product of the best-fit likelihood and the Occam factor:

$$p(\mathcal{D} \mid \mathcal{M}) \approx \mathcal{L}_\text{max} \times \underbrace{\prod_{j=1}^{d} \frac{\delta\theta_j}{\Delta\theta_j}}_{\text{Occam factor}}$$

A complex model achieves a higher $\mathcal{L}_\text{max}$ (it fits the data better), but it pays a cost through the Occam factor: each extra parameter contributes a factor $\delta\theta_j / \Delta\theta_j < 1$. If the extra parameter does not substantially improve the fit, the Occam penalty outweighs the improvement, and the simpler model wins.

This is the automatic Occam’s razor of Bayesian inference. It does not need to be imposed—it is a consequence of probability theory. The GP marginal likelihood from lecture 11 is a continuous example of the same mechanism: the complexity-penalty term $-\frac{1}{2}\log|\mathbf{K} + \sigma_n^2 \mathbf{I}|$ is the Occam factor for the infinite-dimensional function space.

ℹ️
The Occam factor explains the linear-vs-quadratic result. In Section 2.4, the quadratic model has a third parameter $\theta_2$ with a prior width of $\Delta\theta_2 = 100$ (from the $\mathcal{N}(0, 100^2)$ prior). If the data constrain $\theta_2$ to a narrow posterior width $\delta\theta_2 \approx 0.05$, the Occam factor for this parameter alone is $0.05/100 = 5 \times 10^{-4}$. The quadratic needs to improve the fit by a factor of $1/(5 \times 10^{-4}) = 2000$ in likelihood to overcome this penalty. With noisy data and a small true curvature, it cannot.

5.3 The Role of Priors in Model Comparison

This analysis reveals something important: the evidence depends on the prior widths, not just the prior shape. Widening the prior on $\theta_2$ from $\mathcal{N}(0, 100^2)$ to $\mathcal{N}(0, 1000^2)$ makes the Occam penalty ten times worse for the quadratic model, even though the posterior is essentially unchanged.

For parameter estimation, vague priors are harmless—the posterior is dominated by the likelihood. For model comparison, vague priors are penalised, because they waste prior volume on parameter values that the data exclude. This means:

  1. Priors must be chosen carefully for model comparison—they are part of the model.
  2. Improper priors (e.g., flat over $(-\infty, +\infty)$) give a Bayes factor of zero or infinity and cannot be used.
  3. The sensitivity of the Bayes factor to the prior choice should be assessed (a prior sensitivity analysis).
⚠️
Improper priors make the evidence undefined. If $p(\boldsymbol{\theta} \mid \mathcal{M})$ does not integrate to a finite number, neither does the evidence integral. This is fine for parameter estimation (where $p(y)$ cancels) but fatal for model comparison. If you want to compute Bayes factors, every model must have proper, normalised priors.
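A quick way to see the prior-width sensitivity numerically (a sketch, using the closed-form evidence available for linear-Gaussian models and data resembling the Section 2.4 setup) is to vary only the prior width on $\theta_2$ and watch the log-evidence fall:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)
N, sigma_y = 20, 1.75
x = np.sort(rng.uniform(-5, 5, N))
y = 3.0 + 0.5 * x + 0.02 * x**2 + rng.normal(0, sigma_y, N)

def log_evidence(A, prior_sds):
    """Closed-form evidence for y = A @ theta + noise, Gaussian priors."""
    cov = (A * np.square(prior_sds)) @ A.T + sigma_y**2 * np.eye(N)
    return multivariate_normal(np.zeros(N), cov).logpdf(y)

A_quad = np.column_stack([np.ones(N), x, x**2])
logZ = {sd2: log_evidence(A_quad, np.array([100.0, 100.0, sd2]))
        for sd2 in (10.0, 100.0, 1000.0)}
for sd2, lz in logZ.items():
    print(f"prior sd on theta_2 = {sd2:6.0f}: log evidence = {lz:.2f}")
# each 10x widening of the theta_2 prior costs about log(10) ~ 2.3 in log Z
```

The posterior on $\theta_2$ is essentially identical in all three cases; only the Occam penalty changes.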

6. Cross-Validation

Cross-validation takes a fundamentally different approach: instead of computing the probability of the data under the model, it directly measures the model’s predictive performance on held-out data.

6.1 Leave-One-Out Cross-Validation (LOO-CV)

For each data point $y_i$, fit the model to all data except $y_i$, and compute the predictive probability of the held-out point:

$$p(y_i \mid \mathcal{D}_{-i}) = \int p(y_i \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta} \mid \mathcal{D}_{-i})\,\mathrm{d}\boldsymbol{\theta}$$

where $\mathcal{D}_{-i}$ is the dataset with point $i$ removed. The LOO-CV score is:

$$\text{LOO-CV} = \sum_{i=1}^{N} \log p(y_i \mid \mathcal{D}_{-i})$$

Higher is better (more probable predictions). In principle, this requires refitting the model $N$ times—once for each held-out point.

6.2 Pareto-Smoothed Importance Sampling (PSIS-LOO)

Refitting $N$ times is expensive. Vehtari, Gelman & Gabry (2017) showed that LOO-CV can be approximated from a single posterior fit using importance sampling:

$$p(y_i \mid \mathcal{D}_{-i}) \approx \frac{1}{\frac{1}{S}\sum_{s=1}^{S} \frac{1}{p(y_i \mid \boldsymbol{\theta}_s)}}$$

The weights $w_s^{(i)} \propto 1/p(y_i \mid \boldsymbol{\theta}_s)$ are importance weights that reweight the full-data posterior to approximate the leave-one-out posterior. Raw importance sampling is unstable (this is essentially the harmonic mean estimator in disguise for each data point), but Pareto smoothing of the weight tails stabilises it dramatically.

The arviz library implements this:

import arviz as az

# Assuming you have an InferenceData object from PyMC:
# loo_result = az.loo(inference_data, pointwise=True)
# print(loo_result)

# Manual computation from pointwise log-likelihoods:
def psis_loo_manual(log_lik_pointwise):
    """
    Simplified IS-LOO: importance sampling without the Pareto
    smoothing step (full PSIS also fits a Pareto distribution to
    the largest weights to stabilise the tails).

    Parameters
    ----------
    log_lik_pointwise : array, shape (n_samples, n_data)
    """
    from scipy.special import logsumexp

    S, N = log_lik_pointwise.shape
    loo_i = np.zeros(N)

    for i in range(N):
        # Importance weights (in log space)
        log_weights = -log_lik_pointwise[:, i]
        # Stabilise by subtracting the max
        log_weights -= np.max(log_weights)
        weights = np.exp(log_weights)
        weights /= np.sum(weights)

        # LOO predictive density (importance-weighted)
        loo_i[i] = logsumexp(log_lik_pointwise[:, i], b=weights)

    return np.sum(loo_i), loo_i
ℹ️
PSIS-LOO gives you diagnostics for free. The Pareto shape parameter $\hat{k}$ for each data point tells you how reliable the importance sampling approximation is. If $\hat{k} > 0.7$ for a point, the approximation is unreliable and you should refit without that point. Points with large $\hat{k}$ are highly influential observations—they are worth examining regardless of the model comparison question.

6.3 K-Fold Cross-Validation

When PSIS-LOO is unreliable (too many high-$\hat{k}$ points), $K$-fold cross-validation is the fallback. Partition the data into $K$ disjoint subsets, and for each fold, fit the model to $K-1$ subsets and evaluate on the held-out subset. The cross-validation score is the sum of the held-out log-likelihoods across folds.

$K = 10$ is conventional. $K = N$ recovers LOO-CV exactly.
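A minimal sketch of $K$-fold CV for the polynomial regression setting used throughout this lecture. For brevity it refits by maximum likelihood (weighted least squares) rather than running a full posterior fit per fold; the data-generating values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def kfold_cv_score(x, y, sigma_y, degree, K=10, seed=0):
    """K-fold CV score (sum of held-out log-likelihoods) for a
    maximum-likelihood polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)           # K disjoint subsets
    score = 0.0
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)  # the other K-1 subsets
        coeffs = np.polyfit(x[train], y[train], degree)
        y_pred = np.polyval(coeffs, x[held_out])
        score += norm.logpdf(y[held_out], loc=y_pred, scale=sigma_y).sum()
    return score

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-5, 5, 60))
y = 2.0 + 0.8 * x + rng.normal(0, 1.0, 60)
print(f"10-fold CV score (linear): {kfold_cv_score(x, y, 1.0, degree=1):.1f}")
```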

6.4 Connection Between LOO-CV and WAIC

LOO-CV and WAIC are asymptotically equivalent: as $N \to \infty$, they converge to the same quantity. For finite samples, LOO-CV (via PSIS) is generally more reliable because it does not rely on the asymptotic approximation. In practice, for well-behaved models, the numerical differences between WAIC and PSIS-LOO are small.

This means you have two routes to the same destination. WAIC is computed directly from pointwise log-likelihoods and requires no importance sampling. PSIS-LOO is more robust but requires the importance weight machinery. Use whichever is easier to compute for your problem, and cross-check with the other if the result is borderline.
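For reference, here is the WAIC side of that cross-check, computed directly from the same (samples × data) log-likelihood array used above. The toy posterior below (a Gaussian mean with known noise) is an illustrative stand-in for real MCMC output.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def waic(log_lik_pointwise):
    """WAIC from an (S, N) array of pointwise log-likelihoods.

    lppd_i = log mean_s p(y_i | theta_s)
    p_waic = sum_i Var_s[log p(y_i | theta_s)]  (effective parameters)
    WAIC   = -2 * (lppd - p_waic)               (deviance scale; lower is better)
    """
    S, N = log_lik_pointwise.shape
    lppd = logsumexp(log_lik_pointwise, axis=0) - np.log(S)
    p_waic = np.var(log_lik_pointwise, axis=0, ddof=1)
    elpd_waic = lppd.sum() - p_waic.sum()
    return -2 * elpd_waic, p_waic.sum()

# Toy posterior: one-parameter Gaussian-mean model, known sigma = 1
rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, 40)
mu_samples = rng.normal(y.mean(), 1.0 / np.sqrt(40), 4000)
log_lik = norm.logpdf(y[None, :], loc=mu_samples[:, None], scale=1.0)

waic_value, p_eff = waic(log_lik)
print(f"WAIC = {waic_value:.1f}, effective parameters = {p_eff:.2f}")
```

For this one-parameter model the effective parameter count comes out close to 1, as it should.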


7. Posterior Predictive Model Checking

Model comparison tells you which model is relatively better. It does not tell you whether any model is absolutely adequate. A model can win a comparison and still be a poor description of the data. Posterior predictive checks (PPCs) fill this gap.

You encountered PPCs in lecture 4 (the Bayesian workflow) and used them implicitly in lecture 5 (residual diagnostics). The idea is simple: if the model is correct, data simulated from the fitted model should look like the observed data.

7.1 The Posterior Predictive Distribution

The posterior predictive distribution for new data $\tilde{y}$ is:

$$p(\tilde{y} \mid \mathcal{D}) = \int p(\tilde{y} \mid \boldsymbol{\theta})p(\boldsymbol{\theta} \mid \mathcal{D})\mathrm{d}\boldsymbol{\theta}$$

In practice, you generate replicated datasets by:

  1. Drawing $\boldsymbol{\theta}_s$ from the posterior samples
  2. Simulating $\tilde{y}_s \sim p(\tilde{y} \mid \boldsymbol{\theta}_s)$ from the likelihood

Each replicated dataset $\tilde{y}_s$ is one possible realisation of new data under the fitted model.
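The two-step recipe above is a few lines of vectorised code. This sketch assumes a simple Gaussian model $y \sim \mathcal{N}(\mu, \sigma)$ with hypothetical posterior draws `mu_samples`; for your own model, replace the likelihood simulation accordingly.

```python
import numpy as np

# Hypothetical posterior draws for the mean of a Gaussian model
rng = np.random.default_rng(3)
mu_samples = rng.normal(5.0, 0.1, 2000)   # step 1: S posterior draws
sigma = 1.5                                # known noise level (illustrative)
N = 50                                     # size of the observed dataset

# Step 2: one replicated dataset of size N per posterior draw.
# Broadcasting the (S, 1) location against size (S, N) ties each
# replicate to a single theta_s, as the recipe requires.
y_rep = rng.normal(loc=mu_samples[:, None], scale=sigma, size=(2000, N))
print(y_rep.shape)  # (2000, 50): S replicated datasets
```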

7.2 Test Statistics

A PPC compares a test statistic $T(y)$ computed on the real data to the distribution $T(\tilde{y}_s)$ computed on replicated data. If $T(y)$ lies in the tails of the replicated distribution, the model fails to capture whatever feature $T$ measures.

Common test statistics for regression problems:

  • Mean and variance of residuals
  • Maximum absolute residual (sensitivity to outliers)
  • Autocorrelation of residuals at lag 1 (sensitivity to correlated structure)
  • Kolmogorov-Smirnov statistic comparing residual distribution to $\mathcal{N}(0,1)$

def posterior_predictive_check(y_obs, y_rep_samples, test_statistic):
    """
    Perform a posterior predictive check.

    Parameters
    ----------
    y_obs : array, shape (N,)
        Observed data
    y_rep_samples : array, shape (S, N)
        Replicated datasets from the posterior predictive
    test_statistic : callable
        Function that takes a data vector and returns a scalar

    Returns
    -------
    T_obs : float
        Test statistic on observed data
    T_rep : array
        Test statistics on replicated datasets
    p_value : float
        Bayesian p-value (fraction of T_rep >= T_obs)
    """
    T_obs = test_statistic(y_obs)
    T_rep = np.array([test_statistic(yr) for yr in y_rep_samples])
    p_value = np.mean(T_rep >= T_obs)
    return T_obs, T_rep, p_value

# Example: check whether the model captures the variance.
# This is a plug-in check: replicates are drawn around the best-fit
# line rather than full posterior draws (assumes x, y, N, sigma_y defined).
coeffs = np.polyfit(x, y, 1)
y_fit = np.polyval(coeffs, x)
y_rep = np.array([
    y_fit + np.random.normal(0, sigma_y, N)
    for _ in range(1000)
])

T_obs, T_rep, pval = posterior_predictive_check(
    y, y_rep, test_statistic=np.var
)
print(f"Observed variance: {T_obs:.2f}")
print(f"Replicated mean:   {np.mean(T_rep):.2f}")
print(f"Bayesian p-value:  {pval:.3f}")
⚠️
A Bayesian p-value near 0 or 1 indicates model failure. Values near 0.5 indicate consistency. But unlike a frequentist p-value, this is not a hypothesis test—there is no fixed significance threshold. The purpose is diagnostic: if the p-value is extreme for a particular test statistic, investigate why. Does the model miss a trend? Underestimate the scatter? Fail to capture outliers? The diagnostic guides model improvement, not accept/reject decisions.

7.3 PPCs and Model Comparison Together

The correct workflow uses both tools:

  1. PPC first: check that each candidate model adequately describes the data. If a model fails basic PPCs, eliminate it before comparing.
  2. Compare survivors: use Bayes factors, information criteria, or cross-validation to rank the models that pass the PPCs.
  3. Iterate: if no model passes, build a better one using the PPC failures as guidance (this is the workflow loop from lecture 4).

8. Common Pitfalls

8.1 The Lindley Paradox

The Lindley paradox arises when a frequentist test rejects the null hypothesis (small $p$-value) but the Bayesian analysis strongly favours the null (large Bayes factor for $\mathcal{M}_0$). This happens when:

  1. The data show a statistically significant but scientifically small effect.
  2. The prior on the alternative model’s parameter is very broad.

Under these conditions, the Occam factor destroys the alternative model. The broad prior wastes so much volume on parameter values far from the data that the evidence is low, even though the maximum likelihood is higher.

The resolution is not that one framework is right and the other wrong. They are answering different questions. The frequentist test asks: “Is the effect distinguishable from zero?” The Bayesian comparison asks: “Given my prior beliefs about plausible effect sizes, does adding this parameter improve the model enough to justify the added complexity?”

In practice, if you encounter the Lindley paradox, the first thing to check is your prior. A prior of $\mathcal{N}(0, 10^6)$ on an effect that could plausibly be at most $\mathcal{O}(1)$ is not “uninformative”—it is misinformative for model comparison.

8.2 Sensitivity to Prior Choice

Because the evidence depends on the prior volume (Section 5.3), Bayes factors can be sensitive to the prior specification in ways that do not affect parameter estimation. Best practices:

  1. Use informative priors motivated by physics. “What range of values is plausible for this parameter?” is a question you can usually answer.
  2. Report results for multiple prior choices. If the Bayes factor changes sign when you double the prior width, the data are not informative enough for model comparison.
  3. Consider the Savage-Dickey density ratio for nested models, which expresses the Bayes factor in terms of the posterior and prior density at a single point (the restriction defining the simpler model).
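A sketch of the Savage-Dickey ratio for a nested comparison $\mathcal{M}_0\!: \beta = 0$ inside $\mathcal{M}_1$: the Bayes factor $B_{01}$ is the posterior density at $\beta = 0$ divided by the prior density there. The posterior samples and prior width below are hypothetical, and the KDE evaluation is the usual practical shortcut.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Hypothetical posterior draws for beta under M1, and its N(0, 2^2) prior
rng = np.random.default_rng(4)
beta_post = rng.normal(0.3, 0.2, 10000)
prior_sd = 2.0

# Savage-Dickey: B_01 = p(beta = 0 | D, M1) / p(beta = 0 | M1)
posterior_at_zero = gaussian_kde(beta_post)(0.0)[0]  # KDE density estimate
prior_at_zero = norm.pdf(0.0, loc=0.0, scale=prior_sd)
bf_01 = posterior_at_zero / prior_at_zero
print(f"B_01 = {bf_01:.2f}")
```

Note the Lindley-paradox mechanics in miniature: widening `prior_sd` lowers the prior density at zero and therefore raises $B_{01}$, favouring the simpler model even though the posterior is unchanged.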

8.3 Improper Priors

As noted in Section 5.3, improper priors produce undefined evidence values. If you are using flat (uniform over $\mathbb{R}$) priors for convenience in parameter estimation, you must replace them with proper priors before computing Bayes factors. A common strategy is to use the same proper prior in both models for shared parameters, so that the Occam penalty applies only to the parameters that differ.

8.4 Comparing Non-Nested Models

AIC and BIC make no assumption about whether models are nested (i.e., whether one is a special case of the other). Neither does the Bayes factor. You can compare a GP model against a polynomial, or a mixture model against a single-component model, or a Student-$t$ likelihood against a Gaussian likelihood. The only requirement is that all models define a proper likelihood for the same observed data $\mathcal{D}$.

However, be cautious when comparing models that parameterise the data differently (e.g., one model fits magnitudes and another fits fluxes). The likelihoods must be on the same scale—which means including the appropriate Jacobian if you transform the data.
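A concrete sketch of the Jacobian bookkeeping, using the flux-magnitude transform $m = -2.5\log_{10} F$. The model parameters and likelihood values are hypothetical; the point is the $\log|\mathrm{d}m/\mathrm{d}F|$ term that puts both models on the same scale.

```python
import numpy as np
from scipy.stats import norm

F = 120.0                       # observed flux (arbitrary units)
m = -2.5 * np.log10(F)          # the same observation in magnitudes

# Hypothetical per-point log-likelihoods from two competing models
logL_flux = norm.logpdf(F, loc=118.0, scale=5.0)   # model fit in flux
logL_mag = norm.logpdf(m, loc=-5.2, scale=0.05)    # model fit in magnitudes

# Jacobian of the flux -> magnitude transform: |dm/dF| = 2.5 / (F ln 10).
# p_F(F) = p_m(m(F)) |dm/dF|, so add log|dm/dF| to the magnitude-space
# log-likelihood before comparing it to the flux-space one.
log_jac = np.log(2.5 / (F * np.log(10)))
logL_mag_on_flux_scale = logL_mag + log_jac
```

Omitting `log_jac` silently compares densities in different units, which can flip the sign of a Bayes factor or a $\Delta$BIC.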

8.5 The Number-of-Components Problem

In mixture models (lecture 13), choosing $K$ is a model comparison problem with a special complication: the models are not nested in the usual sense, because the labelling of components is arbitrary (the “label switching” problem). BIC works reasonably well despite this, as we saw in lecture 13. For a fully Bayesian treatment, non-parametric mixture models (e.g., Dirichlet process mixtures) let the data determine $K$, but they are computationally expensive and outside the scope of this course.


9. A Practical Decision Framework

Given the profusion of methods, here is a practical guide for deciding which to use:

9.1 The Quick Check

For a quick comparison between a small number of models:

  1. Compute AIC and BIC for each model. These require only the maximum log-likelihood and the parameter count. If they agree, you probably have your answer.
  2. If they disagree, the data are in a regime where the penalty matters. Proceed to a more careful analysis.

9.2 The Bayesian Analysis

If you have posterior samples (which you should for non-trivial problems):

  1. Compute WAIC or PSIS-LOO. These are the most principled information criteria and can be computed directly from posterior samples. Use arviz for convenience.
  2. Run posterior predictive checks. Ensure all candidate models pass basic adequacy tests before comparing them.
  3. If you need the actual evidence (e.g., for reporting a Bayes factor), use nested sampling with dynesty or UltraNest.

9.3 The Full Treatment

For high-stakes model comparisons (e.g., claiming a detection):

  1. Nested sampling for the evidence, with error estimates.
  2. Prior sensitivity analysis: how does the Bayes factor change with the prior width?
  3. Multiple comparison metrics: do AIC/BIC, WAIC, LOO-CV, and the Bayes factor agree?
  4. PPCs: does the winning model actually describe the data well?
# Complete example: comparing linear, quadratic, and cubic models
# with multiple criteria

import numpy as np
from scipy.stats import norm

np.random.seed(42)
N = 30
sigma_y = 2.0
x = np.sort(np.random.uniform(-5, 5, N))
y_true = 2.0 + 0.8 * x + 0.05 * x**2
y = y_true + np.random.normal(0, sigma_y, N)

results = {}
for name, degree in [("Linear", 1), ("Quadratic", 2), ("Cubic", 3)]:
    coeffs = np.polyfit(x, y, degree, w=1.0/np.full(N, sigma_y))
    y_pred = np.polyval(coeffs, x)
    p = degree + 1

    # Maximum log-likelihood
    log_L = np.sum(norm.logpdf(y, loc=y_pred, scale=sigma_y))

    # Information criteria
    aic = -2 * log_L + 2 * p
    aicc = aic + 2 * p * (p + 1) / (N - p - 1)
    bic = -2 * log_L + p * np.log(N)

    # Reduced chi-squared
    chi2_red = np.sum(((y - y_pred) / sigma_y)**2) / (N - p)

    results[name] = {
        "p": p, "log_L": log_L,
        "AIC": aic, "AICc": aicc, "BIC": bic,
        "chi2_red": chi2_red
    }

# Print comparison table
print(f"{'Model':<12} {'p':>3} {'log L':>8} {'AIC':>8} "
      f"{'AICc':>8} {'BIC':>8} {'chi2/dof':>8}")
print("-" * 60)
for name, r in results.items():
    print(f"{name:<12} {r['p']:>3} {r['log_L']:>8.1f} {r['AIC']:>8.1f} "
          f"{r['AICc']:>8.1f} {r['BIC']:>8.1f} {r['chi2_red']:>8.2f}")

10. Astronomical Applications

10.1 Exoplanet Detection: Signal or Noise?

The canonical model comparison problem in exoplanet science is: does the radial velocity data contain a planetary signal, or is a flat line (plus noise) sufficient?

  • Model 0: $v(t) = v_0 + \varepsilon(t)$, a constant velocity plus white noise
  • Model 1: $v(t) = v_0 + K \sin(2\pi t / P + \phi) + \varepsilon(t)$, a Keplerian signal

Model 1 has three extra parameters ($K$, $P$, $\phi$). The maximum likelihood of Model 1 is always at least as high as Model 0. The question is whether the improvement justifies the extra parameters.
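A simplified sketch of this comparison using maximum-likelihood fits and $\Delta$BIC in place of the full evidence (the injected signal, noise level, and optimiser starting points are illustrative). Note the cheat: the Keplerian fit starts near the true period, sidestepping the multimodal period search that makes the real problem hard.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated RV data with an injected sinusoid: K = 8 m/s, P = 12 d
rng = np.random.default_rng(5)
t = np.sort(rng.uniform(0, 100, 40))
sigma_v = 2.0
v = 3.0 + 8.0 * np.sin(2 * np.pi * t / 12.0 + 0.5) \
    + rng.normal(0, sigma_v, 40)

def neg_ll_flat(theta):           # Model 0: constant velocity
    return -np.sum(norm.logpdf(v, loc=theta[0], scale=sigma_v))

def neg_ll_kepler(theta):         # Model 1: circular-orbit sinusoid
    v0, K, P, phi = theta
    model = v0 + K * np.sin(2 * np.pi * t / P + phi)
    return -np.sum(norm.logpdf(v, loc=model, scale=sigma_v))

res0 = minimize(neg_ll_flat, x0=[0.0], method="Nelder-Mead")
# In practice you would scan a period grid: the likelihood is
# sharply multimodal in P. Here we start near the injected period.
res1 = minimize(neg_ll_kepler, x0=[0.0, 5.0, 12.0, 0.0],
                method="Nelder-Mead")

N = len(t)
bic0 = 2 * res0.fun + 1 * np.log(N)   # 1 parameter
bic1 = 2 * res1.fun + 4 * np.log(N)   # 4 parameters
print(f"Delta BIC = {bic0 - bic1:.1f} (positive favours the Keplerian)")
```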

In practice, the noise is often correlated (stellar activity), so $\varepsilon(t)$ is modelled with a GP (lecture 11). The model comparison then becomes: does adding a Keplerian to the GP improve the evidence? This is where nested sampling earns its keep—the evidence integrals are high-dimensional and multimodal (the period $P$ creates sharp likelihood peaks), and MCMC-based evidence estimates are unreliable.

10.2 How Many Components?

In lecture 13 we used BIC to determine the number of Gaussian components in a mixture model. The Bayesian approach uses the evidence:

$$p(\mathcal{D} \mid K) = \int p(\mathcal{D} \mid \boldsymbol{\theta}_K, K)p(\boldsymbol{\theta}_K \mid K)\mathrm{d}\boldsymbol{\theta}_K$$

For mixture models, this integral is particularly challenging because of label switching (the likelihood is invariant under permutations of the component labels). Nested sampling handles this gracefully by exploring the full prior volume, including all label-switched modes.

The practical workflow:

  1. Fit the mixture model for $K = 1, 2, \ldots, K_\text{max}$ using nested sampling.
  2. Compare the evidences (or equivalently, the Bayes factors $B_{K,K-1}$).
  3. Select the $K$ with the highest evidence, subject to a PPC confirming that the model describes the data.

10.3 Model Selection for Spectral Fitting

When fitting an astronomical spectrum, you might consider:

  • Is the absorption line a single Gaussian or a double (blended) feature?
  • Does the continuum require a polynomial of degree 2 or 3?
  • Is there a broad emission component underlying the narrow absorption?

Each question is a model comparison. The BIC is fast and often sufficient for these problems, because the models are simple (few parameters), the data are abundant (many spectral pixels), and the likelihoods are well approximated by Gaussians.

# Example: single vs double Gaussian absorption line
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def single_gaussian(x, theta):
    """Single Gaussian absorption line on a flat continuum."""
    A, mu, sigma, cont = theta
    return cont - A * np.exp(-0.5 * ((x - mu) / sigma)**2)

def double_gaussian(x, theta):
    """Double Gaussian absorption (blended line)."""
    A1, mu1, sig1, A2, mu2, sig2, cont = theta
    g1 = A1 * np.exp(-0.5 * ((x - mu1) / sig1)**2)
    g2 = A2 * np.exp(-0.5 * ((x - mu2) / sig2)**2)
    return cont - g1 - g2

# Simulated spectral data
np.random.seed(123)
wavelength = np.linspace(5000, 5020, 200)
noise_level = 0.02
# True model: two blended lines
true_flux = (1.0
    - 0.3 * np.exp(-0.5 * ((wavelength - 5008) / 0.8)**2)
    - 0.15 * np.exp(-0.5 * ((wavelength - 5010.5) / 0.6)**2))
flux = true_flux + np.random.normal(0, noise_level, len(wavelength))

# Fit single Gaussian
def neg_ll_single(theta):
    model = single_gaussian(wavelength, theta)
    return -np.sum(norm.logpdf(flux, loc=model, scale=noise_level))

res1 = minimize(neg_ll_single, x0=[0.3, 5009, 1.5, 1.0], method="Nelder-Mead")
ll_single = -res1.fun

# Fit double Gaussian
def neg_ll_double(theta):
    model = double_gaussian(wavelength, theta)
    return -np.sum(norm.logpdf(flux, loc=model, scale=noise_level))

res2 = minimize(neg_ll_double,
                x0=[0.3, 5008, 0.8, 0.15, 5010.5, 0.6, 1.0],
                method="Nelder-Mead")
ll_double = -res2.fun

# Compare with BIC
bic_single = -2 * ll_single + 4 * np.log(len(wavelength))
bic_double = -2 * ll_double + 7 * np.log(len(wavelength))
delta_bic = bic_single - bic_double  # positive favours double

print(f"Single Gaussian: log L = {ll_single:.1f}, BIC = {bic_single:.1f}")
print(f"Double Gaussian: log L = {ll_double:.1f}, BIC = {bic_double:.1f}")
print(f"Delta BIC = {delta_bic:.1f} (positive favours double)")

11. Connections to Earlier Lectures

Model comparison is not a standalone topic. It connects to nearly everything in this course:

The Bayesian workflow (lecture 4) includes model checking as step 6. This lecture formalises the comparison step and adds quantitative tools (evidence, information criteria) to the qualitative checks (PPCs) we have been using throughout.

Fitting a line (lectures 5–6): the reduced $\chi^2$ diagnostic ($\chi^2_\nu \approx 1$) is a rudimentary model check. Formally, it answers the question “is this model consistent with the data?” but not “is this model better than that one?”

The GP marginal likelihood (lectures 10–11) is exactly the Bayesian evidence for a GP model. When you optimise GP hyperparameters by maximising $\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi})$, you are doing Type-II model comparison: selecting the kernel hyperparameters that best explain the data, with the Occam factor automatically penalising overly flexible kernels.

Hierarchical models (lecture 12): model comparison applies at the hyperparameter level. Does the population need a log-normal distribution or a mixture? The evidence for each choice integrates over all group-level parameters and the hyperparameters.

Mixture models (lecture 13): choosing $K$ is model comparison. The BIC worked well in Section 9 of that lecture. Now you understand why: BIC approximates $-2 \log p(\mathcal{D} \mid K)$, and the $p \log N$ penalty approximates the Occam factor for each additional component.


12. Summary

The model comparison problem asks: given competing models, which should you prefer? The core tension is between fit quality (complex models always fit better) and parsimony (simple models generalise better).

Bayesian model comparison uses the marginal likelihood (evidence) $p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}) \mathrm{d}\boldsymbol{\theta}$, which averages the likelihood over the prior. The ratio of evidences is the Bayes factor, interpreted via the Jeffreys scale. The evidence naturally penalises complexity through the Occam factor: the fraction of the prior volume consistent with the data.

Computing the evidence is the hard part. The harmonic mean estimator is useless. Nested sampling (use dynesty or UltraNest) is the standard in astronomy. Thermodynamic integration and the Laplace approximation are alternatives for specific problems.

Information criteria approximate model comparison without computing the full evidence. AIC estimates predictive accuracy; BIC approximates the evidence; DIC and WAIC extend these to Bayesian models with effective parameter counts. WAIC is the default recommendation when posterior samples are available.

Cross-validation directly measures predictive performance. LOO-CV, efficiently approximated via PSIS-LOO, is asymptotically equivalent to WAIC and provides per-point diagnostics that identify influential observations.

Posterior predictive checks test whether a model is absolutely adequate, not just relatively better. Always PPC before comparing. A model that fails basic PPCs should not be in the comparison set.

Common pitfalls: the Lindley paradox (broad priors destroy the evidence for the alternative), sensitivity to prior width (priors matter for model comparison even when they do not matter for parameter estimation), improper priors (make the evidence undefined), and the temptation to use the harmonic mean estimator (do not).

The practical workflow: compute AIC/BIC for a quick check; use WAIC or PSIS-LOO for a careful Bayesian analysis; run nested sampling for a rigorous evidence computation; and always run PPCs to verify adequacy.


Further Reading

  • Trotta, R. (2008). Bayes in the sky: Bayesian inference and model selection in cosmology. Contemporary Physics, 49, 71. [Excellent review of Bayesian model comparison in astrophysics]
  • Kass, R. E. & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773. [The foundational reference on Bayes factors and their interpretation]
  • Skilling, J. (2004). Nested sampling. AIP Conference Proceedings, 735, 395. [The original nested sampling paper]
  • Speagle, J. S. (2020). dynesty: a dynamic nested sampling package for estimating Bayesian posteriors and evidences. MNRAS, 493, 3132. [The dynesty paper — the most widely used nested sampler in astronomy]
  • Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413. [The PSIS-LOO paper — essential for modern Bayesian model comparison]
  • Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997. [Clear exposition of AIC, BIC, DIC, and WAIC]
  • Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular statistical models. JMLR, 11, 3571. [The original WAIC paper]
  • Jeffreys, H. (1961). Theory of Probability, 3rd edition. Oxford University Press. [The original Jeffreys scale for Bayes factor interpretation]
  • Nelson, B. E. et al. (2020). Quantifying the Bayesian evidence for a planet in radial velocity data. AJ, 159, 73. [Practical application of nested sampling for exoplanet detection]