Hierarchical Models

Every model we have built so far treats each dataset as an island: we observe data, define a likelihood, choose priors, and sample the posterior for that one system. But in astronomy we almost never study a single object in isolation. We observe populations—hundreds of stars, thousands of galaxies, millions of transients—and the scientific question is about the population, not any individual member. Hierarchical models are the principled Bayesian framework for learning about populations from noisy measurements of individuals, and they are the single most important modelling idea you will encounter in modern astronomical data analysis.

This lecture develops hierarchical models from first principles. We start with the conceptual motivation (why fitting each object independently is wasteful, and why averaging everything together is wrong), build the mathematical machinery (hyperparameters, the joint posterior, marginalisation), work through a complete astronomical example in Python, and end with the computational realities you will face when fitting these models in practice.

The punchline is reassuringly familiar: a hierarchical model is just the Bayesian workflow from lectures 4–6 applied at two levels instead of one. If you can write a generative model and sample a posterior, you already know how to build a hierarchical model. The hard part is recognising when you should.


1. The Problem: Populations of Objects

Suppose you measure the radial velocity semi-amplitude $K_j$ for each of $J$ exoplanets, each with its own measurement uncertainty $\sigma_j$. You want to know: what is the distribution of semi-amplitudes in the underlying population?

You have three options:

1.1 No Pooling (Separate Estimates)

Fit each planet independently. Planet $j$ gives you an estimate $\hat{K}_j \pm \sigma_j$. You then have $J$ separate posteriors, each using only its own data.

The problem: for planets with few observations or large $\sigma_j$, the individual estimates are noisy and unreliable. You have $J$ weakly constrained posteriors and no way to borrow strength across them.

1.2 Complete Pooling (Single Estimate)

Ignore the fact that different planets have different $K$ values. Assume all planets share a single $K$ and estimate it from all data simultaneously.

The problem: this throws away real variation. If the true $K$ values span a range, complete pooling gives you a single number that does not describe any individual planet well and does not describe the population accurately either.

1.3 Partial Pooling (Hierarchical Model)

Assume the individual $K_j$ values are drawn from a population distribution with unknown parameters (e.g., a log-normal with unknown mean and variance). Fit the population parameters and the individual $K_j$ values simultaneously.

This is the hierarchical model. Each planet’s estimate is informed by its own data and by what the population tells you is plausible. Noisy estimates get “shrunk” toward the population mean; well-measured ones stay close to their data. The population distribution is simultaneously learned from all the individual estimates.

ℹ️
Partial pooling is not a compromise—it is optimal. In a hierarchical model, the degree of pooling is not a tuning parameter you set by hand. It emerges automatically from the posterior. Objects with precise measurements contribute more to the population estimate, and the population estimate has more influence on poorly measured objects. The data decide how much pooling is appropriate.

2. The Structure of a Hierarchical Model

2.1 The DAG

The defining feature of a hierarchical model is its layered structure. In DAG notation:

$$\boldsymbol{\phi} \longrightarrow \theta_j \longrightarrow y_{ij}$$

where:

  • $\boldsymbol{\phi}$ are the hyperparameters — parameters of the population distribution (e.g., the mean and standard deviation of the distribution from which the $\theta_j$ are drawn)
  • $\theta_j$ are the group-level parameters — one set per group $j$ (e.g., the true radial velocity amplitude of planet $j$)
  • $y_{ij}$ are the observations — the $i$-th measurement of group $j$ (e.g., the $i$-th radial velocity measurement of planet $j$)

The “plate” around the group-level parameters and observations indicates replication: there are $J$ groups, each potentially with $n_j$ observations.

In the most common case with a single observation per group (which we will develop in detail), the DAG simplifies to:

$$\boldsymbol{\phi} \longrightarrow \theta_j \longrightarrow y_j$$

where $y_j$ is the single noisy measurement of the $j$-th object. This is the structure we will use for our main example.

2.2 The Generative Model

A hierarchical model is fully specified by writing down the generative process—the recipe for simulating data. For the single-observation-per-group case:

  1. Draw hyperparameters from their prior: $\boldsymbol{\phi} \sim p(\boldsymbol{\phi})$
  2. For each group $j = 1, \ldots, J$: draw the true parameter from the population distribution: $\theta_j \sim p(\theta_j \mid \boldsymbol{\phi})$
  3. For each group $j$: generate the observation: $y_j \sim p(y_j \mid \theta_j)$

The population distribution $p(\theta_j \mid \boldsymbol{\phi})$ is the key ingredient. It encodes the structure you believe exists across the population. Common choices include:

| Population model | $p(\theta_j \mid \boldsymbol{\phi})$ | Hyperparameters $\boldsymbol{\phi}$ | When to use |
|---|---|---|---|
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | $\mu, \sigma$ | Symmetric, unbounded quantities |
| Log-normal | $\text{LogNormal}(\mu, \sigma^2)$ | $\mu, \sigma$ | Positive quantities spanning decades |
| Beta | $\text{Beta}(\alpha, \beta)$ | $\alpha, \beta$ | Fractions, probabilities |
| Exponential | $\text{Exp}(\lambda)$ | $\lambda$ | Waiting times, scale-free positive quantities |

2.3 The Joint Posterior

Applying Bayes’ theorem to the full model, the joint posterior over all unknowns is:

$$p(\boldsymbol{\phi}, \boldsymbol{\theta} \mid \mathbf{y}) \propto p(\boldsymbol{\phi}) \prod_{j=1}^{J} p(\theta_j \mid \boldsymbol{\phi}) p(y_j \mid \theta_j)$$

where $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_J\}$ and $\mathbf{y} = \{y_1, \ldots, y_J\}$.

Read the three factors from right to left:

  1. $p(y_j \mid \theta_j)$ — the likelihood of the data given the group parameter (this is the same likelihood you would write for a non-hierarchical model of a single object)
  2. $p(\theta_j \mid \boldsymbol{\phi})$ — the population distribution (this acts as a prior on $\theta_j$, but its parameters are unknown and learned from data)
  3. $p(\boldsymbol{\phi})$ — the hyperprior (the prior on the population parameters)
ℹ️
The population distribution is a prior whose parameters are learned from data. This is what makes hierarchical models powerful: instead of choosing a fixed prior for $\theta_j$, you let the data determine what prior is appropriate. The “prior” on each $\theta_j$ adapts to the population. This is sometimes called an empirical Bayes approach when the hyperparameters are point-estimated, but in full Bayesian inference we sample over $\boldsymbol{\phi}$ as well.

3. Shrinkage

The most important and initially counterintuitive property of hierarchical models is shrinkage: the posterior estimates of the individual $\theta_j$ are pulled toward the population mean relative to the raw data estimates.

3.1 Why Shrinkage Happens

Consider the simplest case: $y_j \mid \theta_j \sim \mathcal{N}(\theta_j, \sigma_j^2)$ with known measurement uncertainties $\sigma_j$, and a population distribution $\theta_j \sim \mathcal{N}(\mu, \tau^2)$.

For fixed $\mu$ and $\tau$, the posterior for a single $\theta_j$ is:

$$p(\theta_j \mid y_j, \mu, \tau) \propto \mathcal{N}(y_j \mid \theta_j, \sigma_j^2) \times \mathcal{N}(\theta_j \mid \mu, \tau^2)$$

This is a product of two Gaussians in $\theta_j$—which we know from lecture 2 gives another Gaussian. The posterior mean is:

$$\hat{\theta}_j = \frac{\frac{y_j}{\sigma_j^2} + \frac{\mu}{\tau^2}}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}} = \underbrace{\frac{\tau^2}{\sigma_j^2 + \tau^2}}_{w_j} y_j + \underbrace{\frac{\sigma_j^2}{\sigma_j^2 + \tau^2}}_{1 - w_j} \mu$$

This is a precision-weighted average of the individual measurement $y_j$ and the population mean $\mu$, with weight $w_j = \tau^2 / (\sigma_j^2 + \tau^2)$.

3.2 The Shrinkage Weight

The weight $w_j$ controls how much the estimate for object $j$ relies on its own data versus the population:

  • When $\sigma_j \ll \tau$ (precise measurement, broad population): $w_j \approx 1$, so $\hat{\theta}_j \approx y_j$. The data are trusted.
  • When $\sigma_j \gg \tau$ (noisy measurement, tight population): $w_j \approx 0$, so $\hat{\theta}_j \approx \mu$. The population prior dominates.

Shrinkage is strongest for the noisiest observations. This is exactly what you want: when you have a noisy measurement, the best guess is to trust the population average more than the individual data point. When you have a precise measurement, you trust the data.
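The weight formula is easy to verify numerically. A minimal sketch, with invented values for $\mu$, $\tau$, and two example measurements:

```python
import numpy as np

# Fixed population parameters, for illustration only
mu, tau = -0.15, 0.08

# Two hypothetical stars: one noisy measurement, one precise
y = np.array([0.10, -0.12])
sigma = np.array([0.15, 0.03])

# Shrinkage weight and posterior mean from Section 3.1
w = tau**2 / (sigma**2 + tau**2)
theta_hat = w * y + (1 - w) * mu

print(w)          # noisy star: small weight; precise star: weight near 1
print(theta_hat)  # noisy star pulled toward mu; precise star stays near y
```

The noisy star's estimate moves most of the way to the population mean; the precise star barely moves.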

3.3 Shrinkage Is a Feature

At first glance, shrinkage feels like cheating—you are systematically moving estimates away from the data. But shrinkage provably reduces the total estimation error across the population. This is the James-Stein phenomenon: for three or more groups, the shrinkage estimator has lower total mean squared error than the individual maximum likelihood estimates, regardless of the true parameter values.

The intuition is simple: extreme observations are more likely to be extreme because of noise than because the true value is extreme. Shrinking them toward the mean corrects for this on average.
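The error reduction can be checked directly by simulation. The sketch below uses the oracle shrinkage weight (true $\mu$ and $\tau$ assumed known, which is enough to see the effect); all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented population and deliberately noisy measurements
mu, tau = 0.0, 1.0
J, sigma = 500, 2.0
theta = rng.normal(mu, tau, J)   # true values
y = rng.normal(theta, sigma)     # noisy observations

# Oracle shrinkage using the weight from Section 3.1
w = tau**2 / (sigma**2 + tau**2)
shrunk = w * y + (1 - w) * mu

print("MSE raw:   ", np.mean((y - theta)**2))       # ~ sigma^2 = 4
print("MSE shrunk:", np.mean((shrunk - theta)**2))  # ~ w * sigma^2 = 0.8
```

The shrunk estimates have markedly lower total squared error even though each one has been moved away from its own data.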

⚠️
Shrinkage does not mean you believe all objects are the same. The posterior for $\tau$ (the population spread) is jointly estimated with the individual $\theta_j$. If the data genuinely show a wide spread, $\tau$ will be large, $w_j$ will be close to 1, and there will be little shrinkage. The model adapts.

4. A Worked Example: Stellar Metallicities

We now build a complete hierarchical model for a concrete astronomical problem. Suppose we observe $J = 20$ stars in an open cluster and measure each star’s metallicity $[\text{Fe/H}]_j$ with known measurement uncertainty $\sigma_j$. The scientific question: what is the mean metallicity and intrinsic spread of the cluster?

4.1 Generating Simulated Data

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Population (hyperparameters) - the truth we want to recover
mu_true = -0.15        # true population mean [Fe/H]
tau_true = 0.08        # true intrinsic spread in [Fe/H]

# Number of stars
J = 20

# Draw true metallicities from the population
theta_true = np.random.normal(mu_true, tau_true, J)

# Measurement uncertainties (heteroscedastic - different for each star)
sigma = np.random.uniform(0.03, 0.15, J)

# Observed metallicities (true + noise)
y_obs = theta_true + np.random.normal(0, sigma)

# Print a summary
print(f"True population: mu = {mu_true:.2f}, tau = {tau_true:.2f}")
print(f"Naive sample mean: {np.mean(y_obs):.3f}")
print(f"Naive sample std:  {np.std(y_obs, ddof=1):.3f}")
print(f"Measurement uncertainties range: {sigma.min():.3f} to {sigma.max():.3f}")

4.2 Visualising the Problem

Before writing any model, look at the data. Each star has an observed metallicity and an error bar. Some error bars are small (high-resolution spectroscopy), others are large (low-resolution or low-S/N).

fig, ax = plt.subplots(figsize=(10, 5))

# Sort by observed metallicity for visual clarity
order = np.argsort(y_obs)
y_sorted = y_obs[order]
sigma_sorted = sigma[order]
theta_sorted = theta_true[order]

stars = np.arange(J)
ax.errorbar(stars, y_sorted, yerr=sigma_sorted, fmt='o', color='black',
            capsize=3, label='Observed [Fe/H]')
ax.scatter(stars, theta_sorted, marker='s', color='red', s=30,
           zorder=5, alpha=0.7, label=r'True $\theta_j$')
ax.axhline(mu_true, color='blue', linestyle='--', linewidth=1.5,
           label=rf'True $\mu$ = {mu_true:.2f}')
ax.fill_between([-1, J], mu_true - tau_true, mu_true + tau_true,
                alpha=0.15, color='blue', label=rf'True $\tau$ = {tau_true:.2f}')
ax.set_xlabel('Star index (sorted by observed [Fe/H])')
ax.set_ylabel('[Fe/H]')
ax.set_title('Open Cluster Metallicities: Data')
ax.legend(loc='upper left', fontsize=9)
ax.set_xlim(-0.5, J - 0.5)
plt.tight_layout()
plt.show()

The challenge is visible in this plot: the scatter in the observed values is a mixture of the true intrinsic spread $\tau$ and the measurement noise $\sigma_j$. Disentangling the two is exactly what the hierarchical model does.

4.3 The Model

The generative model, written as a DAG:

$$\mu, \tau \longrightarrow \theta_j \longrightarrow y_j \longleftarrow \sigma_j$$

In equations:

$$\mu \sim \mathcal{N}(0, 1) \qquad \text{(weakly informative hyperprior)}$$
$$\tau \sim \text{HalfNormal}(0.5) \qquad \text{(hyperprior on population spread)}$$
$$\theta_j \sim \mathcal{N}(\mu, \tau^2) \qquad j = 1, \ldots, J \qquad \text{(population distribution)}$$
$$y_j \sim \mathcal{N}(\theta_j, \sigma_j^2) \qquad j = 1, \ldots, J \qquad \text{(likelihood)}$$

The joint posterior is:

$$p(\mu, \tau, \boldsymbol{\theta} \mid \mathbf{y}, \boldsymbol{\sigma}) \propto p(\mu) p(\tau) \prod_{j=1}^{J} \mathcal{N}(\theta_j \mid \mu, \tau^2) \mathcal{N}(y_j \mid \theta_j, \sigma_j^2)$$

This model has $J + 2$ parameters: the two hyperparameters $(\mu, \tau)$ and the $J$ individual metallicities $\theta_j$. For $J = 20$, that is 22 parameters total.

4.4 Implementing the Log-Posterior

We write the log-posterior directly and sample with emcee, which you already know from lecture 9.

def log_prior(mu, tau):
    """Log-prior on hyperparameters."""
    if tau <= 0:
        return -np.inf
    lp = -0.5 * mu**2                            # N(0, 1) on mu
    lp += -0.5 * (tau / 0.5)**2 + np.log(2)      # HalfNormal(0.5) on tau
    return lp

def log_likelihood(theta, y, sigma):
    """Log-likelihood: product over stars."""
    return -0.5 * np.sum((y - theta)**2 / sigma**2 + np.log(2 * np.pi * sigma**2))

def log_population(theta, mu, tau):
    """Log-probability of theta under the population distribution."""
    return -0.5 * np.sum((theta - mu)**2 / tau**2 + np.log(2 * np.pi * tau**2))

def log_posterior(params, y, sigma, J):
    """Full hierarchical log-posterior.

    params: [mu, tau, theta_1, ..., theta_J]
    """
    mu = params[0]
    tau = params[1]
    theta = params[2:]

    lp = log_prior(mu, tau)
    if not np.isfinite(lp):
        return -np.inf
    lp += log_population(theta, mu, tau)
    lp += log_likelihood(theta, y, sigma)
    return lp
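Hand-written Gaussian log-densities like these are easy to get subtly wrong (a dropped factor of 2, a missing log term), so it is worth cross-checking them against scipy.stats on a few arbitrary numbers:

```python
import numpy as np
from scipy import stats

# Arbitrary test values
theta = np.array([-0.10, 0.00, 0.10])
y = np.array([-0.15, 0.05, 0.12])
sigma = np.array([0.05, 0.10, 0.04])

# The hand-written Gaussian log-likelihood, as in log_likelihood() above
manual = -0.5 * np.sum((y - theta)**2 / sigma**2 + np.log(2 * np.pi * sigma**2))

# The same quantity via scipy
reference = stats.norm.logpdf(y, loc=theta, scale=sigma).sum()

print(manual, reference)  # should agree to machine precision
```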

4.5 Sampling with emcee

import emcee

ndim = 2 + J  # mu, tau, theta_1, ..., theta_J
nwalkers = 64

# Initialise walkers near the data
p0 = np.zeros((nwalkers, ndim))
p0[:, 0] = np.random.normal(np.mean(y_obs), 0.05, nwalkers)         # mu
p0[:, 1] = np.abs(np.random.normal(0.1, 0.02, nwalkers))            # tau
for j in range(J):
    p0[:, 2 + j] = np.random.normal(y_obs[j], sigma[j] * 0.5, nwalkers)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior,
                                 args=(y_obs, sigma, J))

# Burn-in
state = sampler.run_mcmc(p0, 1000, progress=True)
sampler.reset()

# Production
sampler.run_mcmc(state, 3000, progress=True)

samples = sampler.get_chain(flat=True)
print(f"Shape of samples: {samples.shape}")  # (nwalkers * 3000, ndim)

4.6 Checking the Results

# Extract hyperparameter posteriors
mu_samples = samples[:, 0]
tau_samples = samples[:, 1]
theta_samples = samples[:, 2:]

print(f"mu:  {np.median(mu_samples):.3f} "
      f"[{np.percentile(mu_samples, 16):.3f}, "
      f"{np.percentile(mu_samples, 84):.3f}]")
print(f"tau: {np.median(tau_samples):.3f} "
      f"[{np.percentile(tau_samples, 16):.3f}, "
      f"{np.percentile(tau_samples, 84):.3f}]")
print(f"True: mu = {mu_true}, tau = {tau_true}")

4.7 Visualising Shrinkage

This is the key plot. For each star, compare the raw observation $y_j$ (no pooling) with the hierarchical posterior median of $\theta_j$ (partial pooling).

fig, ax = plt.subplots(figsize=(10, 6))

theta_medians = np.median(theta_samples, axis=0)
mu_med = np.median(mu_samples)

# Plot arrows from observed to shrunk estimates
for j in range(J):
    ax.annotate('', xy=(theta_medians[j], j), xytext=(y_obs[j], j),
                arrowprops=dict(arrowstyle='->', color='grey', lw=1.2))

ax.scatter(y_obs, range(J), marker='o', color='black', s=40,
           zorder=5, label=r'Observed $y_j$')
ax.errorbar(y_obs, range(J), xerr=sigma, fmt='none', color='black',
            capsize=2, alpha=0.5)
ax.scatter(theta_medians, range(J), marker='s', color='C0', s=40,
           zorder=5, label=r'Hierarchical $\hat{\theta}_j$')
ax.scatter(theta_true, range(J), marker='d', color='red', s=30,
           zorder=5, alpha=0.7, label=r'True $\theta_j$')
ax.axvline(mu_med, color='blue', linestyle='--', linewidth=1.5,
           label=rf'Posterior $\mu$ = {mu_med:.3f}')

ax.set_ylabel('Star index')
ax.set_xlabel('[Fe/H]')
ax.set_title('Shrinkage: Observed vs Hierarchical Estimates')
ax.legend(loc='lower right', fontsize=9)
plt.tight_layout()
plt.show()

The grey arrows show each observation being pulled toward the population mean. Two patterns should be visible:

  1. Stars with large $\sigma_j$ shrink more — their data are noisier, so the population prior has more influence.
  2. Stars with small $\sigma_j$ shrink less — their data are precise, so they stay close to their observed values.

This is exactly the behaviour predicted by the shrinkage weight $w_j = \tau^2 / (\sigma_j^2 + \tau^2)$ from Section 3.

ℹ️
The shrinkage plot is the posterior predictive check for a hierarchical model. If the arrows all point inward (toward the population mean), the model is working as expected. If some arrows point outward, or if the hierarchical estimates are further from the truth than the raw data, something may be wrong with the model specification—perhaps the assumed population distribution is too restrictive.

5. Marginalising Out the Group Parameters

In the model above, we sampled $\mu$, $\tau$, and all $J$ individual $\theta_j$ jointly. This works, but the dimensionality grows linearly with the number of groups. For the simple Gaussian case, there is an alternative: marginalise out the $\theta_j$ analytically.

5.1 The Marginal Likelihood

When both the population distribution and the likelihood are Gaussian:

$$\theta_j \sim \mathcal{N}(\mu, \tau^2), \qquad y_j \mid \theta_j \sim \mathcal{N}(\theta_j, \sigma_j^2)$$

we can integrate out $\theta_j$:

$$p(y_j \mid \mu, \tau, \sigma_j) = \int p(y_j \mid \theta_j, \sigma_j) p(\theta_j \mid \mu, \tau) \mathrm{d}\theta_j$$

The integrand is a product of two Gaussians in $\theta_j$. Completing the square (or using the Gaussian convolution identity from lecture 5):

$$p(y_j \mid \mu, \tau, \sigma_j) = \mathcal{N}(y_j \mid \mu, \sigma_j^2 + \tau^2)$$

The marginal likelihood is just a Gaussian with variance equal to the sum of the measurement variance and the population variance. This makes physical sense: the total scatter you observe in $y_j$ around $\mu$ is the quadrature sum of the measurement noise and the intrinsic spread.
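The identity can be verified by numerical quadrature (the test values below are arbitrary):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Arbitrary test values
mu, tau, sigma_j, y_j = -0.15, 0.08, 0.05, -0.10

# Left-hand side: integrate the product of the two Gaussians over theta_j
integrand = lambda th: (stats.norm.pdf(y_j, loc=th, scale=sigma_j)
                        * stats.norm.pdf(th, loc=mu, scale=tau))
numeric, _ = quad(integrand, -2.0, 2.0)

# Right-hand side: the analytic marginal N(y_j | mu, sigma_j^2 + tau^2)
analytic = stats.norm.pdf(y_j, loc=mu, scale=np.sqrt(sigma_j**2 + tau**2))

print(numeric, analytic)  # agree to quadrature precision
```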

5.2 The Marginalised Posterior

After marginalisation, the posterior for the hyperparameters alone is:

$$p(\mu, \tau \mid \mathbf{y}, \boldsymbol{\sigma}) \propto p(\mu) p(\tau) \prod_{j=1}^{J} \mathcal{N}\left(y_j \middle| \mu, \sigma_j^2 + \tau^2\right)$$

This is a two-dimensional posterior regardless of $J$. Sampling is fast and mixing is excellent.

def log_posterior_marginalised(params, y, sigma):
    """Marginalised hierarchical log-posterior (mu, tau only)."""
    mu, tau = params
    if tau <= 0:
        return -np.inf

    # Hyperpriors
    lp = -0.5 * mu**2                                 # N(0, 1) on mu
    lp += -0.5 * (tau / 0.5)**2 + np.log(2)           # HalfNormal(0.5) on tau

    # Marginal likelihood
    s2 = sigma**2 + tau**2
    lp += -0.5 * np.sum((y - mu)**2 / s2 + np.log(2 * np.pi * s2))

    return lp

# Sample the 2D marginalised posterior
ndim_marg = 2
nwalkers_marg = 32

p0_marg = np.zeros((nwalkers_marg, ndim_marg))
p0_marg[:, 0] = np.random.normal(np.mean(y_obs), 0.05, nwalkers_marg)
p0_marg[:, 1] = np.abs(np.random.normal(0.1, 0.02, nwalkers_marg))

sampler_marg = emcee.EnsembleSampler(nwalkers_marg, ndim_marg,
                                      log_posterior_marginalised,
                                      args=(y_obs, sigma))

state_marg = sampler_marg.run_mcmc(p0_marg, 500, progress=True)
sampler_marg.reset()
sampler_marg.run_mcmc(state_marg, 3000, progress=True)

samples_marg = sampler_marg.get_chain(flat=True)
mu_marg = samples_marg[:, 0]
tau_marg = samples_marg[:, 1]

print(f"Marginalised posterior:")
print(f"  mu:  {np.median(mu_marg):.3f} "
      f"[{np.percentile(mu_marg, 16):.3f}, {np.percentile(mu_marg, 84):.3f}]")
print(f"  tau: {np.median(tau_marg):.3f} "
      f"[{np.percentile(tau_marg, 16):.3f}, {np.percentile(tau_marg, 84):.3f}]")
⚠️
Analytic marginalisation is not always possible. We could marginalise $\theta_j$ here because both the likelihood and the population distribution are Gaussian. If either is non-Gaussian—for example, a Poisson likelihood for count data, or a log-normal population distribution—the integral is no longer analytic and you must either sample the full $(J + 2)$-dimensional posterior or use numerical quadrature to approximate the marginal likelihood.

6. When You Cannot Marginalise: The Full Joint Approach

For most real problems, analytic marginalisation is not available. You must sample the full joint posterior over hyperparameters and group parameters simultaneously. This is where computational difficulties arise.

6.1 The Funnel Problem

Consider the joint posterior $p(\mu, \tau, \boldsymbol{\theta} \mid \mathbf{y})$. When $\tau$ is large, the $\theta_j$ are allowed to spread widely, and the posterior in $\theta_j$-space is broad. When $\tau$ is small, the $\theta_j$ are constrained to be close to $\mu$, and the posterior in $\theta_j$-space is narrow.

This creates a funnel-shaped geometry in the $(\tau, \theta_j)$ plane: a wide opening at large $\tau$ that narrows dramatically as $\tau \to 0$. MCMC samplers—especially gradient-based ones like Hamiltonian Monte Carlo—struggle with this geometry because the step size that works in the wide part of the funnel is too large for the narrow part, and vice versa.

The symptom in practice is divergent transitions (in HMC/NUTS) or very poor mixing (in ensemble samplers).

6.2 The Non-Centered Parameterisation

The solution is a change of variables. Instead of sampling $\theta_j$ directly, introduce a standardised variable:

$$\eta_j \sim \mathcal{N}(0, 1), \qquad \theta_j = \mu + \tau \cdot \eta_j$$

This is a deterministic transformation: given $(\mu, \tau, \eta_j)$, we recover $\theta_j$ exactly. But the posterior geometry is dramatically different. In the $(\tau, \eta_j)$ plane, the posterior is roughly elliptical regardless of $\tau$, because $\eta_j$ has a fixed $\mathcal{N}(0, 1)$ prior that does not depend on $\tau$.
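You can see the geometric difference by drawing from the prior alone. In centered coordinates the conditional spread of $\theta_j$ tracks $\tau$; in non-centered coordinates $\eta_j$ is standard normal whatever $\tau$ is. A prior-only sketch (not a posterior), using the same HalfNormal(0.5) hyperprior as our example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior draws: tau ~ HalfNormal(0.5), theta | tau ~ N(0, tau^2)
tau = np.abs(rng.normal(0.0, 0.5, 100_000))
theta = rng.normal(0.0, tau)   # centered coordinate
eta = theta / tau              # non-centered coordinate

small, large = tau < 0.1, tau > 0.5
print(np.std(theta[small]), np.std(theta[large]))  # very different scales
print(np.std(eta[small]), np.std(eta[large]))      # both close to 1
```

A sampler exploring $(\tau, \theta_j)$ must adapt its step size as $\tau$ changes; in $(\tau, \eta_j)$ a single step size works everywhere.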

| Parameterisation | Variables sampled | Geometry | When to use |
|---|---|---|---|
| Centered | $\mu, \tau, \theta_j$ | Funnel | Lots of data per group ($n_j \gg 1$) |
| Non-centered | $\mu, \tau, \eta_j$ | Roughly elliptical | Few data per group ($n_j$ small) |

The non-centered version is almost always better when you have one observation per group (our case). The centered version can be better when each group has many observations, because then the likelihood dominates the prior and the funnel disappears.

def log_posterior_noncentered(params, y, sigma, J):
    """Non-centered hierarchical log-posterior.

    params: [mu, tau, eta_1, ..., eta_J]
    theta_j = mu + tau * eta_j
    """
    mu = params[0]
    tau = params[1]
    eta = params[2:]

    if tau <= 0:
        return -np.inf

    # Hyperpriors
    lp = -0.5 * mu**2
    lp += -0.5 * (tau / 0.5)**2 + np.log(2)

    # Prior on eta: standard normal
    lp += -0.5 * np.sum(eta**2)

    # Transform to theta
    theta = mu + tau * eta

    # Likelihood
    lp += -0.5 * np.sum((y - theta)**2 / sigma**2 + np.log(2 * np.pi * sigma**2))

    return lp
ℹ️
The non-centered parameterisation is a reparameterisation trick. You are not changing the model—the joint posterior over $(\mu, \tau, \boldsymbol{\theta})$ is identical. You are only changing the coordinates in which you explore it. The sampler sees a smoother landscape and mixes better, but the scientific conclusions are unchanged. This is the same idea as reparameterising a covariance matrix in terms of a correlation and marginal variances, or using $\log \sigma$ instead of $\sigma$—a recurring theme in computational Bayesian inference.

7. A More Realistic Example: Estimating the Luminosity Function

The metallicity example was deliberately simple to expose the machinery. Let us now sketch a more realistic application that arises constantly in astronomy: estimating the luminosity function of a class of objects.

7.1 The Setup

You observe $J$ Type Ia supernovae. For each, you measure the peak apparent magnitude $m_j$ with uncertainty $\sigma_j$, and you know the redshift $z_j$ (and hence the distance modulus $\mu_\text{dist}(z_j)$, assuming a cosmology). The absolute magnitude is:

$$M_j = m_j - \mu_\text{dist}(z_j)$$

The scientific question: what is the distribution of absolute magnitudes $M_j$ across the population? In particular, what is the mean peak absolute magnitude $\langle M \rangle$ and the intrinsic scatter $\sigma_\text{int}$?

7.2 The Hierarchical Model

$$\langle M \rangle \sim \mathcal{N}(-19.3, 0.5^2) \qquad \text{(weakly informative)}$$
$$\sigma_\text{int} \sim \text{HalfNormal}(0.3)$$
$$M_j \sim \mathcal{N}(\langle M \rangle, \sigma_\text{int}^2) \qquad j = 1, \ldots, J$$
$$m_j \sim \mathcal{N}(M_j + \mu_\text{dist}(z_j), \sigma_j^2)$$

This is exactly the same structure as the metallicity example, with one layer of the distance modulus inserted. The marginal likelihood (after integrating out $M_j$) is:

$$p(m_j \mid \langle M \rangle, \sigma_\text{int}, z_j, \sigma_j) = \mathcal{N}\left(m_j \middle| \langle M \rangle + \mu_\text{dist}(z_j), \sigma_j^2 + \sigma_\text{int}^2\right)$$

7.3 Why This Matters

Without the hierarchical model, you would estimate each $M_j$ independently as $\hat{M}_j = m_j - \mu_\text{dist}(z_j)$ with uncertainty $\sigma_j$, then compute the sample mean and variance. But the sample variance of the $\hat{M}_j$ has expectation $\text{Var}(M_j) + \overline{\sigma_j^2}$: it overestimates the intrinsic scatter because it includes the measurement noise. You would need to subtract the mean measurement variance in quadrature, which is ad hoc, does not propagate uncertainty correctly, and fails when $\sigma_j$ varies across the sample.

The hierarchical model handles all of this automatically: it jointly estimates $\langle M \rangle$, $\sigma_\text{int}$, and the individual $M_j$, accounting for heteroscedastic measurement uncertainties and propagating all sources of uncertainty into the final answer.
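To make this concrete, here is a simulated sketch of the marginalised fit. The distance moduli, uncertainties, and sample size are invented, and for brevity the hyperparameters are estimated by maximising the marginal likelihood rather than sampling a full posterior:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated sample (all numbers invented for illustration)
J = 200
M_mean_true, sig_int_true = -19.3, 0.12
mu_dist = rng.uniform(34.0, 38.0, J)   # distance moduli from redshifts
sigma = rng.uniform(0.05, 0.20, J)     # heteroscedastic photometric errors
M_true = rng.normal(M_mean_true, sig_int_true, J)
m_obs = rng.normal(M_true + mu_dist, sigma)

def neg_log_marg(params):
    """Negative log marginal likelihood: N(m_j | <M> + mu_dist, sigma_j^2 + sig_int^2)."""
    M_mean, log_sig_int = params
    s2 = sigma**2 + np.exp(log_sig_int)**2
    resid = m_obs - M_mean - mu_dist
    return 0.5 * np.sum(resid**2 / s2 + np.log(2 * np.pi * s2))

res = minimize(neg_log_marg, x0=[-19.0, np.log(0.1)])
print(res.x[0], np.exp(res.x[1]))  # should land near -19.3 and 0.12
```

The fit recovers both the mean absolute magnitude and the intrinsic scatter despite every supernova having a different measurement uncertainty; no ad hoc quadrature subtraction is needed.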

⚠️
Selection effects break this. If your sample is magnitude-limited (you only observe supernovae brighter than some threshold), the observed distribution of $m_j$ is truncated, and the simple Gaussian population model is biased. Accounting for selection effects requires modifying the likelihood to include the selection function—a topic for a more advanced treatment, but one that arises in virtually every astronomical application of hierarchical models.

8. Connection to Earlier Material

Hierarchical models are not a new technique—they are the Bayesian workflow you already know, applied at one more level.

8.1 The Workflow, Applied Hierarchically

Recall the six steps from lecture 4:

| Step | Non-hierarchical | Hierarchical |
|---|---|---|
| 1 | Define data and parameters | Define data, group parameters, and hyperparameters |
| 2 | Write the generative model | Write the generative model with a population level |
| 3 | Derive the likelihood | Same, per group |
| 4 | Specify priors | Specify hyperpriors (priors on population parameters) |
| 5 | Sample the posterior | Sample the joint posterior (or marginalise analytically) |
| 6 | Posterior predictive checks | Check both individual fits and population predictions |

8.2 Posterior Predictive Checks for Hierarchical Models

A hierarchical model makes predictions at two levels:

  1. Individual level: for each object $j$, the posterior predictive distribution is $p(\tilde{y}_j \mid \mathbf{y}) = \int p(\tilde{y}_j \mid \theta_j) p(\theta_j \mid \mathbf{y}) \mathrm{d}\theta_j$. This checks whether the model fits each object.

  2. Population level: the posterior predictive distribution for a new, unobserved object is $p(\tilde{y}_\text{new} \mid \mathbf{y}) = \int p(\tilde{y}_\text{new} \mid \theta_\text{new}) p(\theta_\text{new} \mid \mu, \tau) p(\mu, \tau \mid \mathbf{y}) \, \mathrm{d}\theta_\text{new} \, \mathrm{d}\mu \, \mathrm{d}\tau$. This checks whether the population distribution is reasonable.

Both checks should be performed. A model can fit the data well individually but have a population distribution that is too narrow or too broad, or has the wrong shape (e.g., unimodal when the data are bimodal).

# Population-level posterior predictive check
fig, ax = plt.subplots(figsize=(8, 5))

# Draw population parameters from posterior
n_draws = 5000
idx = np.random.choice(len(mu_samples), n_draws)
mu_draw = mu_samples[idx]
tau_draw = tau_samples[idx]

# For each draw, generate a new object from the population
theta_new = np.random.normal(mu_draw, tau_draw)

# Histogram of predicted new objects vs observed data
ax.hist(theta_new, bins=40, density=True, alpha=0.5, color='C0',
        label='Population predictive')
ax.hist(y_obs, bins=10, density=True, alpha=0.7, color='black',
        histtype='step', linewidth=2, label='Observed data')

# Overlay the true population distribution
x_grid = np.linspace(-0.6, 0.3, 200)
ax.plot(x_grid, stats.norm.pdf(x_grid, mu_true, tau_true),
        'r--', linewidth=2, label=rf'True: $\mathcal{{N}}$({mu_true}, {tau_true}$^2$)')

ax.set_xlabel('[Fe/H]')
ax.set_ylabel('Density')
ax.set_title('Population Posterior Predictive Check')
ax.legend(fontsize=10)
plt.tight_layout()
plt.show()

9. Practical Guidance

9.1 When to Use a Hierarchical Model

Use a hierarchical model when:

  • You have multiple groups (objects, experiments, epochs) with the same structure
  • Individual groups have limited data (few observations per group, or noisy measurements)
  • You believe the groups are exchangeable—drawn from the same underlying population
  • The scientific question is about the population, not just individual objects

Common astronomical applications include: stellar population studies, exoplanet demographics, galaxy scaling relations, standardisable candle calibration (Type Ia SNe, Cepheids), and any survey where you measure the same quantity for many objects with heterogeneous uncertainties.

9.2 Choosing the Population Distribution

The population distribution $p(\theta_j \mid \boldsymbol{\phi})$ is a modelling choice, just like any other part of the generative model. Guidelines:

  • Start simple. A Gaussian (or log-normal for positive quantities) is usually the right first model. You can always add complexity later.
  • Check with posterior predictive checks. If the population predictive distribution does not match the observed distribution of $y_j$, the population model may need to be more flexible (e.g., a mixture of Gaussians, a skew-normal, or a non-parametric distribution).
  • Be wary of overly flexible population models. A population model with too many parameters can overfit, effectively returning to the no-pooling case. The point of the hierarchical model is to regularise the individual estimates.

9.3 Hyperprior Choices

The hyperpriors $p(\boldsymbol{\phi})$ are the priors on the population parameters. They deserve the same care as any other prior:

  • For location parameters (e.g., $\mu$): use weakly informative priors centered on a physically reasonable value. A $\mathcal{N}(0, 1)$ or $\mathcal{N}(0, 10)$ is usually fine if the data are on a natural scale.
  • For scale parameters (e.g., $\tau$): use a half-normal, half-Cauchy, or exponential prior. The half-Cauchy $\text{HalfCauchy}(0, s)$ is a common recommendation because it allows for both very small and very large values of $\tau$, but a half-normal is fine when you have a reasonable sense of the scale.
⚠️
The prior on $\tau$ matters more than you think. When the number of groups $J$ is small, the data provide little information about the population spread, and the prior on $\tau$ can strongly influence the posterior. Always perform a prior sensitivity analysis: re-run the model with different hyperpriors and check whether the conclusions change.
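To see what is at stake in the half-normal versus half-Cauchy choice, a minimal comparison of their log-densities with SciPy (the scale $s = 5$ is an illustrative assumption, not a recommendation):

```python
import numpy as np
from scipy import stats

# Two candidate hyperpriors on the population spread tau (tau > 0).
def log_prior_halfnormal(tau, s=5.0):
    if tau <= 0:
        return -np.inf
    return stats.halfnorm.logpdf(tau, scale=s)

def log_prior_halfcauchy(tau, s=5.0):
    if tau <= 0:
        return -np.inf
    return stats.halfcauchy.logpdf(tau, scale=s)

# The half-Cauchy's heavy tail keeps large tau values plausible, while the
# half-normal suppresses them strongly; this is what a sensitivity analysis
# with both priors would probe.
for tau in [1.0, 10.0, 50.0]:
    print(f"tau={tau:5.1f}  half-normal: {log_prior_halfnormal(tau):9.2f}"
          f"  half-Cauchy: {log_prior_halfcauchy(tau):7.2f}")
```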

9.4 Diagnosing Sampling Problems

Hierarchical models are harder to sample than non-hierarchical ones. Common symptoms and fixes:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Poor mixing in $\tau$ | Funnel geometry | Non-centered parameterisation |
| Divergent transitions (HMC) | Sharp curvature in the funnel | Non-centered parameterisation; increase `target_accept` |
| $\hat{R} > 1.01$ | Chains not converged | Run longer, reparameterise, improve initialisation |
| $\tau$ posterior piles up at zero | Too few groups or too-informative prior | Check the prior; consider whether $\tau = 0$ is physically reasonable |
| Multimodal posterior | Mixture population or label switching | Constrain ordering; use more walkers |
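The non-centered parameterisation listed as the fix in the first two rows can be sketched for an emcee-style sampler as follows: sample $(\mu, \log\tau, \eta_1, \ldots, \eta_J)$ and reconstruct $\theta_j = \mu + \tau\eta_j$ deterministically, so the funnel between $\tau$ and $\theta_j$ disappears from the sampled geometry. The hyperpriors below are illustrative assumptions, not the course's exact choices.

```python
import numpy as np

def log_posterior_noncentered(params, y, sigma):
    """Non-centered hierarchical log-posterior over (mu, log_tau, eta_1..J)."""
    mu, log_tau = params[0], params[1]
    eta = np.asarray(params[2:])
    tau = np.exp(log_tau)                # tau > 0 by construction

    theta = mu + tau * eta               # group-level parameters, rebuilt

    # Standard-normal prior on eta (the population distribution, re-expressed),
    # plus weak hyperpriors on mu and log_tau (illustrative assumptions)
    lp = -0.5 * np.sum(eta**2)
    lp += -0.5 * (mu / 20.0)**2          # N(0, 20) hyperprior on mu
    lp += -0.5 * (log_tau / 3.0)**2      # N(0, 3) prior on log tau

    # Gaussian measurement likelihood for each group
    lp += np.sum(-0.5 * ((y - theta) / sigma)**2 - np.log(sigma))
    return lp

# Smoke test on the eight-schools data (Section 10)
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])
p0 = np.concatenate([[8.0, np.log(6.0)], np.zeros(8)])
print(log_posterior_noncentered(p0, y, sigma))
```

Sampling in $\log\tau$ also removes the hard boundary at $\tau = 0$, which helps ensemble samplers as well as HMC.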

9.5 Scaling Up

For large $J$ (thousands of groups), emcee with the full $(J + 2)$-dimensional posterior becomes slow. Options:

  • Analytic marginalisation (when possible): reduces to a low-dimensional problem
  • PyMC: use NUTS (No-U-Turn Sampler), which handles high-dimensional posteriors efficiently via gradient information
  • Variational inference: approximate the posterior with a simpler distribution; fast but potentially biased
  • Expectation propagation or nested sampling: useful in specific contexts

For this course, emcee with the marginalised likelihood (Section 5) will handle problems up to several hundred groups comfortably.
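For reference, a minimal sketch of such a marginalised log-posterior, consistent with the Gaussian-Gaussian result $p(y_j \mid \mu, \tau) = \mathcal{N}(y_j \mid \mu, \sigma_j^2 + \tau^2)$ from Section 5. The hyperpriors here (flat on $\mu$, half-normal with scale 25 on $\tau$) are illustrative assumptions, not the course's exact choices.

```python
import numpy as np

def log_posterior_marginalised(params, y, sigma):
    """Marginalised log-posterior over (mu, tau) only: theta_j integrated out,
    so each y_j ~ N(mu, sigma_j^2 + tau^2)."""
    mu, tau = params
    if tau <= 0:
        return -np.inf
    var = sigma**2 + tau**2
    log_like = np.sum(-0.5 * (y - mu)**2 / var - 0.5 * np.log(2 * np.pi * var))
    log_prior = -0.5 * (tau / 25.0)**2   # half-normal(25) on tau; flat on mu
    return log_like + log_prior

# Smoke test on the eight-schools data (Section 10)
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])
print(log_posterior_marginalised([8.0, 6.0], y, sigma))
```

Because only two parameters remain, this posterior is trivial for emcee regardless of $J$; the cost is that you must re-derive the per-group $\theta_j$ afterwards from their conditional posteriors, as the eight-schools example below does.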


10. The Eight Schools Example

No lecture on hierarchical models is complete without mentioning the eight schools problem (Rubin, 1981), a canonical example in Bayesian statistics.

Eight schools in the United States each ran an SAT coaching programme. The estimated treatment effect $y_j$ and its standard error $\sigma_j$ for each school were:

| School | $y_j$ | $\sigma_j$ |
| --- | --- | --- |
| A | 28 | 15 |
| B | 8 | 10 |
| C | −3 | 16 |
| D | 7 | 11 |
| E | −1 | 9 |
| F | 1 | 11 |
| G | 18 | 10 |
| H | 12 | 18 |

The no-pooling estimate treats each school independently: School A’s coaching effect is $28 \pm 15$, suggesting a large positive effect. The complete-pooling estimate ignores school identity and gives a single effect of roughly $8 \pm 4$. The hierarchical model finds something in between: School A’s effect is shrunk from 28 toward the group mean, to roughly $10$–$12$, while the population mean is approximately $8$ with an intrinsic spread $\tau \approx 6$.

This is the James-Stein phenomenon in action: School A’s extreme estimate of 28 is likely inflated by noise. The hierarchical model tempers it.

# Eight schools data
y_schools = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma_schools = np.array([15, 10, 16, 11, 9, 11, 10, 18])

# Fit using the marginalised posterior
ndim_8s = 2
nwalkers_8s = 32
p0_8s = np.column_stack([
    np.random.normal(8, 2, nwalkers_8s),
    np.abs(np.random.normal(5, 1, nwalkers_8s))
])

sampler_8s = emcee.EnsembleSampler(
    nwalkers_8s, ndim_8s, log_posterior_marginalised,
    args=(y_schools, sigma_schools)
)
# Burn-in, then reset and draw the production samples
state_8s = sampler_8s.run_mcmc(p0_8s, 500, progress=True)
sampler_8s.reset()
sampler_8s.run_mcmc(state_8s, 5000, progress=True)

samples_8s = sampler_8s.get_chain(flat=True)
mu_8s = samples_8s[:, 0]
tau_8s = samples_8s[:, 1]

# Shrinkage estimate for each school: the posterior median of
# w * y_j + (1 - w) * mu, with weight w = tau^2 / (sigma_j^2 + tau^2)
theta_hat_8s = np.zeros(8)
for j in range(8):
    w = tau_8s**2 / (sigma_schools[j]**2 + tau_8s**2)
    theta_hat_8s[j] = np.median(w * y_schools[j] + (1 - w) * mu_8s)

print("School | y_j | sigma_j | theta_hat (hierarchical)")
print("-" * 55)
for j in range(8):
    school = chr(65 + j)
    print(f"  {school}    | {y_schools[j]:3d} |  {sigma_schools[j]:2d}     | {theta_hat_8s[j]:.1f}")
ℹ️
The eight schools problem is small enough to fit on a napkin, yet it demonstrates every important feature of hierarchical models: shrinkage, partial pooling, the funnel geometry, and the sensitivity to the prior on $\tau$. If you understand this example thoroughly, you understand hierarchical models.

11. Summary

Hierarchical models add a population level to the Bayesian generative model. Instead of fixing the prior on group-level parameters $\theta_j$, you let the population distribution $p(\theta_j \mid \boldsymbol{\phi})$ be learned from data. The hyperparameters $\boldsymbol{\phi}$ and the group parameters $\boldsymbol{\theta}$ are inferred jointly.

The joint posterior factorises as: $$p(\boldsymbol{\phi}, \boldsymbol{\theta} \mid \mathbf{y}) \propto p(\boldsymbol{\phi}) \prod_{j=1}^{J} p(\theta_j \mid \boldsymbol{\phi}) p(y_j \mid \theta_j)$$

Shrinkage is the defining behaviour: each individual estimate is pulled toward the population mean, with weight $w_j = \tau^2 / (\tau^2 + \sigma_j^2)$ on the individual estimate, so precisely measured groups (small $\sigma_j$) shrink little while noisy groups shrink strongly. This reduces total estimation error across the population.

Marginalisation can sometimes eliminate the group parameters analytically, reducing the posterior to the hyperparameters alone. For the Gaussian-Gaussian case, $p(y_j \mid \mu, \tau) = \mathcal{N}(y_j \mid \mu, \sigma_j^2 + \tau^2)$.

The non-centered parameterisation ($\theta_j = \mu + \tau \eta_j$ with $\eta_j \sim \mathcal{N}(0, 1)$) eliminates the funnel geometry that causes sampling difficulties, and should be your default when each group has few observations.

Posterior predictive checks must be performed at both the individual and population levels. The population-level check asks whether the inferred population distribution is consistent with the distribution of observed data.

The workflow is the same as before: specify the generative model, write the likelihood, choose priors (now hyperpriors), sample the posterior, and check the results. The only new ingredient is the population level in the DAG.


Further Reading

  • Gelman, A., et al. (2013). Bayesian Data Analysis, 3rd edition, Chapter 5. [The definitive reference on hierarchical models—thorough, clear, and full of examples]
  • Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3), 515–534. [Essential reading on the choice of hyperprior for $\tau$]
  • Hogg, D. W., Myers, A. D., & Bovy, J. (2010). Inferring the eccentricity distribution. arXiv:1008.4146. [A clean astronomical application of hierarchical modelling to exoplanet eccentricities]
  • Mandel, K. S., et al. (2017). Type Ia supernova light-curve inference: hierarchical Bayesian analysis in the near-infrared. ApJ, 842, 93. [Hierarchical models applied to SN Ia standardisation]
  • Betancourt, M. & Girolami, M. (2015). Hamiltonian Monte Carlo for Hierarchical Models. arXiv:1312.0906. [Excellent treatment of the funnel problem and non-centered parameterisations]
  • Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4), 377–401. [The original eight schools paper]
  • Foreman-Mackey, D., Hogg, D. W., & Morton, T. D. (2014). Exoplanet population inference and the abundance of Earth analogs from noisy, incomplete catalogs. ApJ, 795, 64. [Hierarchical inference for exoplanet occurrence rates, using the same tools from this course]