Markov Chain Monte Carlo

You have already used MCMC. In Lecture 6, you called emcee.EnsembleSampler, fed it a log-probability function, and got back a chain of samples that you visualised with corner plots and posterior predictive draws. It worked. But we treated the sampler as a black box: samples went in, posteriors came out, and the mechanism in between was opaque.

This lecture opens the box. We derive the Metropolis algorithm from first principles, build a working sampler in Python, and understand why the procedure converges to the correct posterior distribution. Then we return to emcee and Hamiltonian Monte Carlo with the vocabulary to understand what they are actually doing and, critically, how to diagnose when they are not doing it well.


1. The Problem: Intractable Integrals

Everything we want from a posterior distribution is an integral. The posterior mean of a parameter is:

$$\langle \theta \rangle = \int \theta p(\theta \mid y) d\theta$$

The variance, the credible intervals, the marginal distributions — all integrals. For a model with $d$ parameters, these are $d$-dimensional integrals over the full posterior surface.

In low dimensions, you can evaluate these on a grid. Place $K$ points along each axis and compute $p(\theta \mid y)$ at each grid point. But the number of evaluations scales as $K^d$. For $K = 100$ and $d = 10$, that is $10^{20}$ evaluations, more than the estimated number of grains of sand on Earth, and completely infeasible.
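The scaling is worth making concrete with a line of arithmetic (the evaluations-per-second figure is an illustrative assumption):

```python
# K grid points per axis in d dimensions means K**d posterior evaluations.
K, d = 100, 10
evaluations = K ** d
print(f"{evaluations:.1e} evaluations")

# Even at a (generously assumed) billion evaluations per second:
years = evaluations / 1e9 / 3.15e7   # 3.15e7 seconds per year
print(f"~{years:.0f} years of compute")
```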

MCMC solves this by drawing $N$ samples $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\}$ from the posterior such that the sample density is proportional to the probability density. Then any integral becomes a simple average:

$$\langle \theta \rangle \approx \frac{1}{N} \sum_{k=1}^{N} \theta^{(k)}$$

The samples cluster where the posterior is high and thin out where it is low. Figure 1 shows this correspondence: the density of points in the scatter plot matches the contours of the underlying distribution.
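The sample-average principle is easy to check on a distribution whose moments are known exactly. Here a Gamma(3, 1) distribution (mean 3, variance 3) is an illustrative stand-in for a posterior:

```python
import numpy as np

# Stand-in "posterior": Gamma(3, 1), whose moments we know analytically,
# so the sample averages can be checked against the truth.
rng = np.random.default_rng(0)
samples = rng.gamma(shape=3.0, scale=1.0, size=100_000)

mean_estimate = samples.mean()                     # true mean = 3
var_estimate = samples.var()                       # true variance = 3
interval = np.percentile(samples, [2.5, 97.5])     # 95% credible interval

print(mean_estimate, var_estimate, interval)
```

Every posterior summary in this lecture (means, variances, credible intervals) reduces to exactly this kind of array operation once samples exist.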


Figure 1: The central idea of MCMC. Left: contours of a 2D posterior distribution. Right: 2000 samples drawn from that distribution by an MCMC sampler. The sample density matches the probability density — regions of high probability have more samples, and the 2D histogram of samples recovers the original distribution.


2. What Is a Markov Chain?

A Markov chain is a sequence of states $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots$ where each state depends only on the immediately preceding one:

$$p(\theta^{(k+1)} \mid \theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(k)}) = p(\theta^{(k+1)} \mid \theta^{(k)})$$

This is the memoryless property: the chain has no memory of how it arrived at its current position. The next step depends on where you are now, not where you have been.

Analogy: imagine walking through a city where at each intersection you randomly choose a direction. Your choice at each intersection depends only on which intersection you are at, not on the path you took to get there. That is a Markov chain over the city’s intersections.

The key theoretical result is that under mild conditions (irreducibility, aperiodicity), a Markov chain converges to a unique stationary distribution regardless of where it starts. If we can construct a chain whose stationary distribution is our posterior $p(\theta \mid y)$, then after enough steps the samples from the chain will be draws from the posterior.
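The convergence claim can be verified numerically for a small discrete chain. The transition matrix below is a hypothetical three-state example (three "intersections" from the analogy):

```python
import numpy as np

# A hypothetical 3-state chain: row i gives the probabilities of
# stepping from state i to each of the three states.
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Two very different starting distributions...
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 1.0])

# ...converge to the same stationary distribution after enough steps.
for _ in range(100):
    p = p @ T
    q = q @ T

print(p)
print(q)
```

Both print the same vector, and that vector is unchanged by a further application of `T`: the stationary distribution, independent of the start.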


3. The Metropolis Algorithm

The Metropolis algorithm constructs exactly such a chain. It has three steps, repeated at each iteration $k$:

Step 1: Propose. Draw a candidate position $\theta'$ from a proposal distribution $q(\theta' \mid \theta^{(k)})$. A common choice is a Gaussian centred on the current position:

$$\theta' = \theta^{(k)} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\text{prop}^2 \mathbf{I})$$

Step 2: Compute the acceptance ratio.

$$\alpha = \frac{p(\theta' \mid y)}{p(\theta^{(k)} \mid y)}$$

If the proposed position has higher posterior probability than the current position, $\alpha > 1$. If lower, $\alpha < 1$.

Step 3: Accept or reject. Draw $u \sim \text{Uniform}(0, 1)$.

$$\theta^{(k+1)} = \begin{cases} \theta' & \text{if } u < \min(1, \alpha) \\ \theta^{(k)} & \text{otherwise} \end{cases}$$

If the proposal is better, always accept. If the proposal is worse, accept it with probability $\alpha$ — and the worse it is, the less likely you are to move there. If rejected, the chain stays at its current position and that position is recorded again.

ℹ️

The normalisation constant cancels. This is the reason MCMC is so powerful for Bayesian inference. We have $p(\theta \mid y) = p(y \mid \theta) p(\theta) / p(y)$, and the acceptance ratio is:

$$\alpha = \frac{p(\theta' \mid y)}{p(\theta^{(k)} \mid y)} = \frac{p(y \mid \theta') p(\theta')}{p(y \mid \theta^{(k)}) p(\theta^{(k)})}$$

The evidence $p(y)$ appears in both numerator and denominator and cancels. We never need to compute it. All we need is the unnormalised posterior — i.e., the likelihood times the prior — which is what your log_probability function already returns.

Figure 2 illustrates a single Metropolis step. The proposal can move uphill (always accepted) or downhill (accepted probabilistically).


Figure 2: A single step of the Metropolis algorithm. From the current position (blue dot), a proposal is drawn from a Gaussian proposal distribution. If the proposal lands in a region of higher posterior density it is accepted (green arrow, left). If it lands in a region of lower density it may still be accepted with probability $\alpha$ (orange arrow, middle), or rejected (red X, right), in which case the chain stays put.


4. Building a Sampler From Scratch

Here is a complete Metropolis sampler in Python. We apply it to a simple problem: estimating the mean $\mu$ and standard deviation $\sigma$ of a Gaussian from 50 data points.

import numpy as np

# Generate synthetic data
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=50)

def log_likelihood(theta, data):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -0.5 * np.sum(((data - mu) / sigma)**2) - len(data) * np.log(sigma)

def log_prior(theta):
    mu, log_sigma = theta
    if not (-10 < mu < 20 and -5 < log_sigma < 5):
        return -np.inf
    return 0.0  # flat prior within bounds

def log_posterior(theta, data):
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta, data)

def metropolis(log_prob, start, data, n_steps, proposal_scale):
    """Simple Metropolis sampler."""
    ndim = len(start)
    chain = np.zeros((n_steps, ndim))
    chain[0] = start
    log_prob_current = log_prob(start, data)
    n_accepted = 0

    for k in range(1, n_steps):
        # Step 1: Propose
        proposal = chain[k-1] + proposal_scale * np.random.randn(ndim)

        # Step 2: Acceptance ratio (in log space)
        log_prob_proposal = log_prob(proposal, data)
        log_alpha = log_prob_proposal - log_prob_current

        # Step 3: Accept or reject
        if np.log(np.random.rand()) < log_alpha:
            chain[k] = proposal
            log_prob_current = log_prob_proposal
            n_accepted += 1
        else:
            chain[k] = chain[k-1]

    print(f"Acceptance rate: {n_accepted / (n_steps - 1):.1%}")  # n_steps - 1 proposals made
    return chain

# Run the sampler
chain = metropolis(log_posterior, start=np.array([0.0, 0.0]),
                   data=data, n_steps=20000, proposal_scale=0.2)
⚠️
We sample $\log \sigma$ rather than $\sigma$ directly. The standard deviation must be positive, and sampling in log-space automatically enforces this constraint while also improving the geometry of the posterior (it makes the posterior more symmetric). This is a common parameterisation trick.

Figure 3 shows the chain’s trajectory overlaid on the posterior contours.


Figure 3: The first 500 steps of a Metropolis chain on the 2D posterior for $(\mu, \log\sigma)$. The chain starts at the green dot and wanders through parameter space, spending more time in high-probability regions. The trajectory shows the characteristic random-walk behaviour of the Metropolis algorithm.


5. Why Does It Work?

The Metropolis algorithm satisfies detailed balance: for any two states $\theta_a$ and $\theta_b$,

$$p(\theta_a) T(\theta_b \mid \theta_a) = p(\theta_b) T(\theta_a \mid \theta_b)$$

where $T(\theta_b \mid \theta_a)$ is the probability of transitioning from $\theta_a$ to $\theta_b$ (proposing $\theta_b$ and accepting it). This condition means that, in equilibrium, the flow of probability from $\theta_a$ to $\theta_b$ equals the reverse flow. It guarantees that if the chain is sampling from $p(\theta)$ at step $k$, it is still sampling from $p(\theta)$ at step $k+1$. The target distribution is stationary.

To see why Metropolis satisfies this, consider two states with $p(\theta_a) > p(\theta_b)$. Since the Gaussian proposal is symmetric, $q(\theta_b \mid \theta_a) = q(\theta_a \mid \theta_b)$. The transition probability is $T(\theta_b \mid \theta_a) = q(\theta_b \mid \theta_a) \cdot \min\big(1, p(\theta_b)/p(\theta_a)\big) = q \cdot p(\theta_b)/p(\theta_a)$. The reverse is $T(\theta_a \mid \theta_b) = q \cdot \min\big(1, p(\theta_a)/p(\theta_b)\big) = q \cdot 1$. Multiplying each side by the target density:

$$p(\theta_a) \cdot q \cdot \frac{p(\theta_b)}{p(\theta_a)} = q \cdot p(\theta_b) = p(\theta_b) \cdot q \cdot 1$$

Both sides are equal — detailed balance holds.

The intuition is straightforward: the chain spends more time in high-probability regions because proposals into those regions are more likely to be accepted, and proposals out of them are more likely to be rejected. In the limit of infinite samples, the histogram of the chain converges to the target distribution.


6. Tuning the Proposal Distribution

The proposal scale $\sigma_\text{prop}$ controls how far the chain tries to jump at each step, and its value has an enormous effect on sampling efficiency.

  • Too small ($\sigma_\text{prop} \ll$ posterior width): Almost every proposal is accepted (high acceptance rate), but the chain moves in tiny steps and takes a very long time to explore the posterior. This is an inefficient random walk.

  • Too large ($\sigma_\text{prop} \gg$ posterior width): Most proposals land far out in the tails where the posterior is negligible, so they are rejected. The chain stays stuck at its current position for long stretches.

  • Just right ($\sigma_\text{prop} \sim$ posterior width): A balance between exploration and acceptance. The optimal acceptance rate for a Metropolis sampler is approximately 23% for high-dimensional problems and 44% for 1D problems. A practical target for most problems is 25–50%.


Figure 4: The effect of proposal scale on sampling. Left: too small ($\sigma_\text{prop} = 0.02$) — high acceptance rate but the chain barely moves. Centre: well-tuned ($\sigma_\text{prop} = 0.3$) — a balance of acceptance and exploration. Right: too large ($\sigma_\text{prop} = 5.0$) — almost all proposals are rejected and the chain is stuck.

ℹ️
Check your acceptance rate. If it is above 80%, your proposal is too small. If it is below 10%, your proposal is too large. Adjust $\sigma_\text{prop}$ until the acceptance rate falls in the 25–50% range.
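The check can be automated. A minimal, self-contained sketch on a 1D unit-width Gaussian target (the `acceptance_rate` helper is hypothetical, written for this illustration, not part of any library):

```python
import numpy as np

rng = np.random.default_rng(0)
log_prob = lambda x: -0.5 * x**2   # unnormalised log density of N(0, 1)

def acceptance_rate(scale, n_steps=20_000):
    """Run a 1D Metropolis chain and return the fraction of accepted proposals."""
    x, lp, accepted = 0.0, log_prob(0.0), 0
    for _ in range(n_steps):
        prop = x + scale * rng.standard_normal()
        lp_prop = log_prob(prop)
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
            accepted += 1
    return accepted / n_steps

# Too small, about right, too large -- relative to the unit posterior width.
rates = {scale: acceptance_rate(scale) for scale in (0.02, 1.0, 50.0)}
for scale, rate in rates.items():
    print(f"scale={scale:6.2f}  acceptance={rate:.1%}")
```

The tiny proposal accepts almost everything, the huge one accepts almost nothing, and the scale matched to the posterior width lands in the useful middle range.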

7. Burn-In

When a chain starts far from the high-probability region of the posterior, the early samples trace a path toward that region but do not represent the stationary distribution. This initial transient is called burn-in.


Figure 5: Burn-in illustrated. The chain starts at $\mu = -5$ (far from the posterior mode near $\mu \approx 5$) and takes several hundred steps to reach the high-probability region. The shaded region marks the burn-in period whose samples should be discarded. After burn-in, the chain oscillates around the posterior mean — the “fuzzy caterpillar” pattern of a well-mixed chain.

In practice:

  1. Examine the trace plot. Identify where the chain “settles down” — where the running mean stabilises and the variance becomes stationary.
  2. Discard all samples before that point.
  3. Be generous. If burn-in looks like it ends at step 200, discard the first 500.
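The discard step can be sketched on a condensed 1D version of the Section 4 sampler, with a $\mathcal{N}(5, 1)$ target standing in for a real posterior and a deliberately bad starting point:

```python
import numpy as np

rng = np.random.default_rng(7)
log_prob = lambda x: -0.5 * (x - 5.0)**2   # N(5, 1) target

# Condensed 1D Metropolis chain, started far from the mode.
chain = np.empty(2000)
chain[0] = -20.0
lp = log_prob(chain[0])
for k in range(1, chain.size):
    proposal = chain[k - 1] + rng.standard_normal()
    lp_prop = log_prob(proposal)
    if np.log(rng.random()) < lp_prop - lp:
        chain[k], lp = proposal, lp_prop
    else:
        chain[k] = chain[k - 1]

# The transient drags the estimate toward the starting point...
print(f"mean, all samples:      {chain.mean():.2f}")
# ...so discard generously (here the first 500 of 2000) before summarising.
post = chain[500:]
print(f"mean, after discarding: {post.mean():.2f}")
```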

8. Autocorrelation and Effective Sample Size

Consecutive MCMC samples are correlated — the position at step $k+1$ depends on step $k$ by construction. This means that $N$ MCMC samples contain less information than $N$ independent samples.

The autocorrelation function measures how correlated sample $k$ is with sample $k + \ell$ at lag $\ell$:

$$\rho(\ell) = \frac{\text{Cov}(\theta^{(k)}, \theta^{(k+\ell)})}{\text{Var}(\theta^{(k)})}$$

At lag 0, $\rho(0) = 1$ by definition. As the lag increases, $\rho(\ell)$ decays toward zero — eventually two samples separated by enough steps become effectively independent.


Figure 6: Autocorrelation function for a Metropolis chain. The autocorrelation decays from 1 at lag 0 and crosses zero around lag ${\sim}40$. The integrated autocorrelation time $\tau$ for this chain is approximately 20 steps, meaning every 20th sample is roughly independent.

The integrated autocorrelation time is:

$$\tau = 1 + 2\sum_{\ell=1}^{\infty} \rho(\ell)$$

The effective sample size (ESS) is:

$$N_\text{eff} = \frac{N}{\tau}$$

A chain of 10,000 samples with $\tau = 50$ has an effective sample size of only 200. The standard error on any posterior estimate scales as $1 / \sqrt{N_\text{eff}}$, not $1/\sqrt{N}$.
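A sketch of the computation, using an AR(1) process as a stand-in chain because its autocorrelation time is known exactly: for coefficient $\varphi$, $\rho(\ell) = \varphi^\ell$ and $\tau = (1+\varphi)/(1-\varphi)$, so $\varphi = 0.9$ gives $\tau = 19$. Truncating the sum where the empirical ACF first goes negative is a simple heuristic; emcee uses a more careful adaptive window.

```python
import numpy as np

def autocorr_time(x, max_lag=1000):
    """Integrated autocorrelation time from the empirical ACF, truncated
    where the ACF first drops below zero (a simple, common heuristic)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acf = np.array([x[:n - l] @ x[l:] for l in range(min(max_lag, n // 2))])
    acf /= acf[0]                                   # rho(0) = 1
    cut = np.argmax(acf < 0) if np.any(acf < 0) else len(acf)
    return 1.0 + 2.0 * acf[1:cut].sum()

# Stand-in chain: AR(1) with phi = 0.9, so the true tau is 19.
rng = np.random.default_rng(3)
phi, n = 0.9, 100_000
x = np.empty(n)
x[0] = 0.0
for k in range(1, n):
    x[k] = phi * x[k - 1] + rng.standard_normal()

tau = autocorr_time(x)
print(f"tau ~ {tau:.1f}, N_eff ~ {n / tau:.0f}")
```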

⚠️
Rule of thumb: you need at least $50\tau$ samples after burn-in (Foreman-Mackey et al. 2013). If your autocorrelation time is 100 steps, you need at least 5,000 post-burn-in samples for reliable posterior estimates.

9. Trace Plots and Visual Diagnostics

The trace plot — parameter value versus step number — is the most important diagnostic tool for MCMC. Figure 7 shows what a well-behaved chain looks like.


Figure 7: Trace plots for $\mu$ and $\log\sigma$ from a Metropolis chain on the Gaussian fitting problem. The shaded region marks burn-in (discarded). After burn-in, both parameters show the “fuzzy caterpillar” pattern: rapid oscillation around a stable mean with no long-term trends. This is what a converged chain looks like.

A well-mixed chain should look like a “fuzzy caterpillar” — oscillating rapidly around a stationary mean with no visible trends, drifts, or long periods of stagnation. If you see any of the following, the chain has not converged:

  • A persistent upward or downward trend
  • Long plateaus where the chain is stuck at one value
  • Sudden jumps between distinct regions
  • Very slow, smooth wandering (proposal too small)

10. Ensemble Samplers: How emcee Works

Now that you understand the basic Metropolis algorithm, you can appreciate why emcee is different — and better — for most problems.

10.1 The Problem with Metropolis

The Metropolis sampler has a critical weakness: the proposal distribution must be tuned by hand. For a $d$-dimensional problem with correlated parameters, you need to set a $d \times d$ proposal covariance matrix. Get it wrong and the sampler either crawls or gets stuck.

10.2 The Stretch Move

emcee uses an affine-invariant ensemble sampler (Goodman & Weare 2010). Instead of one walker, it runs an ensemble of $L$ walkers simultaneously. Each walker proposes its next position using the positions of the other walkers:

  1. For walker $j$, randomly choose a companion walker $k$ (where $k \neq j$).
  2. Propose a new position along the line connecting walkers $j$ and $k$:

$$\theta' = \theta^{(k)} + z (\theta^{(j)} - \theta^{(k)})$$

where $z$ is drawn from a distribution $g(z) \propto 1/\sqrt{z}$ for $z \in [1/a, a]$ with $a = 2$ by default. This is the stretch move.

  3. Accept or reject with probability $\min\big(1,\, z^{d-1} p(\theta') / p(\theta^{(j)})\big)$. The factor $z^{d-1}$ corrects for the non-symmetric proposal distribution.
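These three steps fit in a short function. The sketch below updates walkers one at a time for clarity (emcee actually updates the ensemble in two halves so the moves can be parallelised), on a 2D standard-normal target; `stretch_move_sweep` is a hypothetical helper written for this illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def stretch_move_sweep(walkers, log_prob, a=2.0):
    """One sweep of Goodman-Weare stretch moves over the ensemble."""
    L, d = walkers.shape
    for j in range(L):
        k = rng.choice([i for i in range(L) if i != j])   # companion walker
        z = ((a - 1.0) * rng.random() + 1.0)**2 / a       # z ~ g(z) on [1/a, a]
        proposal = walkers[k] + z * (walkers[j] - walkers[k])
        log_alpha = (d - 1) * np.log(z) + log_prob(proposal) - log_prob(walkers[j])
        if np.log(rng.random()) < log_alpha:
            walkers[j] = proposal
    return walkers

# Target: 2D standard normal; 16 walkers started far from the mode.
log_prob = lambda th: -0.5 * np.sum(th**2)
walkers = rng.standard_normal((16, 2)) + 5.0
for _ in range(500):
    walkers = stretch_move_sweep(walkers, log_prob)

print(walkers.mean(axis=0))   # near [0, 0]
```

The inverse-CDF expression `((a - 1) * u + 1)**2 / a` draws $z$ from $g(z) \propto 1/\sqrt{z}$ on $[1/a, a]$; note there is no step-size parameter anywhere — the proposal geometry comes entirely from the walker positions.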

10.3 Why This Is Better

The stretch move is affine-invariant: if you linearly transform the parameter space (rotate, scale, skew), the algorithm’s performance does not change. This means it automatically handles correlated parameters without any tuning — the ensemble itself learns the shape of the posterior.

The only hyperparameter is the number of walkers $L$, which must satisfy $L \geq 2d$. In practice, $L = 4d$ to $10d$ works well.

10.4 Connecting to the Code You Already Wrote

Compare the Metropolis sampler from Section 4 to the emcee workflow from Lecture 6:

|                      | Metropolis (Section 4)                      | emcee                              |
| -------------------- | ------------------------------------------- | ---------------------------------- |
| Walkers              | 1                                           | $L \geq 2d$ (typically $\sim 4d$)  |
| Proposal             | Gaussian with hand-tuned $\sigma_\text{prop}$ | Stretch move using other walkers |
| Tuning needed        | Proposal scale (critical)                   | Number of walkers (not critical)   |
| Handles correlations | Poorly (without covariance tuning)          | Automatically (affine-invariant)   |
| Log-prob function    | Same                                        | Same                               |

The log_probability function you pass to emcee.EnsembleSampler is identical to the log_posterior function in our scratch sampler. The only difference is the machinery that generates proposals and decides whether to accept them.

ℹ️
This is why emcee “just works.” When you used emcee in Lecture 6, you did not need to set a proposal covariance matrix or tune a step size. The ensemble of walkers adapts to the posterior geometry automatically. The trade-off is that emcee requires multiple walkers, each of which must evaluate the log-probability at every step.

11. Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) takes a fundamentally different approach. Instead of a random walk, it uses the gradient of the log-posterior to guide proposals.

11.1 The Physical Analogy

Imagine placing a ball on a surface where the height at each point equals $-\log p(\theta \mid y)$ — the negative log-posterior. Low points correspond to high probability. Give the ball a random kick (momentum) and let it roll. It will naturally spend more time in the valleys (high-probability regions) and less time on the peaks (low-probability regions).

11.2 How It Works

HMC introduces an auxiliary momentum variable $\rho$ of the same dimensionality as $\theta$. The algorithm alternates between:

  1. Resample momentum: Draw $\rho \sim \mathcal{N}(0, \mathbf{I})$.
  2. Simulate Hamiltonian dynamics: Evolve $(\theta, \rho)$ forward in time using the leapfrog integrator for $L$ steps with step size $\varepsilon$. Each step updates momentum and position alternately:

$$\rho \leftarrow \rho + \frac{\varepsilon}{2} \nabla_\theta \log p(\theta \mid y) \qquad \text{(half-step kick)}$$

$$\theta \leftarrow \theta + \varepsilon \rho \qquad \text{(full-step drift)}$$

$$\rho \leftarrow \rho + \frac{\varepsilon}{2} \nabla_\theta \log p(\theta \mid y) \qquad \text{(half-step kick)}$$

Repeat this $L$ times (adjacent half-kicks merge into full kicks, so the cost is $L$ gradient evaluations, not $2L$).

  3. Accept or reject the endpoint using a Metropolis criterion based on the total energy (Hamiltonian).
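The whole recipe fits in a short function. The following is a bare-bones sketch on a 2D standard normal (where $\nabla_\theta \log p = -\theta$), with fixed $\varepsilon$ and $L$ — no step-size adaptation or NUTS-style trajectory tuning:

```python
import numpy as np

rng = np.random.default_rng(4)

# Target: 2D standard normal, so grad log p(theta) = -theta.
log_prob = lambda th: -0.5 * np.sum(th**2)
grad_log_prob = lambda th: -th

def hmc_step(theta, eps=0.2, n_leapfrog=20):
    """One HMC transition: kick-drift-kick leapfrog, then a Metropolis
    check on the total energy (the Hamiltonian)."""
    rho = rng.standard_normal(theta.shape)            # 1. resample momentum
    H0 = -log_prob(theta) + 0.5 * np.sum(rho**2)

    th, r = theta.copy(), rho.copy()                  # 2. leapfrog dynamics
    r = r + 0.5 * eps * grad_log_prob(th)             # first half kick
    for _ in range(n_leapfrog - 1):
        th = th + eps * r                             # full drift
        r = r + eps * grad_log_prob(th)               # full kick
    th = th + eps * r                                 # last drift
    r = r + 0.5 * eps * grad_log_prob(th)             # final half kick

    H1 = -log_prob(th) + 0.5 * np.sum(r**2)
    if np.log(rng.random()) < H0 - H1:                # 3. accept or reject
        return th
    return theta

samples = np.empty((2000, 2))
theta = np.array([8.0, -8.0])                         # start far from the mode
for k in range(samples.shape[0]):
    theta = hmc_step(theta)
    samples[k] = theta

post = samples[500:]
print(post.mean(axis=0), post.std(axis=0))            # near [0, 0] and [1, 1]
```

Because the leapfrog error nearly cancels over the trajectory, $H_1 \approx H_0$ and almost every proposal is accepted despite the large jumps.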

11.3 Advantages

Because HMC follows the gradient, it can propose points that are far from the current position but still in high-probability regions. This gives it dramatically lower autocorrelation and higher acceptance rates (often 60–90%) compared to random-walk Metropolis.


Figure 8: Metropolis (left) vs HMC (right) on the same 2D posterior, both after 200 steps. The Metropolis chain takes small, correlated steps and explores slowly. The HMC chain makes large, directed moves that efficiently cover the posterior. The difference is even more dramatic in higher dimensions.

11.4 Interactive Demo

An interactive visualisation by Chi Feng lets you compare MCMC algorithms side-by-side on different target distributions. Try switching between Random Walk Metropolis and HMC to see the difference in exploration efficiency.

11.5 Software

HMC and its adaptive variant NUTS (No-U-Turn Sampler) are implemented in:

  • PyMC — Python, automatic differentiation, NUTS by default
  • Stan (via cmdstanpy or PyStan) — fast compiled HMC/NUTS
  • NumPyro — Python/JAX, very fast on GPU

These tools compute gradients automatically using autodifferentiation, so you only need to specify the model.


12. The Corner Plot

The standard way to visualise an MCMC posterior is the corner plot (also called a triangle plot or pair plot): 1D marginal histograms along the diagonal, and 2D joint distributions in the lower triangle.


Figure 9: Corner plot from the Metropolis chain for the Gaussian fitting problem, showing the marginal posteriors on $\mu$ and $\log\sigma$ and their joint distribution. The dashed lines mark the 16th, 50th, and 84th percentiles. The 2D panel shows a hexbin density plot of the joint posterior. The true values (red stars) fall within the posterior, as expected for well-calibrated inference.


13. Convergence Diagnostics

How do you know when to trust your chain? Here is a practical checklist:

  1. Trace plots. Look for the fuzzy caterpillar pattern. No trends, no stagnation, no jumps between modes.

  2. Autocorrelation time. Compute $\tau$ and verify that you have at least $50\tau$ samples after burn-in. In emcee:

tau = sampler.get_autocorr_time()  # one value per parameter
print(f"Autocorrelation time: {tau}")
n_samples = sampler.get_chain().shape[0]
print(f"Effective samples per parameter: {n_samples / tau}")
print(f"Need at least 50*tau = {50 * tau} samples")

  3. Gelman-Rubin $\hat{R}$. Run multiple chains from different starting points. Compute the ratio of between-chain variance to within-chain variance. $\hat{R} < 1.01$ indicates convergence.

  4. Corner plots. Check that the posteriors are smooth, unimodal (or have the expected shape), and that credible intervals are sensible.

  5. When in doubt, run longer. Double the chain length and compare the results. If the posterior changes, the original chain was not long enough.
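The Gelman-Rubin statistic is short enough to implement directly. This is a sketch of the classic formulation (modern tools such as ArviZ use a rank-normalised split variant), tested on hypothetical synthetic chains:

```python
import numpy as np

def gelman_rubin(chains):
    """Classic R-hat for an array of shape (n_chains, n_samples)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(5)
# Four hypothetical well-mixed chains targeting the same distribution:
good = rng.standard_normal((4, 5000))
# Four chains stuck in different places (e.g. separate modes):
bad = rng.standard_normal((4, 5000)) + np.arange(4)[:, None] * 3.0

print(f"R-hat (converged): {gelman_rubin(good):.3f}")
print(f"R-hat (stuck):     {gelman_rubin(bad):.2f}")
```

When all chains explore the same distribution, $B$ is small and $\hat{R} \approx 1$; chains stuck in different regions inflate $B$ and push $\hat{R}$ well above 1.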


14. When MCMC Fails

MCMC is not a universal solution. It struggles in several situations:

Multimodal distributions. If the posterior has well-separated modes, a chain that starts in one mode may never find the others. The random walk has no mechanism for jumping across low-probability valleys.


Figure 10: A Metropolis chain on a bimodal posterior. The chain discovers the left mode and gets trapped there, never finding the right mode. The resulting histogram (right) shows only one peak — a dangerously wrong approximation to the true bimodal posterior (dashed line). Multimodal posteriors require specialised methods such as parallel tempering or nested sampling.

Very high dimensions. In high dimensions, even well-tuned Metropolis becomes inefficient because the optimal proposal scale shrinks and the chain takes $O(d^2)$ steps to produce an independent sample. HMC scales much better, but can still struggle above $d \sim 1000$.

Strong parameter correlations. Highly correlated parameters create narrow, elongated posteriors that are hard for isotropic proposals to explore. Reparameterisation (rotating or rescaling parameters to reduce correlations) often helps dramatically.

What to do:

  • For multimodal problems: consider nested sampling (e.g., dynesty, ultranest) or parallel tempering
  • For high dimensions or strong correlations: use HMC (PyMC, Stan, NumPyro) or reparameterise your model
  • For pure optimisation (you only need the MAP, not the full posterior): use scipy.optimize, not MCMC
  • If you plan to discard the posterior and only keep a point estimate: MCMC is wasted effort

15. Summary

MCMC draws samples from the posterior such that sample density $\propto$ probability density. Integrals become averages over samples.

A Markov chain is a sequence where each step depends only on the current position. Under mild conditions, it converges to a unique stationary distribution.

The Metropolis algorithm proposes a new position, computes the acceptance ratio $\alpha = p(\theta')/p(\theta^{(k)})$, and accepts with probability $\min(1, \alpha)$. The normalisation constant cancels in the ratio, which is why MCMC works for Bayesian inference without computing the evidence.

Burn-in is the transient before the chain reaches the stationary distribution. Discard it generously.

Autocorrelation makes consecutive samples correlated. The effective sample size is $N_\text{eff} = N / \tau$. You need at least $50\tau$ samples.

emcee uses an ensemble of walkers with stretch moves. It is affine-invariant, requires no tuning of the proposal distribution, and handles parameter correlations automatically.

HMC uses gradient information to make large, directed proposals. It has higher acceptance rates and lower autocorrelation than Metropolis, but requires differentiable models. PyMC, Stan, and NumPyro implement it.

Convergence diagnostics are mandatory: trace plots, autocorrelation times, Gelman-Rubin $\hat{R}$, and corner plots. Never trust a chain you have not checked.


Further Reading

  • Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. (2013). emcee: The MCMC Hammer. PASP 125, 306. [Short and practical — the paper behind the tool you already use]
  • Hogg, D. W. & Foreman-Mackey, D. (2018). Data analysis recipes: Using Markov chain Monte Carlo. ApJS 236, 11. [Excellent pedagogical introduction]
  • Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434. [Deep but readable treatment of HMC geometry]
  • Goodman, J. & Weare, J. (2010). Ensemble samplers with affine invariance. Communications in Applied Mathematics and Computational Science, 5(1), 65-80.