Fitting a Line to Data II: Priors, MCMC, and Robust Fitting
The previous lecture derived the exact weighted least-squares solution for a straight line, introduced the matrix formulation, and extended the likelihood to handle $x$ uncertainties. It ended with a problem: once $x$ uncertainties are included, the effective variance $\sigma_{y_i}^2 + m^2\sigma_{x_i}^2$ depends on $m$, so the normal equations no longer apply. We need numerical methods.
This lecture covers three things:
- Prior probabilities — how to encode prior information about parameters, and why seemingly "uninformative" uniform priors are actually quite informative for slopes.
- Numerical methods — MAP estimation with scipy.optimize and posterior sampling with emcee, applied to the x+y uncertainty problem from the previous lecture.
- Robust fitting — handling correlated measurement uncertainties, intrinsic scatter, and outliers, each of which requires a principled extension to the generative model.
We continue with the Hogg, Bovy & Lang (2010) dataset throughout.
1. Prior Probabilities
1.1 Why Priors Matter for Slopes
The prior $p(m, b)$ encodes what we believed about the parameters before seeing the data. For the intercept $b$, a uniform prior over some wide range is usually uncontroversial. The slope $m$ is more subtle.
Consider the question: what does an "uninformative" prior on $m$ look like? A uniform prior $m \sim \mathcal{U}(0, 10)$ seems natural, but it is not actually uninformative — it assigns much higher probability to nearly-vertical lines than to nearly-horizontal ones. This is because the same "amount of rotation" near $m \approx 0$ corresponds to a much smaller change in $m$ than near $m \approx 10$.
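This is easy to check numerically. The sketch below (not part of the lecture code) draws slopes from a uniform prior and converts them to line angles; most of the probability mass lands on steep orientations. The invariant alternative samples the angle uniformly and takes $m = \tan\theta$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Uniform prior on the slope: m ~ U(0, 10)
m_uniform = rng.uniform(0, 10, n)
theta_uniform = np.degrees(np.arctan(m_uniform))

# Invariant prior: uniform on the angle theta, then m = tan(theta)
theta_invariant = np.degrees(rng.uniform(-np.pi / 2, np.pi / 2, n))

# Under the uniform-slope prior, P(angle > 45 deg) = P(m > 1) = 0.9:
# the vast majority of sampled lines are steeper than 45 degrees.
frac_steep = np.mean(theta_uniform > 45)
print(f"P(angle > 45 deg) under uniform slope prior: {frac_steep:.2f}")
```

Under the angle-uniform prior, by contrast, every orientation is equally likely by construction.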
1.2 The Invariant Prior
A prior that is invariant to the choice of which variable is called $x$ and which is called $y$ — i.e., invariant to rotation of the axes — is:
$$p(m, b) \propto \frac{1}{(1 + m^2)^{3/2}}$$

This prior is equivalent to placing a uniform prior on the angle $\theta = \arctan(m)$ rather than on the slope directly.
The corresponding log-prior is:
$$\log p(m, b) = -\frac{3}{2}\log(1 + m^2) + \text{const}$$

def ln_prior(theta):
    b, m = theta
    return -1.5 * np.log(1 + m**2)

1.3 Prior Predictive Check
Before fitting any data, the Bayesian workflow requires a prior predictive check: draw parameter values from the prior and simulate the data those parameters would produce. This tests whether the prior is consistent with what you know about the problem before the data have any influence.
For the straight-line model the question is: does the prior on $(m, b)$ generate lines that span a physically reasonable range, or does it concentrate on orientations or intercepts that would never arise in practice?

Figure 1: Prior predictive check. Left: drawing slopes from a uniform prior $m \sim \mathcal{U}(0, 10)$ — the sampled lines cluster near vertical because the prior assigns equal probability to all slope values, while nearly vertical lines have much larger absolute slopes. Right: drawing slopes from the invariant prior $p(m) \propto (1+m^2)^{-3/2}$ — orientations are uniformly distributed, which is the truly "uninformative" choice. The invariant prior passes the prior predictive check; the uniform prior does not.
2. Maximum A Posteriori (MAP) Estimation
2.1 The Log-Posterior
The log-posterior combines the log-likelihood with the log-prior:
$$\log p(m, b \mid \text{data}) = \log \mathcal{L}(m, b) + \log p(m, b)$$

For the $x+y$ uncertainty model from the previous lecture, the log-likelihood is:

$$\log \mathcal{L}(m, b) = -\frac{1}{2}\sum_{i=1}^{N} \left[\frac{(y_i - mx_i - b)^2}{\sigma_{y_i}^2 + m^2\sigma_{x_i}^2} + \log(\sigma_{y_i}^2 + m^2\sigma_{x_i}^2)\right]$$

The MAP estimate is the mode of the posterior — the parameter values that maximise the log-posterior (or equivalently, minimise the negative log-posterior).
2.2 scipy.optimize
Since the log-posterior is not quadratic in $m$ (due to the $m^2$ term in the denominator), we cannot solve the normal equations. Instead, we minimise the negative log-posterior numerically:
import numpy as np
import scipy.optimize as op

def ln_likelihood_xyerr(theta, x, y, x_err, y_err):
    b, m = theta
    sigma2 = y_err**2 + m**2 * x_err**2
    residuals = y - m * x - b
    return -0.5 * np.sum(residuals**2 / sigma2 + np.log(sigma2))

def ln_prior(theta):
    b, m = theta
    return -1.5 * np.log(1 + m**2)

def ln_prob_xyerr(theta, x, y, x_err, y_err):
    return ln_prior(theta) + ln_likelihood_xyerr(theta, x, y, x_err, y_err)

# MAP estimation: minimise negative log-posterior
result = op.minimize(
    lambda *args: -ln_prob_xyerr(*args),
    x0=[200.0, 1.2],
    args=(x, y, x_err, y_err),
    method="L-BFGS-B"
)
b_map, m_map = result.x

The L-BFGS-B algorithm uses gradient information to find the minimum efficiently; its name refers to the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm with box (bound) constraints.
2.3 Effect of Including X Uncertainties

Figure 2: Comparison of MAP fits. Left: ignoring $\sigma_x$ (y-only uncertainty model). Right: including $\sigma_x$ (x+y uncertainty model). Including $x$ uncertainties correctly down-weights points with large $\sigma_x$ and produces a different slope estimate. The difference is modest for this dataset, but can be significant when $\sigma_x \sim \sigma_y/m$.
3. MCMC Sampling with emcee
MAP estimation gives the most probable parameter values but not the uncertainty on those values. For that we need the full posterior distribution, which we approximate by sampling.
3.1 emcee: the MCMC Hammer
emcee is an affine-invariant ensemble sampler that uses a collection of walkers to sample the posterior. Each walker proposes moves based on the positions of other walkers, making it effective for non-Gaussian and correlated posteriors without hand-tuning a proposal distribution.
import emcee

ndim = 2       # number of parameters: b, m
nwalkers = 32  # must be >= 2 * ndim; use more for complex posteriors

# Initialise walkers near the MAP estimate
p0 = [result.x + 1e-5 * np.random.randn(ndim) for _ in range(nwalkers)]

sampler = emcee.EnsembleSampler(
    nwalkers, ndim, ln_prob_xyerr,
    args=(x, y, x_err, y_err)
)

3.2 Burn-In and Production
MCMC chains must first "burn in" — move away from their initialisation and into the high-probability region. Discard these initial samples:
# Burn-in phase: run, discard, reset
state = sampler.run_mcmc(p0, 500, progress=False)
sampler.reset()
# Production phase: run and keep
sampler.run_mcmc(state, 1000, progress=False)
# Flatten all walkers into one chain
chain = sampler.get_chain(flat=True)  # shape: (nwalkers * 1000, 2)

To estimate the integrated autocorrelation time $\tau$, emcee provides sampler.get_autocorr_time(). As a rough rule, you want at least $50\tau$ samples after burn-in. For most 2D problems, 500 burn-in and 1000 production steps are sufficient, but more complex models may need far more.

3.3 Posterior Visualisation and Reporting
Corner plots show all 1D marginal distributions and all 2D joint distributions on one figure:
from corner import corner

fig = corner(chain, labels=[r'$b$', r'$m$'],
             quantiles=[0.16, 0.5, 0.84],
             show_titles=True, title_kwargs={"fontsize": 12})
Figure 3: Corner plot comparing posteriors from the y-only model (blue) and the x+y uncertainty model (orange). Each model has been run through emcee with the same invariant prior. Including $x$ uncertainties broadens the posterior on $m$ and shifts the median slightly — the $x$ errors are not negligible for this dataset.
Reporting MCMC results. Do not report the MAP or the chain mean alone. The standard practice is to report the median and the 68% credible interval (the 16th–84th percentile range, equivalent to $\pm 1\sigma$ for a Gaussian). For a parameter $\theta$, report it as:
$$\theta = \text{median}^{+\Delta_+}_{-\Delta_-}$$

where $\Delta_+$ is the 84th percentile minus the median and $\Delta_-$ is the median minus the 16th percentile. If the posterior is close to Gaussian, $\Delta_+ \approx \Delta_-$ and you can simplify to $\theta = \text{median} \pm \sigma$. The corner library computes this automatically when you pass quantiles=[0.16, 0.5, 0.84].
b_med, b_lo, b_hi = np.percentile(chain[:, 0], [50, 16, 84])
m_med, m_lo, m_hi = np.percentile(chain[:, 1], [50, 16, 84])
print(f"b = {b_med:.1f} +{b_hi-b_med:.1f} / -{b_med-b_lo:.1f}")
print(f"m = {m_med:.3f} +{m_hi-m_med:.3f} / -{m_med-m_lo:.3f}")

A posterior predictive check — drawing lines from the posterior and overlaying them on the data — is a visual check that the fitted model actually describes the observations. For each posterior sample $(b^{(s)}, m^{(s)})$, the predicted line $y = m^{(s)} x + b^{(s)}$ represents one plausible data-generating process consistent with the posterior. The envelope of these lines should bracket the data. If every data point falls well outside the envelope, or if the envelope is far wider than the data scatter, the model is misspecified.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x_line = np.linspace(0, 310, 200)
# Draw 100 random samples from the chain
idx = np.random.choice(len(chain), size=100, replace=False)
for b_s, m_s in chain[idx]:
    ax.plot(x_line, m_s * x_line + b_s, color='steelblue', alpha=0.12, lw=0.8)
ax.errorbar(x, y, xerr=x_err, yerr=y_err, fmt='o', color='#333355')
Figure 4: 100 posterior predictive draws from the x+y uncertainty model. The spread of lines represents the full posterior uncertainty on $m$ and $b$, including the correlation between them. A well-specified model should have data points scattered roughly symmetrically about the band, with a spread consistent with the reported error bars.
4. Arbitrary Two-Dimensional Uncertainties
The previous model assumed that $x$ and $y$ measurement errors are uncorrelated. The Hogg et al. (2010) dataset includes off-diagonal covariance information — the $x$ and $y$ errors are correlated. This is a model extension: we are adding information to the generative model without changing the parameters of scientific interest.
Generative model. Each data point $(x_i, y_i)$ is drawn from a 2D Gaussian centred on a true position $(\hat{x}_i, m\hat{x}_i + b)$ on the line, with a known covariance structure $\mathbf{C}_i$. The parameters remain $\theta = (b, m)$ with priors as before; the covariance matrices $\mathbf{C}_i$ are treated as known from the data file.
4.1 The Covariance Matrix
When $x$ and $y$ measurement errors are correlated, the uncertainty on a single data point $(x_i, y_i)$ is described by a $2 \times 2$ covariance matrix:
$$\mathbf C_i = \begin{bmatrix} \sigma_{x_i}^2 & \rho_{xy,i}\sigma_{x_i}\sigma_{y_i} \\\\ \rho_{xy,i}\sigma_{x_i}\sigma_{y_i} & \sigma_{y_i}^2 \end{bmatrix}$$

where $\rho_{xy,i}$ is the correlation coefficient between the $x$ and $y$ errors for point $i$. The Hogg et al. (2010) dataset provides all three quantities: $\sigma_{x_i}$, $\sigma_{y_i}$, and $\rho_{xy,i}$.
4.2 Marginalisation in 2D
For the straight-line model $y = mx + b$, define the residual vector for point $i$:
$$\mathbf{R}_i = \begin{bmatrix} x_i - \hat{x}_i \\\\ y_i - m\hat{x}_i - b \end{bmatrix}$$

where $\hat{x}_i$ is the unknown true $x$ position. Marginalising over $\hat{x}_i$ (integrating over all possible true positions) yields an effective 1D variance for the residual $y_i - mx_i - b$:

$$\Sigma_i^2 = \mathbf{V}^\top \mathbf{C}_i \mathbf{V} \qquad \text{where} \qquad \mathbf{V} = \begin{bmatrix} -m \\\\ 1 \end{bmatrix}$$

Expanding this:
$$\Sigma_i^2 = m^2\sigma_{x_i}^2 - 2m\rho_{xy,i}\sigma_{x_i}\sigma_{y_i} + \sigma_{y_i}^2$$When $\rho_{xy} = 0$ (uncorrelated errors), this reduces to the previous lecture's result: $\Sigma_i^2 = m^2\sigma_{x_i}^2 + \sigma_{y_i}^2$.
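As a quick sanity check (a numerical sketch with made-up values, not part of the lecture code), the quadratic form $\mathbf{V}^\top \mathbf{C}_i \mathbf{V}$ can be verified against the expanded expression for an arbitrary point:

```python
import numpy as np

# Hypothetical values for a single data point
m, sx, sy, rho = 2.2, 0.8, 1.5, -0.6

C = np.array([[sx**2, rho * sx * sy],
              [rho * sx * sy, sy**2]])
V = np.array([-m, 1.0])

quadratic_form = V @ C @ V
expanded = m**2 * sx**2 - 2 * m * rho * sx * sy + sy**2
assert np.isclose(quadratic_form, expanded)
```

Note that with $m > 0$ and $\rho < 0$ the cross-term $-2m\rho\sigma_x\sigma_y$ is positive, so the effective variance is larger than in the uncorrelated case; flipping the sign of $\rho$ shrinks it.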
The marginalised log-likelihood is:
$$\log \mathcal{L}(m, b) = -\frac{1}{2}\sum_{i=1}^{N} \left[\frac{(y_i - mx_i - b)^2}{\Sigma_i^2} + \log \Sigma_i^2\right]$$

4.3 Python Implementation
import numpy as np

def build_covariance_matrices(x_err, y_err, rho_xy):
    """Build per-point 2x2 covariance matrices."""
    N = len(x_err)
    C = np.zeros((N, 2, 2))
    C[:, 0, 0] = x_err**2
    C[:, 1, 1] = y_err**2
    C[:, 0, 1] = rho_xy * x_err * y_err
    C[:, 1, 0] = rho_xy * x_err * y_err
    return C

def ln_likelihood_cov(theta, x, y, C):
    b, m = theta
    V = np.array([-m, 1.0])  # projection vector
    # Effective variance: V^T C_i V for each point
    Sigma2 = np.einsum('i,nij,j->n', V, C, V)  # shape (N,)
    residuals = y - m * x - b
    return -0.5 * np.sum(residuals**2 / Sigma2 + np.log(Sigma2))

def ln_prior(theta):
    b, m = theta
    return -1.5 * np.log(1 + m**2)

def ln_probability_cov(theta, x, y, C):
    return ln_prior(theta) + ln_likelihood_cov(theta, x, y, C)

# Build covariance matrices
C = build_covariance_matrices(x_err, y_err, rho_xy)
Figure 5: Corner plot comparing three posterior estimates. Blue: using only $y$ uncertainties (the basic weighted least-squares result, now sampled with MCMC). Orange: using $x$ and $y$ uncertainties but assuming they are uncorrelated. Green: using the full $2 \times 2$ covariance matrix including correlations. Each extension to the model produces a measurably different posterior, demonstrating that the correlation structure of measurement errors is not just a technicality—it affects inference. This is the iterative model-building workflow: add one complication, verify the posterior shifts in the expected direction, then continue.

Figure 6: Posterior predictive check for the full covariance model: 100 lines drawn from the posterior, overlaid on data shown as error ellipses that capture $\sigma_x$, $\sigma_y$, and $\rho_{xy}$. The tilted ellipses encode the correlation between $x$ and $y$ measurement errors for each point. A good posterior predictive check shows the data points and their error ellipses consistent with the spread of posterior lines — points that lie far outside the band even accounting for their error ellipses may indicate model misspecification or genuine outliers.
5. Intrinsic Scatter
5.1 The Problem
Even after properly accounting for measurement uncertainties, residuals are often larger than expected. This is intrinsic scatter: physical variation in the true relationship that is not explained by the model $y = mx + b$.
For example, in the Tully–Fisher relation (luminosity vs. rotation velocity for galaxies), there is real scatter around the mean relation caused by differences in galaxy morphology, inclination, and environment—not just measurement noise.
If we ignore intrinsic scatter and the data genuinely has it, we will:
- Underestimate parameter uncertainties
- Get a reduced chi-squared $\chi^2_\nu \gg 1$
- Have systematically incorrect credible intervals
5.2 Modelling Intrinsic Scatter
The expanded generative model is: each data point has a true position drawn from a distribution that is itself scattered around the line, and then subject to measurement noise. Specifically, the true $y$ value for a point at true $x$ position $\hat{x}_i$ is not exactly $m\hat{x}_i + b$ — it is displaced by a random amount perpendicular to the line with amplitude $\lambda$. The observed $(x_i, y_i)$ then add measurement noise on top.
Parameters: $\theta = (b, m, \ln\lambda)$, where $\lambda > 0$ is the intrinsic scatter amplitude. Priors: rotation-invariant prior on $(b, m)$ as before; $\ln\lambda \sim \mathcal{U}(-10, 10)$ (a log-uniform prior over a wide range of scatter scales).
We add a new parameter $\lambda$ (the intrinsic scatter amplitude) that contributes additional variance orthogonal to the line. The intrinsic covariance matrix is:
$$\boldsymbol{\Lambda}_i = \frac{\lambda^2}{1 + m^2} \begin{bmatrix} m^2 & -m \\\\ -m & 1 \end{bmatrix}$$

This matrix projects scatter onto the direction perpendicular to the line (i.e., away from the line in the direction that matters for the residuals). The factor $(1 + m^2)^{-1}$ normalises correctly so that $\lambda$ represents the true scatter perpendicular to the line.
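One way to convince yourself of this construction (a simulation sketch with invented numbers, not part of the lecture code) is to place points exactly on a line, displace them perpendicular to it by Gaussian noise of standard deviation $\lambda$, and compare the variance of the resulting residuals $y - mx - b$ against the quadratic form $\mathbf{V}^\top \boldsymbol{\Lambda} \mathbf{V}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, b, lam = 1.5, 30.0, 2.0   # hypothetical line and scatter amplitude
n = 200_000

# Unit vector perpendicular to the line y = m x + b
v_hat = np.array([-m, 1.0]) / np.sqrt(1 + m**2)

# Points on the line, displaced perpendicular to it by N(0, lam^2)
x_true = rng.uniform(0, 100, n)
d = rng.normal(0, lam, n)
x = x_true + d * v_hat[0]
y = m * x_true + b + d * v_hat[1]

# Residual variance implied by the intrinsic covariance matrix,
# using the unnormalised projection vector V = (-m, 1)
V = np.array([-m, 1.0])
Lambda = (lam**2 / (1 + m**2)) * np.array([[m**2, -m], [-m, 1.0]])
predicted = V @ Lambda @ V   # equals lam^2 * (1 + m^2)

residuals = y - m * x - b
print(predicted, residuals.var())  # the two should agree closely
```

Because the residual $y - mx - b$ is $\sqrt{1+m^2}$ times the perpendicular distance, its variance is $(1+m^2)\lambda^2$ — exactly what the quadratic form with the unnormalised $\mathbf{V}$ produces.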
The effective variance for point $i$ now becomes:
$$\Sigma_i^2 = \mathbf{V}^\top (\mathbf C_i + \boldsymbol\Lambda_i) \mathbf{V} = \underbrace{\mathbf{V}^\top \mathbf C_i \mathbf{V}}_{\text{measurement}} + \underbrace{(1 + m^2)\,\lambda^2}_{\text{intrinsic}}$$

The intrinsic term carries the factor $(1 + m^2)$ because the residual $y_i - mx_i - b$ is $\sqrt{1 + m^2}$ times the perpendicular distance to the line, consistent with the unnormalised $\mathbf{V}$ used in the measurement term.

5.3 Implementation
def ln_likelihood_intrinsic(theta, x, y, C):
    b, m, ln_lambda = theta
    V = np.array([-m, 1.0])
    # Intrinsic scatter covariance matrix
    lam2 = np.exp(2 * ln_lambda)
    Lambda = (lam2 / (1 + m**2)) * np.array([[m**2, -m], [-m, 1]])
    # Total effective variance = measurement + intrinsic
    Sigma2 = np.einsum('i,nij,j->n', V, C + Lambda, V)
    residuals = y - m * x - b
    return -0.5 * np.sum(residuals**2 / Sigma2 + np.log(Sigma2))

def ln_prior_intrinsic(theta):
    b, m, ln_lambda = theta
    if not (-10 < ln_lambda < 10):
        return -np.inf  # hard prior boundaries on ln_lambda
    return -1.5 * np.log(1 + m**2)

def ln_probability_intrinsic(theta, x, y, C):
    lp = ln_prior_intrinsic(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + ln_likelihood_intrinsic(theta, x, y, C)

Running MCMC with 3 parameters ($b$, $m$, $\ln\lambda$):
import emcee, scipy.optimize as op

# Initial optimisation with 3 parameters
initial = np.array([200.0, 1.2, np.log(20.0)])
result = op.minimize(
    lambda *args: -ln_probability_intrinsic(*args),
    initial, args=(x, y, C), method="L-BFGS-B"
)

ndim, nwalkers = 3, 64
p0 = [result.x + 1e-4 * np.random.randn(ndim) for _ in range(nwalkers)]
sampler = emcee.EnsembleSampler(
    nwalkers, ndim, ln_probability_intrinsic, args=(x, y, C)
)
state = sampler.run_mcmc(p0, 1000, progress=True)
sampler.reset()
sampler.run_mcmc(state, 2000, progress=True)
chain = sampler.get_chain(flat=True)  # shape: (nwalkers*2000, 3)
Figure 7: Corner plot for the intrinsic scatter model with three parameters: intercept $b$, slope $m$, and log intrinsic scatter $\ln\lambda$. The posterior on $\ln\lambda$ is well-constrained and clearly positive, confirming that this dataset contains scatter beyond measurement noise alone.

Figure 8: Posterior predictive check for the intrinsic scatter model: 100 lines drawn from the posterior. The broader band of lines compared to the pure measurement uncertainty model reflects the inferred intrinsic scatter $\lambda$. A key test: points that appeared to be outliers under the simpler model should now fall within the expected spread if intrinsic scatter is the correct explanation. If several points still fall well outside the band, that is evidence of a different problem — true outliers or a more complex population — and the model should be extended further.
6. The Outlier Problem
6.1 Why Threshold-Based Approaches Fail
In practice, every real dataset contains some points that do not follow the main model. The tempting approach is to flag them as "outliers" and remove them—perhaps using a $3\sigma$ or $5\sigma$ threshold. But this approach has a fundamental flaw.
How should you determine outlier status?
- Before or after optimisation?
- Should you iterate—remove, refit, check again?
- Do you use the distance in $y$, in $x$, or orthogonally to the line?
- What threshold is "right"?
There is no principled answer to any of these questions, because determining with certainty whether a point is an outlier is impossible. A point that looks like a $4\sigma$ outlier under one model may be entirely consistent with a slightly different model. Removing it biases the fit toward the model you assumed in order to remove it.
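A small illustration of why thresholds misbehave (a synthetic sketch, not from the lecture): even on data with no outliers at all, a hard $3\sigma$ cut flags points, and iterating the cut — remove, re-estimate the scale, cut again — keeps removing them, because clipping shrinks the estimated scale.

```python
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, 10_000)  # perfectly well-behaved: no outliers

removed = 0
r = residuals
for _ in range(5):  # iterate: clip at 3 sigma, re-estimate scale, clip again
    keep = np.abs(r) < 3 * r.std()
    removed += np.sum(~keep)
    r = r[keep]

print(f"Points removed from outlier-free data: {removed}")
```

Roughly 0.3% of genuinely Gaussian points lie beyond $3\sigma$, so the procedure always discards valid data, and each re-estimation of the scale exposes a few more.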
6.2 The Probabilistic Solution: Mixture Models
The correct approach replaces binary outlier labels with probabilities. We introduce a latent variable $q_i \in \{0, 1\}$ for each point:
- $q_i = 0$: point $i$ belongs to the straight-line model ("foreground")
- $q_i = 1$: point $i$ is an outlier from a different distribution ("background")
Generative model. Each data point is first assigned membership by drawing $q_i \sim \text{Bernoulli}(1 - Q)$. If $q_i = 0$, the $y$ value is drawn from the foreground model (the line). If $q_i = 1$, the $y$ value is drawn from a broad background Gaussian. The observed data do not include $q_i$ — it is a latent (unobserved) variable. The background distribution represents whatever is causing the outliers (e.g., contamination by a different stellar population, pipeline errors, misidentified sources) without specifying the physical mechanism in detail.
Parameters: $\theta = (m, b, Q, \mu_\text{out}, V_\text{out})$, where $Q$ is the fraction of points on the line, $\mu_\text{out}$ is the mean of the background distribution, and $V_\text{out}$ is its variance. Priors: rotation-invariant prior on $(m, b)$; $Q$ parameterised in logit space with broad bounds; $\mu_\text{out}$ and $\ln V_\text{out}$ with wide uniform priors covering the data range.
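The generative model above can be simulated directly. The sketch below uses invented parameter values (not the Hogg dataset) to show the two-step draw: latent membership first, then the corresponding foreground or background distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
m, b = 1.2, 200.0              # hypothetical foreground line
Q = 0.8                        # fraction of points on the line
mu_out, V_out = 400.0, 5000.0  # hypothetical background mean and variance

x = rng.uniform(50, 250, N)
y_err = rng.uniform(15, 30, N)

# Latent membership: q = True marks an outlier, drawn with probability 1 - Q
q = rng.random(N) < (1 - Q)

# Foreground points scatter about the line with measurement noise;
# background points scatter about mu_out with variance V_out + y_err^2
y = np.where(q,
             rng.normal(mu_out, np.sqrt(V_out + y_err**2)),
             rng.normal(m * x + b, y_err))

print(f"{q.sum()} of {N} points were drawn from the background")
```

The inference problem is to recover $(m, b, Q, \mu_\text{out}, V_\text{out})$ from $(x, y, y_\text{err})$ alone, without ever observing $q$.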
The mixture likelihood for point $i$ is:
$$p(y_i \mid x_i, \sigma_{y_i}, \theta) = Q \cdot p_\text{fg}(y_i) + (1 - Q) \cdot p_\text{bg}(y_i)$$

where the foreground and background likelihoods are:
$$p_\text{fg}(y_i) = \frac{1}{\sqrt{2\pi\sigma_{y_i}^2}} \exp\left(-\frac{(y_i - mx_i - b)^2}{2\sigma_{y_i}^2}\right)$$

$$p_\text{bg}(y_i) = \frac{1}{\sqrt{2\pi(\sigma_{y_i}^2 + V_\text{out})}} \exp\left(-\frac{(y_i - \mu_\text{out})^2}{2(\sigma_{y_i}^2 + V_\text{out})}\right)$$

6.3 Key Insight: Marginalise Over Membership
A naive implementation would try to infer all $N$ binary labels $\{q_i\}$ simultaneously with the 5 continuous parameters—that is $N + 5$ parameters for $N$ data points. This is both over-parameterised and hard to sample.
The right approach is to marginalise over the $q_i$ values analytically. This is a standard Bayesian operation: the latent variable $q_i$ is unobserved, so we should sum (integrate) it out of the joint distribution. Since each $q_i$ is binary, the sum has exactly two terms:
$$p(y_i \mid x_i, \sigma_{y_i}, \theta) = \sum_{q_i \in \{0,1\}} p(y_i \mid x_i, \sigma_{y_i}, \theta, q_i) \, p(q_i \mid Q)$$

$$= Q \cdot p_\text{fg}(y_i) + (1 - Q) \cdot p_\text{bg}(y_i)$$

This is exactly the mixture likelihood above—and it requires only the 5 continuous parameters $(m, b, Q, \mu_\text{out}, V_\text{out})$. The marginalisation is what makes this a principled Bayesian treatment rather than an ad hoc outlier removal: we have not decided that any point is an outlier; we have integrated over all possible membership assignments, weighted by their probability.
The full log-likelihood over all data points is:
$$\log \mathcal{L}(\theta) = \sum_{i=1}^{N} \log\left[Q \cdot p_\text{fg}(y_i) + (1 - Q) \cdot p_\text{bg}(y_i)\right]$$

7. Implementing the Mixture Model
7.1 Log-Probability in Practice
Computing $\log(a + b)$ when $a$ and $b$ are very small numbers requires care—working directly with the logs avoids numerical underflow:
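Before the full implementation, a quick standalone illustration of the underflow problem (toy numbers of my own choosing):

```python
import numpy as np

a, b = -1000.0, -1001.0   # log-probabilities far below the double underflow limit

# Naive: exponentiate, add, take the log — both exp() calls underflow to 0
with np.errstate(divide='ignore'):
    naive = np.log(np.exp(a) + np.exp(b))   # -inf

# Stable: log(e^a + e^b) computed without ever leaving log space
stable = np.logaddexp(a, b)

print(naive, stable)
```

np.logaddexp evaluates $\max(a,b) + \log(1 + e^{-|a-b|})$ internally, which is why it survives where the naive sum does not. The mixture likelihood below relies on it for exactly this reason.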
def ln_prior_mixture(theta):
    m, b, Q_logit, mu_out, ln_V_out = theta
    # Flat priors with bounds
    if not (-100 < b < 600): return -np.inf
    if not (-10 < m < 10): return -np.inf
    if not (50 < mu_out < 600): return -np.inf
    if not (0 < ln_V_out < 16): return -np.inf
    return -1.5 * np.log(1 + m**2)  # invariant slope prior

def ln_probability_mixture(theta, x, y, y_err):
    m, b, Q_logit, mu_out, ln_V_out = theta
    lp = ln_prior_mixture(theta)
    if not np.isfinite(lp):
        return -np.inf
    # Transform Q from logit space (ensures 0 < Q < 1)
    Q = 1.0 / (1.0 + np.exp(-Q_logit))
    V_out = np.exp(ln_V_out)
    # Foreground log-likelihood per point
    sigma2_fg = y_err**2
    ln_pfg = -0.5 * ((y - m * x - b)**2 / sigma2_fg + np.log(sigma2_fg))
    ln_pfg -= 0.5 * np.log(2 * np.pi)
    # Background log-likelihood per point
    sigma2_bg = y_err**2 + V_out
    ln_pbg = -0.5 * ((y - mu_out)**2 / sigma2_bg + np.log(sigma2_bg))
    ln_pbg -= 0.5 * np.log(2 * np.pi)
    # Mixture: log(Q * pfg + (1-Q) * pbg)
    # np.logaddexp (or scipy.special.logsumexp) keeps this numerically stable
    ln_L = np.logaddexp(np.log(Q) + ln_pfg, np.log(1 - Q) + ln_pbg)
    return lp + np.sum(ln_L)

7.2 Running the MCMC
The mixture model has 5 parameters: slope $m$, intercept $b$, mixture fraction $Q$ (parameterised in logit space to enforce $0 < Q < 1$), outlier mean $\mu_\text{out}$, and log outlier variance $\ln V_\text{out}$.
import emcee, scipy.optimize as op

# Initial guess: [m, b, logit(Q~0.8), mu_out, ln(V_out)]
initial = np.array([1.2, 200.0, 1.5, 400.0, np.log(5000)])
result = op.minimize(
    lambda *a: -ln_probability_mixture(*a),
    initial, args=(x, y, y_err), method="L-BFGS-B"
)

ndim, nwalkers = 5, 64
p0 = [result.x + 1e-3 * np.random.randn(ndim) for _ in range(nwalkers)]
sampler = emcee.EnsembleSampler(
    nwalkers, ndim, ln_probability_mixture, args=(x, y, y_err)
)
# Burn-in
state = sampler.run_mcmc(p0, 1000, progress=True)
sampler.reset()
# Production
sampler.run_mcmc(state, 2000, progress=True)
chain = sampler.get_chain(flat=True)
Figure 9: Corner plot for the mixture model (showing only $m$, $b$, and $Q$ for clarity). The inferred mixture fraction $Q \approx 0.75$ means that the model estimates roughly 3/4 of the data points belong to the straight-line model. Compare the posterior on $m$ and $b$ to the simpler models from earlier lectures—the mixture model is more robust to the outlier points.
8. Posterior Membership Probabilities
8.1 Which Points Are Outliers?
After MCMC, we can ask: for each data point, what is the posterior probability that it belongs to the line model? This is a posterior predictive quantity — it is a statement about the latent membership variable $q_i$ given the observed data and all model parameters, marginalised over parameter uncertainty.
Using Bayes' theorem applied to the latent variable $q_i$:
$$p(q_i = 0 \mid y_i, \theta) = \frac{Q \cdot p_\text{fg}(y_i)}{Q \cdot p_\text{fg}(y_i) + (1 - Q) \cdot p_\text{bg}(y_i)}$$This is the posterior foreground probability for point $i$, given a specific parameter draw $\theta$.
To get the full posterior probability (marginalised over all parameter uncertainty), average over the MCMC chain:
$$\bar{p}(q_i = 0 \mid \text{data}) \approx \frac{1}{M} \sum_{k=1}^{M} p(q_i = 0 \mid y_i, \theta^{(k)})$$where the sum is over $M$ MCMC samples $\theta^{(k)}$.
# Compute per-point foreground probabilities averaged over chain
def foreground_probability(chain, x, y, y_err):
    probs = np.zeros((len(chain), len(x)))
    for k, theta in enumerate(chain):
        m, b, Q_logit, mu_out, ln_V_out = theta
        Q = 1.0 / (1.0 + np.exp(-Q_logit))
        V_out = np.exp(ln_V_out)
        sigma2_fg = y_err**2
        # The common 1/sqrt(2*pi) factor cancels in the ratio below
        pfg = Q * np.exp(-0.5 * (y - m*x - b)**2 / sigma2_fg) / np.sqrt(sigma2_fg)
        sigma2_bg = y_err**2 + V_out
        pbg = (1 - Q) * np.exp(-0.5 * (y - mu_out)**2 / sigma2_bg) / np.sqrt(sigma2_bg)
        probs[k] = pfg / (pfg + pbg)
    return probs.mean(axis=0)

# Subsample chain for speed
idx = np.random.choice(len(chain), size=500, replace=False)
p_fg = foreground_probability(chain[idx], x, y, y_err)

# Points with p_fg < 0.5 are more likely outliers than inliers
for i, prob in enumerate(p_fg):
    if prob < 0.5:
        print(f"Point {i+1}: p(foreground) = {prob:.2f} — likely outlier")
Figure 10: Posterior predictive draws from the mixture model with data points coloured by their inferred foreground probability $\bar{p}(q_i = 0 \mid \text{data})$. Light grey points have low foreground probability (likely outliers); dark points have high foreground probability (confidently on the line). The mixture model assigns continuous outlier probabilities rather than making a hard binary decision—this is the correct way to handle uncertain membership.
9. Extensions and Limitations
The mixture model presented here is simplified in several ways:
- Y-only errors: we have used only $\sigma_{y_i}$ in the mixture likelihood. The full treatment would incorporate the 2D covariance matrix $\mathbf{C}_i$ as in Section 4.
- Simple outlier model: the background distribution is a single broad Gaussian. In practice, outliers might come from multiple populations, or the background distribution might be physically motivated.
- Fixed model structure: we assumed that the foreground model is a straight line with no intrinsic scatter. A more complete model would combine intrinsic scatter (Section 5) with the mixture approach (Sections 6–8).
These extensions are straightforward in principle—you simply modify the likelihood functions—but they add parameters and increase MCMC runtime. The underlying structure remains the same: write down a generative model for the data, derive the log-posterior, and sample it.
Iterative model building is the Bayesian workflow. This lecture has followed the principled Bayesian workflow described in Lecture 4, step by step. At each stage we: (1) described the generative model clearly, naming all parameters and specifying their priors; (2) derived the likelihood from that generative model; (3) sampled the posterior with MCMC; (4) examined the posterior predictive draws to check for model failure. The sequence — y-only errors → x+y errors → 2D covariance → intrinsic scatter → mixture model — is not arbitrary. Each step was motivated by a specific failure the simpler model could not handle. This is how model complexity should be earned: each new parameter must solve a detectable problem in the previous model's posterior predictive check.
When your posterior predictive check fails (residuals too large, data outside the predictive band, systematic trends in the residuals), that failure tells you what to add to the model. When it passes, you have a model that is consistent with all the structure in the data you have chosen to test.
10. Summary
Prior probabilities encode knowledge before seeing data. A uniform prior on slope $m$ is not uninformative—it clusters probability on near-vertical lines. The rotation-invariant prior $p(m,b) \propto (1+m^2)^{-3/2}$ correctly treats all orientations equally. Always do a prior predictive check.
MAP estimation via scipy.optimize.minimize finds the mode of the posterior. It is fast and works when the normal equations fail (e.g., when the likelihood is non-linear in the parameters). But it gives only a point estimate—not a distribution.
MCMC with emcee samples the full posterior. The workflow is: initialise walkers near the MAP, run a burn-in phase, discard those samples, run the production phase, and flatten the chain. Check autocorrelation times before trusting the results.
Correlated 2D uncertainties are handled by a per-point covariance matrix $\mathbf C_i$. The effective variance after marginalisation is $\Sigma_i^2 = \mathbf{V}^\top \mathbf C_i \mathbf{V}$ where $\mathbf{V} = (-m, 1)^\top$. The cross-term $-2m\rho_{xy}\sigma_x\sigma_y$ can either increase or decrease the effective variance depending on the sign of $\rho_{xy}$ and $m$.
Intrinsic scatter is modelled as an additional covariance matrix $\boldsymbol{\Lambda}$ added to $\mathbf C_i$, with amplitude $\lambda$ as a free parameter. Parameterising as $\ln\lambda$ avoids positivity constraints and gives better-behaved sampling.
Outliers cannot be reliably identified with threshold-based rules. The principled solution is a mixture model that assigns a probability $Q$ that any point belongs to the foreground model, and marginalises over the binary membership variable analytically. This reduces the problem to 5 continuous parameters, sampleable with MCMC.
Posterior membership probabilities assign a continuous foreground probability to each data point—the probability that it belongs to the line model given all the data and the inferred parameters. These are not binary labels; they express the model's uncertainty about each point's membership.
Further Reading
- Hogg, D. W., Bovy, J., & Lang, D. (2010). Data analysis recipes: Fitting a model to data. arXiv:1008.4686. [Sections 9–17 cover outliers, mixture models, and marginalisation in full generality]
- Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. (2013). emcee: The MCMC Hammer. PASP 125, 306. [The emcee paper — short and readable]
- Kelly, B. C. (2007). Some aspects of measurement error in linear regression of astronomical data. ApJ 665, 1489. [Covers errors-in-variables regression in astronomy, including intrinsic scatter and non-detections]