Practical Bayesian Inference in Astronomy

Bayesian inference is not a single algorithm. It is a framework for learning from data—one that has been applied across physics and astronomy for decades, but inconsistently, and with widely varying notation and terminology. A physicist estimating a galaxy cluster mass and a statistician modelling a pulsar timing residual may be doing the same calculation and not recognise it.

This lecture addresses that gap. It follows the framework laid out by Eadie et al. (2023), a practical reference that consolidates Bayesian terminology and notation across the astronomical subdisciplines and gives practical guidance on how to do the inference correctly. By the end of this lecture you will know not just the mechanics of Bayesian inference but the decisions you have to make at every step—and why they matter.


1. Notation: Why It Matters

There is no standard notation for Bayesian inference across fields. Some authors write $\pi(\theta)$ for the prior, others write $p(\theta)$, others $P(\theta)$. Some use $\mathcal{L}(\theta)$ for the likelihood, others $L(\theta \mid y)$, others $p(y \mid \theta)$. A reader moving between papers in physics, astronomy, and applied math will encounter all of these and more.

This matters because notation carries information about what is a function of what. When you write $p(y \mid \theta)$, you are emphasising that this is a probability distribution over data $y$ for a given parameter value $\theta$. When you write $L(\theta)$, you are emphasising that you are viewing the same mathematical object as a function of parameters $\theta$ for fixed data $y$.

These are the same number—but they encode different perspectives, and confusing them leads to errors.

1.1 The Notation We Will Use

Following Eadie et al. (2023), we use the following conventions throughout this course:

| Symbol | Name | Meaning |
|---|---|---|
| $\theta$ | Parameters | The quantities we want to learn about |
| $y$ | Data | Observed values |
| $\tilde{y}$ | Replicated data | New data drawn from the posterior predictive |
| $p(\theta)$ | Prior | Probability distribution encoding beliefs before data |
| $p(y \mid \theta)$ | Likelihood | Probability of data given parameters |
| $p(\theta \mid y)$ | Posterior | Updated beliefs after seeing data |
| $p(\tilde{y} \mid y)$ | Posterior predictive | Distribution over new data given observed data |

When we need to emphasise that we are viewing the likelihood as a function of $\theta$ for fixed $y$, we write $L(\theta; y)$ or just $L(\theta)$. But the underlying function is always $p(y \mid \theta)$.

1.2 Bayes’ Theorem: The Engine

Everything in this lecture flows from one equation:

$$p(\theta \mid y) = \frac{p(y \mid \theta) p(\theta)}{p(y)}$$

where the marginal likelihood (also called the evidence) is:

$$p(y) = \int p(y \mid \theta) p(\theta) d\theta$$

In words: the posterior is proportional to the likelihood times the prior. The evidence is a normalising constant that ensures the posterior integrates to 1.

Because $p(y)$ does not depend on $\theta$, we almost always work with the unnormalised form:

$$p(\theta \mid y) \propto p(y \mid \theta) p(\theta)$$

This is the formula you will write down and implement for every model in this course.

ℹ️

Why is $p(y)$ so hard? The evidence requires integrating the likelihood times the prior over all possible parameter values. For high-dimensional models, this integral is analytically intractable. This is why we need numerical methods (MCMC) to compute the posterior—and why we can usually avoid computing $p(y)$ altogether for parameter estimation.

The evidence only becomes essential when we want to compare models—a topic we return to in Section 8.
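For a one-parameter model, all of this can be made concrete with a grid calculation. The sketch below uses made-up flux measurements and a Gaussian mean parameter; it is illustrative only (grid methods do not scale beyond a few dimensions, which is why we need MCMC later), but it shows the unnormalised product and the evidence integral explicitly:

```python
import numpy as np
from scipy import stats

# Illustrative data: five flux measurements with known noise sigma = 1.0
y = np.array([4.8, 5.3, 4.9, 5.6, 5.1])
sigma = 1.0

theta = np.linspace(0.0, 10.0, 2001)            # grid over the parameter
dtheta = theta[1] - theta[0]
prior = stats.norm.pdf(theta, loc=5.0, scale=3.0)

# log-likelihood of the full dataset at each grid point
loglike = np.array([stats.norm.logpdf(y, loc=t, scale=sigma).sum() for t in theta])

unnorm = np.exp(loglike) * prior                # p(y | theta) p(theta), unnormalised
evidence = unnorm.sum() * dtheta                # p(y) by numerical integration
posterior = unnorm / evidence                   # now integrates to 1

print(theta[np.argmax(posterior)])              # posterior mode, near the sample mean
```

Note that the mode is only found after normalising by the evidence for plotting purposes; the location of the maximum is the same with or without dividing by $p(y)$, which is exactly why the unnormalised form suffices for parameter estimation.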


2. The Bayesian Inference Problem

A complete Bayesian inference problem has three components that you must specify before touching any data:

  1. The likelihood $p(y \mid \theta)$: your model of the data-generating process
  2. The prior $p(\theta)$: your knowledge about the parameters before seeing data
  3. The inference algorithm: a method for computing or approximating $p(\theta \mid y)$

These are not optional. A Bayesian model without a specified likelihood is not a Bayesian model. A Bayesian model without a specified prior is not a Bayesian model.

2.1 The Joint Distribution

Before any data arrives, your model defines a joint distribution over both data and parameters:

$$p(y, \theta) = p(y \mid \theta) p(\theta)$$

This is the complete probabilistic description of your scientific model. It says: here are the plausible parameter values (the prior), and here is how those parameters produce data (the likelihood).

When data arrives, you condition on what you observed:

$$p(\theta \mid y) = \frac{p(y, \theta)}{p(y)}$$

The posterior is what remains after you fix $y$ to its observed value.

2.2 The Likelihood Is Not a Distribution Over $\theta$

This is the single most common source of confusion in Bayesian inference.

The likelihood $p(y \mid \theta)$ is a probability distribution over data $y$ for a fixed parameter value $\theta$. If you integrate it over $y$, you get 1:

$$\int p(y \mid \theta) dy = 1$$

The likelihood is not a probability distribution over $\theta$. If you integrate it over $\theta$, you do not generally get 1:

$$\int p(y \mid \theta) d\theta \neq 1 \quad \text{(in general)}$$

This matters because it tells you what the likelihood can and cannot do. The likelihood tells you how surprising your data would be under different parameter values. It does not tell you which parameter values are probable. That is the job of the prior and posterior.
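A quick numerical check makes the asymmetry explicit. For a Binomial likelihood with illustrative counts ($y = 3$ detections in $n = 10$ trials), summing over data at fixed $\theta$ gives 1, while integrating over $\theta$ at fixed data gives $1/(n+1)$:

```python
import numpy as np
from scipy import stats

n, y_obs = 10, 3     # illustrative: 3 detections in 10 trials

# As a distribution over data y (theta fixed): the probabilities sum to 1.
theta0 = 0.3
assert np.isclose(stats.binom.pmf(np.arange(n + 1), n, theta0).sum(), 1.0)

# As a function of theta (data fixed): the integral is 1/(n+1), not 1.
theta = np.linspace(0.0, 1.0, 10001)
dtheta = theta[1] - theta[0]
like = stats.binom.pmf(y_obs, n, theta)
integral = like.sum() * dtheta
print(integral)      # ≈ 1/11 ≈ 0.0909
```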

Figure 1

Figure 1: Left: the likelihood $p(y \mid \theta)$ as a function of data $y$ for fixed $\theta = \theta_0$—this is a proper distribution over $y$ integrating to 1. Right: the same function viewed as a function of $\theta$ for fixed data $y_\text{obs}$—this is the likelihood surface, and it does not integrate to 1. The posterior (shaded) combines the likelihood surface with the prior.


3. The Likelihood Function

The likelihood encodes your model of the data-generating process. Choosing an appropriate likelihood is the most consequential modelling decision you will make.

3.1 Choosing the Right Likelihood Family

The choice of likelihood is constrained by the type of data you are analysing. The table below gives the most common choices in astronomy:

| Data type | Example | Likelihood family |
|---|---|---|
| Counts (non-negative integers) | Photon counts, source detections | Poisson |
| Binary (0 or 1) | Detection/non-detection, classification | Binomial/Bernoulli |
| Positive reals | Luminosities, fluxes, distances | LogNormal, Gamma |
| Reals with additive Gaussian noise | Radial velocities, magnitudes | Normal |
| Rates (positive) | GRB rate, SN rate | Gamma-Poisson |
| Bounded interval | Inclination angle, efficiency | Beta |
| Categorical | Spectral class, morphology | Categorical/Multinomial |

Using a Normal likelihood for positive-definite data is a common and serious mistake. The Normal distribution has support on all of $\mathbb{R}$—it assigns nonzero probability to negative values. If your data are physically constrained to be positive (fluxes, luminosities, masses), your likelihood should reflect that.

3.2 The Likelihood and Measurement Error

Most astronomical measurements come with associated uncertainties. How you handle these determines whether your posterior represents what you think it does.

Homoscedastic errors: every observation has the same measurement uncertainty $\sigma$.

$$p(y_i \mid \theta, \sigma) = \mathcal{N}(y_i \mid f(\theta), \sigma^2)$$

Heteroscedastic errors: each observation has its own uncertainty $\sigma_i$ (the standard case in astronomy, where each data point has a reported error bar).

$$p(y_i \mid \theta) = \mathcal{N}(y_i \mid f(\theta), \sigma_i^2)$$

Unknown errors: the measurement uncertainty is itself unknown and must be inferred.

$$p(y_i \mid \theta, \sigma) = \mathcal{N}(y_i \mid f(\theta), \sigma^2), \quad \sigma \sim p(\sigma)$$

Intrinsic scatter: even if you know the measurement errors perfectly, the true relationship may have physical scatter $\sigma_\text{int}$ beyond measurement noise.

$$p(y_i \mid \theta, \sigma_\text{int}) = \mathcal{N}(y_i \mid f(\theta), \sigma_i^2 + \sigma_\text{int}^2)$$

Ignoring intrinsic scatter when it exists will cause your posterior to be overconfident—the posterior will be too narrow because you have attributed all variation to measurement noise.

3.3 The Log-Likelihood

In practice, we almost always work with the log-likelihood:

$$\log p(y \mid \theta) = \log \prod_{i=1}^{n} p(y_i \mid \theta) = \sum_{i=1}^{n} \log p(y_i \mid \theta)$$

(assuming the observations are conditionally independent given $\theta$). This is numerically more stable—probabilities of many individual observations can easily underflow to zero, but their logarithms sum to a finite number.
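The numerical point is easy to demonstrate. For a few thousand synthetic observations, the product of individual densities underflows to exactly zero in double precision, while the sum of log-densities remains a perfectly ordinary finite number:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=2000)      # 2000 synthetic observations

# Product of 2000 densities (each < 0.4) underflows float64 to zero...
prod = np.prod(stats.norm.pdf(y))
# ...but the sum of log-densities is perfectly representable.
loglike = stats.norm.logpdf(y).sum()

print(prod)       # 0.0 (underflow)
print(loglike)    # a finite number around -2800
```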

For a Gaussian likelihood with known errors:

$$\log p(y \mid \mu, \sigma) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n} \frac{(y_i - \mu_i)^2}{\sigma_i^2} - \sum_{i=1}^{n} \log \sigma_i$$

The middle term, $\sum (y_i - \mu_i)^2 / \sigma_i^2$, is the chi-squared statistic. Maximising the log-likelihood is equivalent to minimising chi-squared—this is why maximum likelihood estimation with Gaussian errors is equivalent to weighted least squares.

3.4 Example: The Period–Luminosity Relation for Cepheids

Cepheid variables follow a tight period–luminosity (Leavitt) relation:

$$M_i \sim \mathcal{N}(\alpha + \beta \log_{10}(P_i), \sigma_\text{int}^2 + \sigma_{M,i}^2)$$

where $M_i$ is the absolute magnitude of the $i$th Cepheid, $P_i$ its period in days, $\sigma_{M,i}$ the measurement uncertainty on $M_i$, and $\sigma_\text{int}$ intrinsic scatter around the mean relation.

The log-likelihood is:

$$\log p(\{M_i\} \mid \alpha, \beta, \sigma_\text{int}) = -\frac{1}{2}\sum_{i=1}^{n} \left[\frac{(M_i - \alpha - \beta \log_{10} P_i)^2}{\sigma_\text{int}^2 + \sigma_{M,i}^2} + \log(2\pi(\sigma_\text{int}^2 + \sigma_{M,i}^2))\right]$$

Note that $\sigma_\text{int}$ appears inside the log in the second term. This penalises large values of $\sigma_\text{int}$—wide Gaussians have lower likelihood per observation, all else equal. Maximum likelihood will find the value of $\sigma_\text{int}$ that best describes the residuals after fitting $\alpha$ and $\beta$.
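This log-likelihood is a direct translation into code. The sketch below uses synthetic Cepheids with made-up parameter values ($\alpha = -2.4$, $\beta = -2.9$, scatter 0.1 mag); it shows the normalisation penalty at work, since inflating $\sigma_\text{int}$ far beyond the actual residual scatter lowers the log-likelihood:

```python
import numpy as np

def cepheid_loglike(alpha, beta, sigma_int, M, logP, sigma_M):
    # Gaussian log-likelihood with total variance sigma_int^2 + sigma_{M,i}^2
    var = sigma_int**2 + sigma_M**2
    resid = M - (alpha + beta * logP)
    return -0.5 * np.sum(resid**2 / var + np.log(2 * np.pi * var))

# Synthetic Cepheids (illustrative values, not a real catalogue)
rng = np.random.default_rng(0)
logP = rng.uniform(0.5, 1.5, size=50)
sigma_M = np.full(50, 0.05)
M = -2.4 - 2.9 * logP + rng.normal(0, np.sqrt(0.1**2 + 0.05**2), size=50)

ll_good = cepheid_loglike(-2.4, -2.9, 0.1, M, logP, sigma_M)
ll_wide = cepheid_loglike(-2.4, -2.9, 2.0, M, logP, sigma_M)
print(ll_good > ll_wide)   # True: the normalisation term penalises needless width
```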

ℹ️

Why does $\sigma_\text{int}$ appear in the normalisation term?

When we write $M_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$, the full Gaussian density is:

$$p(M_i \mid \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left[-\frac{(M_i - \mu_i)^2}{2\sigma_i^2}\right]$$

The $\frac{1}{\sqrt{2\pi\sigma_i^2}}$ normalisation depends on $\sigma_i$. If you ignore this term (as some textbooks do when $\sigma_i$ is known and fixed), you are implicitly treating $\sigma_i$ as constant and dropping a constant from the log-likelihood. When $\sigma_i$ contains an unknown parameter like $\sigma_\text{int}$, you must include the normalisation—otherwise the likelihood is not a proper density and your inference will be wrong.


4. The Prior Distribution

The prior $p(\theta)$ encodes your knowledge about the parameters before seeing the data. It is often described as “subjective,” but this framing is misleading. The prior is a formal statement of domain knowledge—what physically plausible values of $\theta$ look like based on everything we know that is independent of the current dataset.

4.1 A Taxonomy of Priors

The broader Bayesian literature distinguishes several types of priors:

Proper priors integrate to 1: $$\int p(\theta) d\theta = 1$$

Improper priors do not integrate to a finite value: $$\int p(\theta) d\theta = \infty$$

Improper priors (such as the flat prior $p(\theta) \propto 1$ on $\mathbb{R}$) can sometimes be used in practice as long as the posterior is proper—but this requires verification, and it is easy to get wrong. In this course, use proper priors unless you have a specific reason not to. Remember that a proper prior can still have enormous support: $\theta \sim U(-10^{40}, +10^{40})$ integrates to 1, even though its density at any point is vanishingly small.

Informative priors concentrate on a specific region of parameter space based on strong prior knowledge—for instance, a prior on $H_0$ based on Planck CMB measurements.

Weakly informative priors rule out physically absurd values while remaining broad enough to let the data speak. This is the recommended default for most astronomical analyses.

Non-informative (flat) priors attempt to express complete ignorance. They are rarely truly non-informative: a flat prior on $\sigma$ is not flat on $\log \sigma$, and vice versa. What is “flat” depends on the parameterisation.
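The parameterisation-dependence of "flat" is easy to see by simulation. Draw $\sigma$ uniformly over an illustrative range and look at the implied distribution of $\log \sigma$—almost all of the probability mass piles up at large $\log \sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)

# A "flat" prior on sigma over [0.01, 10] (illustrative bounds)...
sigma = rng.uniform(0.01, 10.0, size=100_000)
log_sigma = np.log(sigma)

# ...is far from flat in log sigma: count samples per log-space bin.
hist, _ = np.histogram(log_sigma, bins=[-4.0, -2.0, 0.0, np.log(10.0)])
print(hist / hist.sum())   # heavily weighted toward the last bin
```

The last bin ($\sigma$ between 1 and 10) captures roughly 90% of the draws, even though it is only one of three equal-width intervals in log space. A prior that is flat in $\sigma$ is strongly informative about $\log \sigma$, and vice versa.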

Conjugate priors have the same functional form as the posterior, yielding analytical solutions. In practice, most real astronomical problems are not conjugate, but conjugate pairs build intuition.

Figure 2

Figure 2: The prior spectrum from uninformative (flat, purple) to weakly informative (shaded band, blue) to informative (narrow peak, orange). The true parameter value is shown as a dashed vertical line. With a flat prior, the posterior is driven entirely by the likelihood. With a weakly informative prior, the posterior is pulled only slightly toward the prior mode. With a strongly informative prior, the posterior barely moves from the prior.

4.2 The Two Failure Modes of Prior Specification

Every prior specification failure falls into one of the following categories:

Informationally vacuous priors (too wide): the prior assigns substantial probability to physically impossible or empirically absurd values. The prior contributes essentially no information.

Example: A prior $\sigma \sim \text{Normal}(0, 100)$ for an intrinsic scatter parameter in a luminosity–distance relation. This allows $\sigma = 150$ magnitudes of scatter—larger than the luminosity range of all known galaxy types combined. The prior is vacuous.

Consequence: In regions of sparse data or complex posterior geometry, the sampler will explore regions of parameter space that domain knowledge rules out. Posteriors may be inflated and difficult to sample.

Informationally restrictive priors (too narrow): the prior concentrates so strongly that it overwhelms the data.

Example: A prior $H_0 \sim \mathcal{N}(67.4, 0.01^2)$ informed by one particular CMB analysis. This says that a value of $H_0 = 70$ km/s/Mpc—well within the uncertainty of many distance-ladder estimates—is hundreds of standard deviations away from the prior mean. No amount of data could shift this posterior.

Consequence: The posterior looks precise but is biased toward the prior. You are fitting your prior, not your data. You really have to believe someone’s reported posterior if you are to use it as a prior!

The target: weakly informative priors. A weakly informative prior rules out absurdities, allows the full range of physically plausible values, and is dominated by data of any reasonable size.

4.3 Prior Elicitation for Common Astronomy Parameters

Location parameters (intercepts, means): use a Normal distribution centred at a physically motivated reference value, with a width that covers the plausible range at $\pm 2\sigma$.

# Absolute magnitude intercept for Tully-Fisher
alpha = pm.Normal('alpha', mu=-21.0, sigma=2.0)
# This covers M_B in [-25, -17] at 2σ: plausible for spirals

Scale parameters (scatter, widths, standard deviations): must be positive. Parameterise as uniform in $\log \sigma$, or use a HalfNormal or Exponential prior.

# flat in log space
ln_sigma = pm.Uniform('ln_sigma', -3, 1)

# Intrinsic scatter in a photometric redshift model
sigma_int = pm.HalfNormal('sigma_int', sigma=0.5)
# or equivalently:
sigma_int = pm.Exponential('sigma_int', lam=2.0)  # mean = 0.5

Slope parameters: use a Normal distribution. Think about plausible ranges from the literature or from physical bounds.

# Slope of the star-forming main sequence
beta = pm.Normal('beta', mu=0.8, sigma=0.3)
# Literature range: 0.6 to 1.0; allows down to 0.2 at 2σ

Probability parameters: constrained to $[0, 1]$, use a Beta distribution.

# Detection efficiency
eta = pm.Beta('eta', alpha=8, beta=2)
# Mean = 0.8, concentrated near high efficiency

Positive rate parameters: use a Gamma distribution.

# Rate of type Ia supernovae per year
lam = pm.Gamma('lam', alpha=4, beta=2)  # mean = 2, SD = 1

In all of these cases, you should check, through prior predictive checks, that your prior behaves as you expect. Do not simply take the guide above as the final word on which prior family to use for your bespoke problem.

4.4 Encoding Prior Constraints

When a parameter has a hard physical bound, encode it in the prior. Do not use an unconstrained prior and hope the likelihood keeps you away from forbidden regions—it may not, especially in sparse-data settings.

Positivity constraints:

# Distance modulus: must be positive
mu_dist = pm.HalfNormal('mu_dist', sigma=10)
# Or centred on a plausible value, keeping the hard lower bound:
mu_dist = pm.TruncatedNormal('mu_dist', mu=33, sigma=5, lower=0)

Theoretical bounds:

# Equation of state parameter w: general relativity + accelerating expansion require w < -1/3
w = pm.TruncatedNormal('w', mu=-1, sigma=0.3, upper=-1/3)
⚠️

Priors should be evaluated in observation space, not parameter space.

If you are uncertain whether a prior is reasonable, do not ask “does this range of $\theta$ seem plausible?” Ask “does this prior, when combined with the likelihood, generate data that looks like real observations?” This is the purpose of prior predictive checks. Set up the prior, simulate data from the joint distribution, and examine the simulated data against your domain knowledge.
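The mechanics of a prior predictive check can be sketched by hand in NumPy for a straight-line magnitude model (all numbers here are illustrative assumptions): draw parameters from the prior, push them through the likelihood, and inspect the simulated data against domain knowledge:

```python
import numpy as np

rng = np.random.default_rng(42)

# Prior predictive simulation for an illustrative magnitude-log(period) model
logP = np.linspace(0.3, 1.7, 20)     # covariate grid
S = 1000                             # number of prior draws

alpha = rng.normal(-2.0, 1.0, size=S)          # intercept prior
beta = rng.normal(-2.5, 1.0, size=S)           # slope prior
sigma = np.abs(rng.normal(0.0, 0.5, size=S))   # HalfNormal scatter prior

mu = alpha[:, None] + beta[:, None] * logP[None, :]
M_sim = rng.normal(mu, sigma[:, None])         # simulated datasets, shape (S, 20)

# Domain check: if many simulated magnitudes fall wildly outside the
# physically plausible range, revise the prior before fitting.
frac_absurd = np.mean((M_sim < -12) | (M_sim > 4))
print(frac_absurd)
```

In practice you would do this with `pm.sample_prior_predictive` inside a PyMC model rather than by hand, but the logic is identical: simulate from the joint distribution $p(y, \theta)$ and look at the $y$ margin.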

4.5 Using Previous Measurements as Priors

One of the most powerful features of Bayesian inference is the ability to incorporate previous analyses directly as priors. If a previous study measured $H_0 = 74.0 \pm 1.4$ km/s/Mpc, you can use:

$$H_0 \sim \mathcal{N}(74.0, 1.4^2)$$

as the prior in your new analysis.

This is appropriate when:

  • The previous dataset is independent of yours (different sources, different observations)
  • The previous analysis used a compatible model
  • You trust the quoted uncertainties

It is not appropriate to use your own data to set a prior and then fit the same data again—this double-dips and produces overconfident posteriors.

Sequential updating in practice: suppose you observe $N_1 = 10$ quasars in the first semester of a survey and obtain a posterior $p(\Lambda \mid y_1)$ for the quasar luminosity function slope. In the second semester, you observe $N_2 = 15$ more quasars. You can use the first semester’s posterior directly as the prior for the second semester’s analysis:

$$p(\Lambda \mid y_1, y_2) \propto p(y_2 \mid \Lambda) p(\Lambda \mid y_1)$$

The result is identical to analysing all 25 quasars at once—provided the observations are independent. Bayesian inference is path-independent for independent data.
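For a conjugate toy model, this path-independence can be verified by hand. Here a Beta-Binomial detection-probability model stands in for the (non-conjugate) luminosity function example, with made-up counts: sequential updating and the all-at-once analysis land on exactly the same posterior:

```python
from scipy import stats

# Illustrative counts: semester 1 has 7 detections in 10 trials,
# semester 2 has 9 detections in 15 trials.
a0, b0 = 1.0, 1.0                 # Beta(1, 1) prior on the detection probability

# Sequential: the posterior after y1 becomes the prior for y2.
a1, b1 = a0 + 7, b0 + 3           # update with semester 1
a2, b2 = a1 + 9, b1 + 6           # update with semester 2

# Batch: analyse all 25 trials at once.
a_all, b_all = a0 + 16, b0 + 9

print((a2, b2) == (a_all, b_all))   # True: identical posteriors
print(stats.beta.mean(a2, b2))      # posterior mean ≈ 0.63
```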


5. Computing the Posterior

For almost all real astronomical models, the posterior has no analytical form. We need numerical methods.

5.1 Three Families of Methods

Markov Chain Monte Carlo (MCMC) generates a large collection of samples $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}\}$ that are distributed according to the posterior $p(\theta \mid y)$. Any posterior quantity—mean, credible interval, marginal distribution—can then be estimated from these samples. MCMC is the workhorse of Bayesian inference in astronomy and the method we will use throughout this course. We will cover how it works in depth in later lectures.

Laplace approximation fits a multivariate Gaussian to the posterior, centred at the posterior mode. It is fast but only accurate for nearly Gaussian posteriors.

Variational inference (VI) finds the closest simple distribution to the posterior in a mathematical sense. Faster than MCMC but approximate; can miss important posterior structure.

5.2 Convergence Diagnostics: What to Check

MCMC is not guaranteed to work correctly just because it finishes. MCMC will always give you a set of numbers. After sampling, you must check that the chains have actually converged to the posterior.

We will cover these in more detail in future lectures; for now, this section gives you a feeling for the kinds of problems you may encounter with MCMC. Essential diagnostics include:

$\hat{R}$ (Potential Scale Reduction Factor)

MCMC is run as multiple independent chains—each chain starts from a different random initial point and samples independently. If all chains have converged to the same posterior, they should all look like they were drawn from the same distribution.

$\hat{R}$ compares variability within chains to variability between chains. When the chains agree:

  • $\hat{R} \approx 1.0$: chains have converged ✓
  • $\hat{R} > 1.01$: chains disagree; do not trust the posterior

Effective Sample Size (ESS)

MCMC samples are not independent—each sample is correlated with the previous one. The effective sample size (ESS) is the number of independent samples the chain is equivalent to.

  • ESS $> 100$ per parameter: sufficient for reliable estimates ✓
  • ESS $< 100$: posterior summaries (especially tail quantiles) are unreliable ✗

We track bulk-ESS (accuracy of the posterior mean) and tail-ESS (accuracy of credible interval edges) separately.

Divergences

Hamiltonian Monte Carlo samplers (including PyMC's default) also produce a warning signal called a divergence when the sampler detects it may not be exploring some region of the posterior correctly. Even a single divergence is cause for concern.

  • Divergences $= 0$: ✓
  • Any divergences: investigate before reporting results ✗
# All three diagnostics at once (assumes `trace` from an earlier pm.sample call)
import arviz as az

print(az.summary(trace, var_names=['alpha', 'beta', 'sigma']))
# Inspect the r_hat, ess_bulk, and ess_tail columns

# Trace plots: each chain should look like a stationary random signal
az.plot_trace(trace)

Figure 3

Figure 3: Top: well-mixed chains for a single parameter—stationary, overlapping, no trends. This is what convergence looks like. Bottom: three pathological patterns—a stuck chain (red), a trending chain (blue), a chain with a sudden jump (green). All three have $\hat{R} > 1.01$.

5.3 When Diagnostics Fail

“When you have computational problems, often there is a problem with your model.” — Gelman et al.

This is the folk theorem of statistical computing. Sampler failures are usually symptoms of model problems, not sampler inadequacies. Before blaming the algorithm, ask whether the model itself is well specified.

| Symptom | Likely cause | First action |
|---|---|---|
| High $\hat{R}$ | Posterior has multiple modes; chains stuck in different ones | Check model for identifiability issues |
| Low ESS | Highly correlated samples; slow mixing | Run more samples; revisit model |
| Divergences | Posterior has difficult geometry in some region | Revisit prior specification |
| Extreme parameter values | Prior is vacuous | Tighten priors (return to Section 4) |
| Bimodal trace | Multiple posterior modes | Check for label-switching or structural identifiability |

6. The Posterior Predictive Distribution

Once you have the posterior $p(\theta \mid y)$, you can ask: if you were to repeat the experiment, what data would you expect to see?

This question is answered by the posterior predictive distribution:

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) p(\theta \mid y) d\theta$$

where $\tilde{y}$ denotes new (replicated) data. The posterior predictive distribution averages the likelihood $p(\tilde{y} \mid \theta)$ over all parameter values, weighted by how probable each $\theta$ is given the data.

6.1 Interpreting the Posterior Predictive

The posterior predictive $p(\tilde{y} \mid y)$ is:

  • A distribution over data, not over parameters
  • Wider than the likelihood at any single $\theta$ because it integrates over parameter uncertainty
  • A complete description of what your model expects future observations to look like

It has three sources of uncertainty:

  1. Sampling variability: even if we knew $\theta$ exactly, new observations would vary due to the likelihood
  2. Parameter uncertainty: we do not know $\theta$ exactly; the posterior spread contributes additional uncertainty
  3. Model uncertainty: your model may not be the true data-generating process. If it is misspecified, the posterior predictive will fail to capture some of the variation in the data (see posterior predictive checks, Section 7).

In practice, we approximate $p(\tilde{y} \mid y)$ by sampling:

# After fitting:
with model:
    ppc = pm.sample_posterior_predictive(trace, random_seed=42)

# ppc contains samples of ỹ from p(ỹ | y); the stored array has shape
# (chains, draws, n_observations), so flatten chain and draw into one axis:
y_rep = ppc.posterior_predictive['y_obs'].values
y_rep = y_rep.reshape(-1, y_rep.shape[-1])   # (chains * draws, n_observations)

Each row of y_rep is a complete replicated dataset—a single draw from the posterior predictive. The collection of these rows is your approximation to $p(\tilde{y} \mid y)$.

6.2 Predictive Uncertainty vs. Parameter Uncertainty

A common error is to report parameter uncertainty (the width of the posterior on $\theta$) when the scientific question calls for predictive uncertainty (the width of the posterior predictive on $\tilde{y}$).

Example: You fit a linear Hubble law to 20 Type Ia supernovae and obtain a tight posterior on $H_0 = 73.2 \pm 0.8$ km/s/Mpc. A colleague asks: “If I observe a new SN Ia at redshift $z = 0.05$, what distance modulus do you predict?”

The answer is not $\mu = 5\log_{10}(cz/H_0) \pm 0.03$ mag (the uncertainty from $H_0$ alone). The answer includes:

  • Parameter uncertainty on $H_0$ (propagated through the distance formula)
  • Intrinsic scatter of the SN Ia standardised candle (~0.15 mag)
  • Measurement error on the new observation

The posterior predictive correctly combines all of these. The $H_0$ posterior credible interval does not.
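Monte Carlo propagation makes the difference concrete. The numbers below follow the example in the text ($H_0$ posterior $73.2 \pm 0.8$, intrinsic scatter 0.15 mag), with an assumed 0.10 mag measurement error on the new observation:

```python
import numpy as np

rng = np.random.default_rng(7)
S = 100_000

c = 299_792.458     # speed of light, km/s
z = 0.05            # redshift of the new supernova

# Parameter uncertainty alone: draws from the H0 posterior, propagated
# through the low-z distance modulus (mu = 5 log10(d/Mpc) + 25, d = cz/H0)
H0 = rng.normal(73.2, 0.8, size=S)
mu_param = 25 + 5 * np.log10(c * z / H0)

# Predictive uncertainty: add intrinsic scatter and measurement error
mu_pred = mu_param + rng.normal(0, 0.15, S) + rng.normal(0, 0.10, S)

print(mu_param.std())   # parameter-only uncertainty: ≈ 0.024 mag
print(mu_pred.std())    # full predictive uncertainty: ≈ 0.18 mag
```

The predictive spread is almost an order of magnitude larger than the spread from $H_0$ alone—quoting the latter as the forecast uncertainty would be badly overconfident.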


7. Posterior Predictive Checks

The posterior predictive distribution gives us a way to test whether our model actually fits the data. A posterior predictive check (PPC) compares the real observed data to data simulated from the fitted model.

The logic is simple: if the model is correctly specified, the real data should look like a typical draw from the posterior predictive. If the real data looks unlike anything the posterior predictive would generate, the model is misspecified in some detectable way.

7.1 Test Statistics

Define a test statistic $T(y)$—a function of the data that captures some aspect of interest. Compute $T$ on the real data to get $T_\text{obs} = T(y_\text{obs})$. Compute $T$ on each replicated dataset to get the distribution $\{T(\tilde{y}^{(s)})\}_{s=1}^{S}$. Compare.

Good test statistics:

  • Are designed to be sensitive to specific model failures
  • Are defined before seeing the posterior (no post-hoc selection)
  • Are computable directly from observations (no latent variables)

Example test statistics for a parallax model:

  • $T_1 = \text{mean residual}$: checks for a systematic offset in parallax zero-point
  • $T_2 = \text{standard deviation of residuals}$: checks whether scatter matches the measurement uncertainties
  • $T_3 = \text{skewness of residuals}$: checks for asymmetry (the Gaussian likelihood assumes zero skew)
  • $T_4 = \text{fraction of outliers beyond } 3\sigma$: checks for heavy tails consistent with contaminant stars
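These four statistics are only a few lines of code. The sketch below evaluates them on synthetic standardised residuals (drawn from $\mathcal{N}(0,1)$, so all four should come out unremarkable):

```python
import numpy as np
from scipy import stats as sps

def parallax_test_stats(resid):
    # The four test statistics above, applied to standardised residuals
    return {
        'mean': np.mean(resid),                       # T1: zero-point offset
        'sd': np.std(resid),                          # T2: scatter vs. errors
        'skew': sps.skew(resid),                      # T3: asymmetry
        'outlier_frac': np.mean(np.abs(resid) > 3.0), # T4: heavy tails
    }

rng = np.random.default_rng(3)
resid = rng.standard_normal(500)   # synthetic residuals for illustration
print(parallax_test_stats(resid))
```

In a real PPC you would apply the same function to the observed residuals and to each replicated dataset, then compare.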

7.2 The Posterior Predictive P-value

The posterior predictive p-value (PPP-value) is:

$$p_B = P(T(\tilde{y}) \geq T(y_\text{obs}) \mid y_\text{obs}) = \int \mathbf{1}[T(\tilde{y}) \geq T(y_\text{obs})] p(\tilde{y} \mid y_\text{obs}) d\tilde{y}$$

A PPP-value near 0.5 means the real statistic falls near the centre of the posterior predictive distribution—the model is not failing on this test.

A PPP-value near 0 or 1 means the real statistic is an extreme value relative to what the model predicts—evidence of model misspecification.

⚠️
The PPP-value is not a frequentist p-value. It does not have a uniform distribution under $H_0$, and it does not have the same interpretation as a classical p-value. A PPP-value near 0.05 does not mean your model is “barely significant”—it means the observed statistic is at the 5th percentile of the posterior predictive distribution. Values of 0.05–0.95 generally indicate no serious misspecification. Values near 0 or 1 deserve investigation.

7.3 Visual Checks

Numerical PPP-values are useful, but visual checks are often more informative. The most powerful plots for PPCs:

Predictive distribution overlays: plot the histogram of the real data overlaid with 100 histograms of replicated datasets. If the real histogram is consistently outside the range of replicated histograms, the model is misspecified.

Residual plots: plot $(y_i - \hat{y}_i) / \hat{\sigma}_i$ against covariates, predicted values, or time. Residuals should look like independent draws from $\mathcal{N}(0, 1)$ if the model is correct.

Quantile-quantile (Q-Q) plots: compare the empirical quantiles of the data to the quantiles predicted by the model. Deviations from the diagonal indicate distribution misspecification.

Conditional predictive checks: split the data by a covariate (e.g., redshift bins, mass bins) and check whether the model fits each subset. The model may pass a global check but fail within subpopulations.

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Predictive distribution overlay
ax = axes[0]
for s in range(100):
    ax.hist(y_rep[s], bins=30, alpha=0.05, color='steelblue', density=True)
ax.hist(y_obs, bins=30, alpha=0.8, color='tomato', density=True, label='Observed')
ax.set_xlabel('Observed value')
ax.set_title('Predictive distribution check')
ax.legend()

# 2. PPP-value for standard deviation
T_obs = np.std(y_obs)
T_rep = np.std(y_rep, axis=1)
ppp = np.mean(T_rep >= T_obs)
ax = axes[1]
ax.hist(T_rep, bins=30, color='steelblue', density=True)
ax.axvline(T_obs, color='tomato', lw=2, label=f'Observed SD\nPPP = {ppp:.2f}')
ax.set_xlabel('SD of replicated datasets')
ax.set_title('PPP-value for standard deviation')
ax.legend()

# 3. Residuals vs. predicted values
y_pred_mean = y_rep.mean(axis=0)
y_pred_std  = y_rep.std(axis=0)
residuals = (y_obs - y_pred_mean) / y_pred_std
ax = axes[2]
ax.scatter(y_pred_mean, residuals, alpha=0.6, s=10)
ax.axhline(0, color='gray', lw=1)
ax.axhline(2, color='tomato', lw=1, linestyle='--')
ax.axhline(-2, color='tomato', lw=1, linestyle='--')
ax.set_xlabel('Predicted mean')
ax.set_ylabel('Standardised residual')
ax.set_title('Residuals vs. predicted values')

plt.tight_layout()

Figure 4

Figure 4: Three posterior predictive checks for a galaxy photometric redshift model. Left: the predictive distribution (blue ribbons) brackets the observed histogram (red), but slightly underestimates the low-$z$ peak—a mild failure. Middle: the observed standard deviation (red line) falls at the 12th percentile of the posterior predictive SD distribution—this is within acceptable range. Right: residuals show two systematic outliers at the highest predicted values—likely catastrophic photo-$z$ failures not captured by the model.

7.4 What To Do When a PPC Fails

A failed PPC tells you the model is misspecified in a specific, detectable way. Different failure modes suggest different fixes:

| PPC failure | Likely cause | Suggested action |
|---|---|---|
| Predicted mean systematically off | Missing covariate; wrong reference value | Add covariate; respecify intercept |
| Predicted scatter too small | Intrinsic scatter not modelled | Add $\sigma_\text{int}$ to the model |
| Predicted scatter too large | Likelihood too diffuse; outliers driving fit | Investigate outliers; use robust likelihood (e.g., Student-$t$) or model outliers |
| Heavy tails in residuals | Non-Gaussian errors | Replace Normal likelihood with Student-$t$ or model outliers |
| Systematic residuals vs. covariate | Non-linear relationship | Add polynomial or spline terms |
| Subpopulation failure | Mixture population | Model as mixture or add subpopulation indicator |

A PPC failure is not a reason to abandon your analysis—it is diagnostic information. It tells you specifically what is wrong and suggests a path forward. The workflow then cycles: revise the model, run the prior predictive check on the revised model, refit, and recheck.


8. A Complete Example: Stellar Distances from Parallax

The parallax example from Eadie et al. (2023) threads through the entire paper, illustrating each component of Bayesian inference on a problem every astronomer knows. We reproduce it here because it contains a lesson that no amount of abstract explanation can convey: the same data, combined with different priors, produces meaningfully different posteriors—and that is not a bug, it is a feature.

8.1 The Scientific Problem

A star at distance $d$ (in kpc) has a true parallax $\varpi = 1/d$ (in milliarcseconds, mas). We want the star’s distance, but we never measure it directly: we measure the parallax. And even the parallax is not observed exactly—we observe a noisy measurement $y$ with known measurement uncertainty $\sigma$.

Parameter: $d$ (distance in kpc), equivalently $\varpi = 1/d$ (parallax in mas)

Data: a single parallax measurement $y$ with known $\sigma$ (e.g., from Gaia DR2)

Relationship: $d[\text{kpc}] = 1 / \varpi[\text{mas}]$

This is a deceptively simple setup. The naive estimate of distance is $\hat{d} = 1/y$—just invert the measured parallax. But this ignores:

  1. The asymmetric propagation of uncertainty through the inversion ($\sigma_d = \sigma_\varpi / \varpi^2$, which diverges as $\varpi \to 0$)
  2. The fact that there are more stars at larger distances (volume increases as $r^2$), so we expect nearby stars to be a minority of any flux-limited sample

Both of these corrections require a proper Bayesian treatment.
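A quick Monte Carlo sketch makes point 1 concrete (all values below are illustrative, not real Gaia numbers): even though $y$ is an unbiased measurement of the parallax, the naive inversion $1/y$ is biased in distance, and the bias grows with the fractional error.

```python
import numpy as np

rng = np.random.default_rng(42)
true_parallax = 1.0   # mas, so the true distance is exactly 1 kpc
sigma = 0.2           # mas; a 20% fractional error (hypothetical)

y = rng.normal(true_parallax, sigma, size=100_000)  # simulated measurements
y = y[y > 0]          # 1/y is undefined for non-positive draws
d_naive = 1.0 / y     # naive inversion estimator

print(f"mean of 1/y:   {d_naive.mean():.3f} kpc (true distance: 1.000 kpc)")
print(f"median of 1/y: {np.median(d_naive):.3f} kpc")
```

The mean of $1/y$ lands a few percent high even though the median stays near the truth—the asymmetry of the inversion at work.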

8.2 The Likelihood

The Gaia parallax measurement error is well-characterised as Gaussian. The likelihood in parallax space is:

$$y \mid \varpi, \sigma \sim \mathcal{N}(\varpi, \sigma^2)$$

or written out in full:

$$p(y \mid \varpi, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \varpi)^2}{2\sigma^2}\right]$$

Substituting $\varpi = 1/d$ to re-express in terms of distance:

$$p(y \mid d, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - 1/d)^2}{2\sigma^2}\right]$$

Key point: the likelihood is Gaussian in $y$ for fixed $d$, but it is not Gaussian as a function of $d$ for fixed $y$. The non-linear transformation $\varpi = 1/d$ distorts the shape. This is shown in Figure 5 (centre-left panel): the likelihood as a function of distance is asymmetric, with a longer tail toward larger $d$.

8.3 Three Candidate Priors

This is where the interesting choices arise. What do we believe about the distance to a randomly selected star before we look at the parallax?

Prior 1: Uniform in parallax

$$p(\varpi) \propto 1, \quad \varpi \in [\varpi_\text{min}, \varpi_\text{max}]$$

or in distance space (via the change-of-variables $p(d) = p(\varpi) |d\varpi/dd|$):

$$p(d) \propto \frac{1}{d^2}$$

This prior over-weights nearby stars. It is rarely physically motivated.

Prior 2: Uniform in distance

$$p(d) \propto 1, \quad d \in [d_\text{min}, d_\text{max}]$$

A flat prior over distance. This sounds like “maximum ignorance,” but it is actually informative: it says every interval of distance is equally likely, regardless of how many stars it contains. In a galaxy with stars distributed throughout a disk, this is not a good description of reality.

Prior 3: Physically-motivated (Bailer-Jones et al. 2018)

$$p(d) \propto d^2 \exp\left(-d / L\right), \quad d \in [d_\text{min}, d_\text{max}]$$

This is the exponentially decreasing space density prior. The $d^2$ factor comes from the volume element of a spherical shell: at distance $d$, the number of stars in a thin shell of width $dd$ is proportional to $4\pi d^2 \cdot n \cdot dd$, where $n$ is the local number density. The $\exp(-d/L)$ factor accounts for the fact that stellar density drops at large distances (the disk scale length $L$, typically a few kpc). The hyperparameter $L$ must be chosen from prior knowledge of the Milky Way structure.

ℹ️

How the choice of $L$ encodes domain knowledge

The scale length $L$ cannot be chosen arbitrarily. It represents real knowledge about the Galaxy:

  • For nearby thin-disk stars, $L \sim 1$–$2$ kpc is reasonable
  • For halo stars or deep pencil-beam surveys, larger $L$ is appropriate
  • Choosing $L$ too small biases distances toward nearby estimates; too large and the exponential cutoff does little work

This is prior elicitation in action: you are encoding what you know about the three-dimensional distribution of stars in the Milky Way into the inference about a single star’s distance.
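One convenient property, easy to verify from the density: $p(d) \propto d^2 e^{-d/L}$ is a Gamma distribution with shape 3 and scale $L$, so prior draws are one line of NumPy. The sketch below (scale lengths are illustrative) shows how the choice of $L$ moves the bulk of the prior:

```python
import numpy as np

rng = np.random.default_rng(0)
for L in (0.5, 1.35, 4.0):  # candidate scale lengths in kpc (illustrative)
    # p(d) ∝ d^2 exp(-d/L) is Gamma(shape=3, scale=L): mode 2L, mean 3L
    d = rng.gamma(shape=3.0, scale=L, size=200_000)
    print(f"L = {L:4.2f} kpc -> prior mode = {2 * L:.2f} kpc, "
          f"sample mean = {d.mean():.2f} kpc")
```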

8.4 The Three Posteriors

Applying Bayes’ theorem, the three posteriors are:

$$p(d \mid y) \propto p(y \mid d) p(d)$$

With the uniform parallax prior ($p(d) \propto 1/d^2$):

$$p(d \mid y) \propto \frac{1}{d^2} \exp\left[-\frac{(y - 1/d)^2}{2\sigma^2}\right]$$

With the uniform distance prior ($p(d) \propto 1$):

$$p(d \mid y) \propto \exp\left[-\frac{(y - 1/d)^2}{2\sigma^2}\right]$$

With the Bailer-Jones prior ($p(d) \propto d^2 e^{-d/L}$):

$$p(d \mid y) \propto d^2 \exp\left(-\frac{d}{L} - \frac{(y - 1/d)^2}{2\sigma^2}\right)$$

None of these have closed-form normalising constants—they must be evaluated numerically.
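For a one-dimensional parameter, a grid evaluation is enough. The sketch below normalises each unnormalised posterior above by simple Riemann summation; `y`, `sigma`, and `L` are illustrative values, not real Gaia data:

```python
import numpy as np

y, sigma, L = 1.0, 0.2, 1.35          # mas, mas, kpc (illustrative values)
d = np.linspace(0.05, 5.0, 2000)      # distance grid in kpc
dd = d[1] - d[0]

loglike = -(y - 1.0 / d) ** 2 / (2.0 * sigma ** 2)
unnorm = {
    "uniform in parallax": np.exp(loglike) / d ** 2,
    "uniform in distance": np.exp(loglike),
    "Bailer-Jones":        d ** 2 * np.exp(-d / L + loglike),
}
# normalise numerically: divide by the Riemann-sum approximation of the integral
posteriors = {name: p / (p.sum() * dd) for name, p in unnorm.items()}

for name, p in posteriors.items():
    print(f"{name:20s} mode = {d[np.argmax(p)]:.3f} kpc, "
          f"mean = {(d * p).sum() * dd:.3f} kpc")
```

Running this reproduces the qualitative ordering discussed below: the uniform-parallax posterior peaks at the smallest distance and the Bailer-Jones posterior at the largest.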


Figure 5 (reproduced from Eadie et al. 2023, Figure 1): Four panels showing Bayesian inference of distance $d = 1/\varpi$ from a measured parallax $y$. Far left: the likelihood is Gaussian in parallax. Centre-left: re-expressed as a function of distance, the likelihood becomes non-Gaussian and asymmetric. Centre-right: three candidate priors over distance. Far right: the three corresponding posteriors. The posteriors differ substantially, especially in the tails.

The key insight from this figure: the three posteriors differ substantially. The uniform-parallax posterior (blue) is shifted toward smaller distances; the uniform-distance posterior (red) is most spread; the Bailer-Jones posterior (orange) is the most compact and physically motivated. Which one you report is a modelling choice, and it matters.

A natural summary statistic for each posterior is different, too. For a skewed distribution like these:

  • The mode (peak of the posterior) answers “what is the single most probable distance?”
  • The median answers “what distance has equal probability mass on each side?”
  • A credible interval answers “what range contains 90% of my posterior probability?”


Figure 6 (reproduced from Eadie et al. 2023, Figure 2): The three posteriors from Figure 5 with summary statistics overlaid: mode (dotted), median (dot-dash), and 90% credible interval (dashed lines with shading). The summaries differ across posteriors and differ from one another within the same posterior—an asymmetric posterior cannot be adequately summarised by a single point estimate.

8.5 Extended Example: Distance to the Cluster M67

Now suppose instead of a single star we have $n$ stars that are all members of the same open cluster, each at approximately the same distance $d_\text{cluster}$. Each star $i$ has a Gaia parallax measurement $y_i$ with known uncertainty $\sigma_i$.

The combined likelihood (assuming measurements are independent given the cluster distance):

$$p(y_1, \ldots, y_n \mid d_\text{cluster}) = \prod_{i=1}^{n} \mathcal{N}\left(y_i \mid \frac{1}{d_\text{cluster}}, \sigma_i^2\right)$$

The posterior:

$$p(d_\text{cluster} \mid y_1, \ldots, y_n) \propto p(d_\text{cluster}) \prod_{i=1}^{n} p(y_i \mid d_\text{cluster})$$

With Gaussian likelihoods, the product simplifies via sufficient statistics. Define the combined precision $\tau^{-2}$ and the precision-weighted effective parallax $\varpi_\text{eff}$:

$$\tau^{-2} = \sum_{i=1}^{n} \sigma_i^{-2}, \qquad \varpi_\text{eff} = \tau^2 \sum_{i=1}^{n} \frac{y_i}{\sigma_i^2}$$

Then the posterior becomes:

$$p(d_\text{cluster} \mid \mathbf{y}) \propto p(d_\text{cluster}) \exp\left[-\frac{(\varpi_\text{eff} - 1/d_\text{cluster})^2}{2\tau^2}\right]$$

This is the same form as the single-star posterior, but with the individual measurement $y$ replaced by the precision-weighted combination $\varpi_\text{eff}$, and the individual uncertainty $\sigma$ replaced by the combined uncertainty $\tau$. Combining $n$ stars reduces the effective uncertainty by approximately $\sqrt{n}$.
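In code, the sufficient-statistic combination is a few lines. The star count and uncertainties below are simulated stand-ins, not real M67 data:

```python
import numpy as np

rng = np.random.default_rng(1)
true_parallax = 1.1                       # mas (illustrative cluster)
n = 50
sigma_i = rng.uniform(0.05, 0.3, size=n)  # per-star uncertainties, mas
y_i = rng.normal(true_parallax, sigma_i)  # simulated parallax measurements

tau = 1.0 / np.sqrt(np.sum(sigma_i ** -2.0))          # combined uncertainty
parallax_eff = tau ** 2 * np.sum(y_i / sigma_i ** 2)  # precision-weighted

print(f"effective parallax: {parallax_eff:.4f} mas")
print(f"combined sigma: {tau:.4f} mas "
      f"(best single star: {sigma_i.min():.4f} mas)")
```

The combined $\tau$ comes out smaller than even the best individual star’s uncertainty, as the $\sqrt{n}$ argument predicts.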

⚠️

The cluster prior is not the same as the single-star prior. For a single star, we used a prior that encodes the spatial distribution of all stars in the Galaxy—far away stars are more numerous. For a cluster, we already know it is a cluster, and we have prior information about where M67 is located (it is a well-studied nearby cluster). The cluster-level prior should encode that specific knowledge, not the generic all-sky stellar distribution.

Swapping in the single-star prior for the cluster prior would be a modelling error: you would be encoding the wrong domain knowledge.

Eadie et al. (2023) demonstrate this with real Gaia DR2 data for M67 members, selected via proper motion criteria. Figure 7 shows the effect of sequentially adding more cluster members: with few stars, the prior substantially shapes the posterior; as $n$ grows, the combined parallax information dominates and the posterior tightens around the true cluster distance.


Figure 7 (reproduced from Eadie et al. 2023, Figure 3): Top: Gaia parallax measurements for M67 members, sorted by signal-to-noise ratio $y_i / \sigma_i$. Bottom: the posterior for the cluster parallax $1/d_\text{cluster}$ as more stars are added sequentially. Left panel shows the Gaussian prior. When few stars are included, the prior location is important; as more are added, the data dominate and the posterior narrows.

8.6 Posterior Predictive Check

After fitting the M67 cluster model, we ask: does the fitted model generate data that looks like the observed parallaxes?

The procedure:

  1. Draw a cluster distance $d^{(s)} \sim p(d_\text{cluster} \mid \mathbf{y})$ from the posterior
  2. For each star $i$, simulate a predicted parallax $\tilde{y}_i^{(s)} \sim \mathcal{N}(1/d^{(s)}, \sigma_i^2)$
  3. Repeat $S$ times to build up the posterior predictive distribution

The result is a collection of simulated datasets $\{\tilde{y}_i^{(s)}\}$, each drawn from the posterior predictive $p(\tilde{y} \mid \mathbf{y})$.
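The three steps translate directly into a loop. Here `posterior_d` is a stand-in for actual MCMC draws of the cluster distance—everything below is simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_i = rng.uniform(0.05, 0.3, size=50)       # per-star uncertainties (fake)
posterior_d = rng.normal(0.9, 0.01, size=1000)  # stand-in posterior draws, kpc

S = len(posterior_d)
y_rep = np.empty((S, len(sigma_i)))
for s, d_s in enumerate(posterior_d):
    # step 1: a posterior draw d^(s); step 2: one simulated parallax per star
    y_rep[s] = rng.normal(1.0 / d_s, sigma_i)
# step 3: y_rep now holds S replicated datasets from the posterior predictive

print("replicated datasets:", y_rep.shape)
print(f"posterior predictive mean parallax: {y_rep.mean():.3f} mas")
```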

Eadie et al. (2023) compare the observed parallax distribution to the posterior predictive in two ways:

Density overlay (Figure 8, top): plot the histogram of observed parallaxes $\{y_i\}$ alongside the posterior predictive distribution. By eye, the two look roughly consistent.

Q-Q plot (Figure 8, bottom): plot the quantiles of the observed data against the quantiles of the posterior predictive. If both follow the same distribution, the points should fall on the 1:1 line.

The Q-Q plot reveals what the density overlay conceals: there are clear systematic departures from the 1:1 line below the 10th percentile and above the 70th percentile. The tails of the observed parallax distribution do not match what the model predicts.
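A minimal sketch of the Q-Q construction, using synthetic stand-ins (a heavy-tailed “observed” sample against a Gaussian predictive) to show how tail mismatches surface as departures from the 1:1 line:

```python
import numpy as np

rng = np.random.default_rng(3)
y_obs = rng.standard_t(df=3, size=200) * 0.1 + 1.1  # heavy-tailed "observed"
y_rep = rng.normal(1.1, 0.1, size=(1000, 200))      # Gaussian predictive

probs = np.linspace(0.01, 0.99, 99)
q_obs = np.quantile(y_obs, probs)          # observed quantiles
q_rep = np.quantile(y_rep.ravel(), probs)  # pooled predictive quantiles

# On a Q-Q plot you would draw q_rep against q_obs; here we just report the
# tail gaps, where the heavy observed tails escape the Gaussian model.
print(f"gap at 1st percentile:  {q_obs[0] - q_rep[0]:+.3f} mas")
print(f"gap at 99th percentile: {q_obs[-1] - q_rep[-1]:+.3f} mas")
```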


Figure 8 (reproduced from Eadie et al. 2023, Figure 4): Top: the observed parallax distribution (gray) and posterior predictive (light blue). They look roughly consistent by eye. Bottom: the Q-Q plot comparing quantiles of the posterior predictive (x-axis) to quantiles of the observations (y-axis). Strong deviations are apparent below ~10th percentile and above ~70th percentile—evidence of model misspecification the density plot missed.

What does this tell us? The Gaussian likelihood assumption—that measurement errors are perfectly Normal—may not hold for all M67 members. Some stars may be non-members contaminating the sample despite the proper-motion cut. Others may have underestimated or overestimated errors. The PPC, via the Q-Q plot, has identified a specific, actionable failure mode: the tails of the parallax distribution are inconsistent with the model. The next step is to investigate—perhaps by adding a mixture component for contaminant stars, or by using a more robust (heavier-tailed) likelihood.

This is the workflow in action: the prior predictive check validated the setup, the posterior converged, but the posterior predictive check revealed that the model does not fully capture the data. The inference is not finished until the PPC passes.


9. Recommendations: A Practical Checklist

Based on Eadie et al. (2023), here is a summary checklist for every Bayesian analysis.

Before fitting:

  • Always start with the simplest model you can, and build complexity
  • Likelihood matches the data type (not Normal for counts or positive-only quantities)
  • Heteroscedastic errors handled per-observation, not as a single $\sigma$
  • Intrinsic scatter included if data exceed measurement error predictions
  • All priors are proper and justified by domain knowledge or literature
  • No vacuous priors (spanning physically impossible ranges)
  • No restrictive priors (so tight that data cannot shift them)
  • Prior predictive check passed: simulated data falls in plausible range
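As a concrete instance of the last item, a prior predictive check for the single-star parallax model takes only a few lines (the scale length and noise level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
L, sigma = 1.35, 0.2                      # kpc, mas (illustrative choices)
d_prior = rng.gamma(3.0, L, size=10_000)  # draws from p(d) ∝ d² exp(-d/L)
y_sim = rng.normal(1.0 / d_prior, sigma)  # one simulated parallax per draw

lo, hi = np.percentile(y_sim, [1, 99])
print(f"98% of simulated parallaxes fall in [{lo:.2f}, {hi:.2f}] mas")
```

If the simulated range looks nothing like plausible Gaia measurements, revise the prior before fitting.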

During optimization:

  • Can the model be represented in a convex way?
  • Is the optimization failing to converge?
  • Is the optimizer taking appropriate steps in parameter space?
  • Does the optimized result change if I change the initial guess?

During MCMC:

  • $\hat{R} \leq 1.01$ for all parameters
  • Bulk-ESS $> 100$ for all parameters (ideally $> 400$)
  • Tail-ESS $> 100$ for all parameters
  • Zero divergences (or explained and resolved)
  • Trace plots show stationary, mixing chains
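To make the first item concrete, here is a from-scratch sketch of the split-$\hat{R}$ statistic (the basic version, without the rank-normalisation of Vehtari et al. 2021; in practice you would read these numbers off `az.summary`):

```python
import numpy as np

def split_rhat(chains):
    """chains: array of shape (n_chains, n_draws). Split each chain in half,
    then compare between-half variance B to within-half variance W."""
    n = chains.shape[1] // 2
    halves = chains[:, : 2 * n].reshape(-1, n)  # 2*n_chains halves, length n
    means = halves.mean(axis=1)
    B = n * means.var(ddof=1)                   # between-chain variance
    W = halves.var(axis=1, ddof=1).mean()       # within-chain variance
    var_plus = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
good = rng.normal(0.0, 1.0, size=(4, 1000))          # well-mixed chains
bad = good + np.array([[0.0], [0.0], [0.0], [3.0]])  # one stuck chain
print(f"R-hat (mixing): {split_rhat(good):.3f}")     # near 1.00
print(f"R-hat (stuck):  {split_rhat(bad):.3f}")      # well above 1.01
```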

After fitting:

  • Posterior predictive check conducted with pre-specified test statistics
  • Visual checks (distribution overlay, residuals, Q-Q) performed
  • PPP-values computed and reported
  • If PPC fails: model revised and workflow repeated from Phase 2
  • If comparing models: LOO or WAIC computed; uncertainty on comparison reported

Communicating your results:

  • Specify the data: their format, dimensions, and uncertainties
  • Describe your data-generating model
  • List all parameters with a short description, their priors, and units
  • Run prior predictive checks to test the choice of priors and confirm they can generate sensible-looking data
  • Specify the likelihood function
  • Generate toy data and show that you can recover it without bias (this tests many things about optimization, inference, and model capacity)
  • Run posterior predictive checks with pre-specified test statistics
  • Say what you are communicating about your posteriors—if it is not something useful, why did you go to the effort of sampling?
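The toy-data recovery item deserves emphasis, so here is a minimal end-to-end sketch for the parallax problem: simulate measurements at a known distance, fit on a grid with a flat prior, and check the credible interval (all settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d_true, n = 0.9, 30                       # kpc, number of simulated stars
sigma_i = rng.uniform(0.05, 0.3, size=n)  # per-star uncertainties, mas
y_i = rng.normal(1.0 / d_true, sigma_i)   # simulated parallaxes

d = np.linspace(0.5, 1.5, 4000)           # grid; flat prior for simplicity
dd = d[1] - d[0]
loglike = sum(-(y - 1.0 / d) ** 2 / (2.0 * s ** 2)
              for y, s in zip(y_i, sigma_i))
post = np.exp(loglike - loglike.max())
post /= post.sum() * dd                   # normalise numerically

cdf = np.cumsum(post) * dd                # 90% equal-tail credible interval
lo, hi = d[np.searchsorted(cdf, 0.05)], d[np.searchsorted(cdf, 0.95)]
print(f"true d = {d_true} kpc, 90% interval = [{lo:.3f}, {hi:.3f}] kpc")
print("truth recovered:", bool(lo <= d_true <= hi))
```

Repeating this over many simulated datasets (the truth should land inside the 90% interval about 90% of the time) is a cheap calibration test of the whole pipeline.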

10. Summary

Bayesian inference has three pillars: the likelihood, the prior, and the posterior. Getting any one of them wrong can silently invalidate your conclusions. The tools introduced in this lecture—posterior predictive checks, convergence diagnostics, and model comparison—are how you detect and fix these failures.

The likelihood is your model of the data-generating process. It must respect the data type, handle measurement errors correctly, and include all sources of variance. Using a Normal distribution for positive-definite quantities is a mistake. Ignoring intrinsic scatter is a mistake. Treating the likelihood as a probability distribution over parameters is a mistake.

The prior is your formal statement of domain knowledge. It should be proper, weakly informative, and evaluated by what it implies in observation space—not parameter space. There are only two ways a prior goes wrong: too wide (vacuous, contributing nothing) or too narrow (restrictive, overriding your data). Both are detectable through prior predictive checks.

Computing the posterior often (but not always) requires numerical methods. If you must resort to MCMC, verify that it has not failed silently: check $\hat{R}$, ESS, and divergences—all three, for every parameter. Computational problems often signal model problems; treat them as diagnostic information.

The posterior predictive distribution is the bridge between your fitted model and the observable world. Posterior predictive checks compare the real data to data generated from the fitted model, using test statistics defined in observation space. A failed check diagnoses misspecification specifically and points toward the right fix.

The Bayesian workflow—prior to posterior to predictive check, iterating as needed—is a systematic protection against fooling yourself with your own model. Every step described in this lecture has its place in that cycle.


Further Reading

The primary reference for this lecture:

  • Eadie, G. M., Speagle, J. S., Cisewski-Kehe, J., et al. (2023). Practical Guidance for Bayesian Inference in Astronomy. arXiv:2302.04703. [The paper this lecture is based on—10 pages, read it in full]

Broader workflow references:

  • Betancourt, M. (2020). Towards a Principled Bayesian Workflow. betanalpha.github.io/case-studies. [The inspiration for our workflow structure]
  • Gelman, A., Vehtari, A., et al. (2020). Bayesian Workflow. arXiv:2011.01808. [The comprehensive 77-page reference]

Diagnostics:

  • Vehtari, A., Gelman, A., Simpson, D., et al. (2021). Rank-normalization, folding, and localization: An improved $\hat{R}$. Bayesian Analysis, 16(2), 667–718. [The current best-practice $\hat{R}$]
  • Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–807. [The PPP-value framework]

Software:

  • PyMC documentation: pm.sample, pm.sample_prior_predictive, pm.sample_posterior_predictive
  • ArviZ: az.summary, az.plot_trace, az.plot_ppc, az.loo, az.compare