Bayesian Updating in Practice
By the end of this lecture, you will understand how to perform Bayesian updates in practice for both discrete and continuous parameters. You’ll learn to apply the elegant conjugate prior-likelihood pairs including Beta-Binomial, Gamma-Poisson, and Normal-Normal models. When conjugacy isn’t available, you’ll use grid approximation for non-conjugate problems. You’ll also develop the skill to elicit reasonable priors from domain knowledge, perform sequential updating where yesterday’s posterior becomes today’s prior, interpret credible intervals correctly, and conduct prior sensitivity analysis to assess how robust your conclusions are to prior choices.
1. Bayes’ Theorem
Last lecture, we introduced Bayes’ theorem from the product rule:
$$P(\theta \text{ and } D) = P(\theta \mid D) \cdot P(D) = P(D \mid \theta) \cdot P(\theta)$$
$$P(\theta \mid D) \cdot P(D) = P(D \mid \theta) \cdot P(\theta)$$
$$P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)}$$
Each component has a specific role in Bayesian inference. The parameter $\theta$ represents the hypothesis or quantity we want to learn about, while $D$ represents our observed data. Our prior $P(\theta)$ encodes our belief before seeing data. The likelihood $P(D \mid \theta)$ describes the probability of observing the data given a particular parameter value. After observing data, we update our beliefs to the posterior $P(\theta \mid D)$, which represents our updated belief after seeing the evidence. Finally, the evidence $P(D)$ serves as a marginal likelihood that normalizes the posterior distribution.
Marginal likelihood (summing/integrating over all hypotheses):
$$P(D) = \sum_\theta P(D \mid \theta) \cdot P(\theta) \quad \text{(discrete)}$$
$$P(D) = \int P(D \mid \theta) \cdot P(\theta) d\theta \quad \text{(continuous)}$$
Proportional form (often more useful):
$$P(\theta \mid D) \propto P(D \mid \theta) \cdot P(\theta)$$
We can ignore the evidence $P(D)$ and normalize at the end.
Today’s focus: How to actually compute posteriors in practice, for various likelihood-prior combinations.
A brief aside on Bayesian versus frequentist perspectives is worthwhile here.
These are two different worldviews about probability. The frequentist view defines probability as the long-run frequency of an event over many hypothetical repetitions of the experiment (repetitions we cannot actually perform), and it has no concept of a prior belief or of marginalizing (summing out) latent parameters. The Bayesian view treats probability as a degree of belief that an event will occur, and we update those beliefs about the world as new information arrives. In many cases the Bayesian result coincides with the frequentist one.
2. Discrete Bayesian Updating: A Step-by-Step Example
Let’s start with a fully worked discrete example. This builds intuition before we move to continuous distributions.
2.1 Problem Setup: Exoplanet Transit Detection
Scenario: You’re monitoring a star for planetary transits. Over several observing windows, you look for periodic dimming events consistent with a transiting planet.
Parameter: $\theta$ = probability of detecting a transit during an observing window (if a planet exists and is transiting).
Why $\theta$ varies:
- Weather (clouds block some observations)
- Instrumental effects (detector noise, systematics)
- Orbital geometry (impact parameter, limb darkening)
Simplification: We’ll use a discrete prior with only a few possible values of $\theta$. (In reality, $\theta$ is continuous, but discrete is easier to understand initially.)
Possible hypotheses:
- $\theta = 0.0$: No planet (all “detections” are noise)
- $\theta = 0.3$: Planet exists but weak signal (many misses)
- $\theta = 0.7$: Planet exists with strong signal (few misses)
- $\theta = 1.0$: Perfect detection (never miss a transit)
2.2 Prior Distribution
Before observing any transits, what’s our belief about $\theta$? Let’s construct a discrete prior distribution. We assign a 20% probability to $\theta = 0.0$ (no planet, all detections are noise), another 20% to $\theta = 0.3$ (planet exists but with weak signal), 30% to $\theta = 0.7$ (strong signal—most likely), and finally 30% to $\theta = 1.0$ (perfect detection, never miss a transit). These probabilities sum to 1.0 as required.
This prior reflects our belief before data. We think a good signal ($\theta \geq 0.7$, 60% total probability) is considerably more likely than a weak signal (20%), and we’re 80% confident a planet exists ($\theta > 0$).

Figure 1: Animation showing the discrete Bayesian update process as we observe data.
2.3 Data: Three Observations
During your observing run, you monitor the star during 3 predicted transit windows. In Window 1, you detect a transit. Window 2 also shows a detection. But in Window 3, you see no transit. This gives you 2 detections and 1 non-detection from 3 observing windows.
2.4 Likelihood Function
For each hypothesis $\theta$, we need to calculate how likely our observed data is. We assume that detections are independent events—each window is effectively a separate coin flip with probability $\theta$ of success. The likelihood of the observed sequence is then a product of Bernoulli terms (no binomial coefficient appears because the order of outcomes is fixed, and it would cancel in the posterior anyway):
$$P(D \mid \theta) = \theta^k (1-\theta)^{n-k}$$
where:
- $n = 3$ (total observations)
- $k = 2$ (detections)
- $D = \{\text{Yes, Yes, No}\}$
For each hypothesis:
$$P(D \mid \theta) = \theta^2 (1-\theta)^1$$
Let’s compute:
| $\theta$ | Likelihood $\theta^2 (1-\theta)$ |
|---|---|
| 0.0 | $0.0^2 \times 1.0 = 0.000$ |
| 0.3 | $0.3^2 \times 0.7 = 0.063$ |
| 0.7 | $0.7^2 \times 0.3 = 0.147$ |
| 1.0 | $1.0^2 \times 0.0 = 0.000$ |
The likelihood values tell us something important. Both $\theta = 0.0$ and $\theta = 1.0$ have zero likelihood. The first is ruled out because we actually got detections, and the second is ruled out because we missed one transit. The hypothesis $\theta = 0.7$ has the highest likelihood (0.147), while $\theta = 0.3$ is possible but less likely (0.063).
2.5 Posterior Calculation
Apply Bayes’ theorem for each hypothesis:
$$P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)}$$
Step 1: Compute unnormalized posterior (prior × likelihood):
| $\theta$ | Prior $P(\theta)$ | Likelihood $P(D \mid \theta)$ | Prior × Likelihood |
|---|---|---|---|
| 0.0 | 0.2 | 0.000 | 0.0000 |
| 0.3 | 0.2 | 0.063 | 0.0126 |
| 0.7 | 0.3 | 0.147 | 0.0441 |
| 1.0 | 0.3 | 0.000 | 0.0000 |
Step 2: Compute evidence (sum of unnormalized posteriors):
$$P(D) = 0.0000 + 0.0126 + 0.0441 + 0.0000 = 0.0567$$
Step 3: Normalize (divide each unnormalized posterior by $P(D)$):
| $\theta$ | Unnormalized | Posterior $P(\theta \mid D)$ |
|---|---|---|
| 0.0 | 0.0000 | $0.0000 / 0.0567 = 0.000$ |
| 0.3 | 0.0126 | $0.0126 / 0.0567 = 0.222$ |
| 0.7 | 0.0441 | $0.0441 / 0.0567 = 0.778$ |
| 1.0 | 0.0000 | $0.0000 / 0.0567 = 0.000$ |
As expected, these normalized posteriors sum to 1.000.
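The three steps above can be reproduced in a few lines of NumPy; the following is a minimal sketch of the same table calculation:

```python
import numpy as np

# Hypotheses and prior from Section 2.2
thetas = np.array([0.0, 0.3, 0.7, 1.0])
prior = np.array([0.2, 0.2, 0.3, 0.3])

# Likelihood of 2 detections and 1 miss: theta^2 * (1 - theta)
likelihood = thetas**2 * (1 - thetas)

# Unnormalized posterior, evidence, and normalized posterior
unnorm = prior * likelihood
evidence = unnorm.sum()          # P(D) = 0.0567
posterior = unnorm / evidence    # [0.000, 0.222, 0.778, 0.000]

print(evidence, posterior)
```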
2.6 Interpretation: What Did We Learn?
Before seeing any data, our prior suggested that $\theta = 0.7$ or $\theta = 1.0$ were most likely (30% each), with $\theta = 0.3$ somewhat plausible (20%) and $\theta = 0.0$ unlikely but not impossible (20%). After observing the data, the posterior distribution has changed dramatically. Now $\theta = 0.7$ dominates with 78% probability, $\theta = 0.3$ is possible at 22%, and both extreme values ($\theta = 0.0$ and $\theta = 1.0$) are completely ruled out.
What changed during this update? The data ruled out the extremes—we can’t have $\theta=0$ because we got detections, and we can’t have $\theta=1$ because we missed one transit. The data concentrated probability on $\theta = 0.7$, which is the hypothesis most compatible with a 2/3 success rate. We’re now 78% confident that $\theta = 0.7$. This is Bayesian learning in action: we started uncertain, observed data, and updated to a sharper, more informed belief.
The evidence as a normalizing constant:
We computed $P(D) = 0.0567$. What does this mean?
Interpretation: Before observing, the probability of seeing exactly 2 detections and 1 miss (summing over all possible $\theta$ values, weighted by prior) is 5.67%.
Why we often ignore it:
- For parameter inference, we only need relative probabilities (posterior ratios)
- The evidence is the same for all $\theta$, so it cancels when comparing hypotheses
- We can always normalize at the end (divide by sum)
When it matters:
- Model comparison: Comparing different models (e.g., planet vs. no planet)—the evidence ratio is the Bayes factor
- Posterior predictive distributions: Forecasting future data
For now, just remember: Posterior ∝ Likelihood × Prior. The evidence makes it sum to 1.
3. Sequential Updating: Today’s Posterior is Tomorrow’s Prior
One of the elegant features of Bayesian inference: you can update incrementally as new data arrives.
3.1 The Principle
After observing $D_1$, you have posterior $P(\theta \mid D_1)$.
New data $D_2$ arrives. Update:
$$P(\theta \mid D_1, D_2) = \frac{P(D_2 \mid \theta) \cdot P(\theta \mid D_1)}{P(D_2 \mid D_1)}$$
Key insight: Use $P(\theta \mid D_1)$ (yesterday’s posterior) as the new prior.
For independent observations:
$$P(D_2 \mid \theta, D_1) = P(D_2 \mid \theta)$$
(Knowing $D_1$ doesn’t change the likelihood of $D_2$ once you know $\theta$.)
3.2 Continuing the Transit Example
Recall: After 3 observations (2 detected, 1 missed), our posterior was:
- $P(\theta = 0.3 \mid D_1) = 0.222$
- $P(\theta = 0.7 \mid D_1) = 0.778$
New observation (Window 4): Detected (Yes)
Likelihood for this single observation:
| $\theta$ | $P(\text{Yes} \mid \theta) = \theta$ |
|---|---|
| 0.3 | 0.3 |
| 0.7 | 0.7 |
Update:
| $\theta$ | Prior $P(\theta \mid D_1)$ | Likelihood $P(\text{Yes} \mid \theta)$ | Unnormalized Posterior |
|---|---|---|---|
| 0.3 | 0.222 | 0.3 | $0.222 \times 0.3 = 0.0666$ |
| 0.7 | 0.778 | 0.7 | $0.778 \times 0.7 = 0.5446$ |
Evidence: $P(\text{Yes} \mid D_1) = 0.0666 + 0.5446 = 0.6112$
Posterior after 4 observations:
| $\theta$ | $P(\theta \mid D_1, D_2)$ |
|---|---|
| 0.3 | $0.0666 / 0.6112 = 0.109$ |
| 0.7 | $0.5446 / 0.6112 = 0.891$ |
Result: Now 89% confident that $\theta = 0.7$ (up from 78%).
The additional detection increased our confidence in the strong-signal hypothesis.
3.3 Batch vs. Sequential: Same Answer
Alternative approach: Update all at once with all 4 observations (3 detections, 1 miss).
Combined likelihood:
$$P(D \mid \theta) = \theta^3 (1-\theta)^1$$
| $\theta$ | $\theta^3 (1-\theta)$ | Prior | Prior × Likelihood | Posterior |
|---|---|---|---|---|
| 0.3 | $0.3^3 \times 0.7 = 0.0189$ | 0.2 | 0.00378 | 0.109 |
| 0.7 | $0.7^3 \times 0.3 = 0.1029$ | 0.3 | 0.03087 | 0.891 |
The posterior is identical! Whether you update sequentially (observation by observation) or in batch (all at once), the final answer is the same.
Why sequential updating is useful:
Sequential updating enables online learning where you update your model as data streams in, such as from nightly survey observations. It’s memory efficient because you don’t need to store all historical data—just the current posterior. It provides conceptual clarity by letting you see how beliefs evolve with each new piece of evidence. This makes science communication more natural: “After 10 SNe, we thought $H_0 \approx 72$. After 20 SNe, we revised to $H_0 \approx 70$.”
For independent observations, the order doesn’t matter. Observing Yes-Yes-No-Yes gives the same posterior as No-Yes-Yes-Yes. This is a profound property of Bayesian inference: it is path-independent.
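A short sketch makes the batch-equals-sequential claim concrete (the `update` helper is illustrative, not from the lecture):

```python
import numpy as np

thetas = np.array([0.0, 0.3, 0.7, 1.0])
prior = np.array([0.2, 0.2, 0.3, 0.3])

def update(belief, detected):
    # Bernoulli likelihood for one window: theta if detected, 1 - theta if not
    like = thetas if detected else 1 - thetas
    unnorm = belief * like
    return unnorm / unnorm.sum()

# Sequential: Yes, Yes, No, Yes
belief = prior
for obs in [True, True, False, True]:
    belief = update(belief, obs)

# Batch: 3 detections, 1 miss in one step
batch = prior * thetas**3 * (1 - thetas)
batch /= batch.sum()

print(belief)  # matches batch: ~[0, 0.109, 0.891, 0]
```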
4. From Discrete to Continuous: Enter Probability Densities
Discrete updating is great for building intuition, but most parameters in physics and astronomy are continuous: distances, masses, luminosities, redshifts.
4.1 Continuous Parameters
Instead of having $\theta$ take on only discrete values like $\{0.0, 0.3, 0.7, 1.0\}$, we now consider $\theta$ as a continuous variable that can take any value in an interval, such as $[0, 1]$. This shift requires us to move from probability mass to probability density. In the discrete case, $P(\theta)$ represents the probability that $\theta$ equals a specific value. In the continuous case, $p(\theta)$ represents a probability density—which is not a probability itself, but describes the relative likelihood of different values.
Probability of an interval:
$$P(a \leq \theta \leq b) = \int_a^b p(\theta) d\theta$$
Normalization:
$$\int_{-\infty}^{\infty} p(\theta) d\theta = 1$$
Bayes’ theorem (continuous):
$$p(\theta \mid D) = \frac{p(D \mid \theta) \cdot p(\theta)}{p(D)}$$
where:
$$p(D) = \int p(D \mid \theta) \cdot p(\theta) d\theta$$
The challenge with continuous parameters is that this integral is often hard or impossible to compute analytically. We have three main solutions. First, we can use conjugate priors, where we choose priors such that the posterior has the same functional form as the prior, giving us analytical solutions. Second, we can use grid approximation, where we discretize $\theta$ and compute the posterior on a grid. Third, we can use MCMC to sample from the posterior without computing the integral directly—we’ll cover this in later lectures. Today’s focus is on conjugate priors, which are both elegant and insightful.
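As a small illustration of the evidence integral, here is the transit likelihood from Section 2 under a hypothetical uniform prior on $\theta$, integrated numerically; the closed form is the Beta function $B(3, 2) = 1/12$:

```python
from scipy.integrate import quad

# Evidence P(D) = integral of theta^2 * (1 - theta) * 1 over [0, 1]
# (transit likelihood from Section 2 with a uniform prior density of 1)
evidence, err = quad(lambda t: t**2 * (1 - t), 0, 1)
print(evidence)  # 1/12 ~ 0.0833
```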
5. Conjugate Priors: When Math Works Out Beautifully
5.1 Definition
A prior $p(\theta)$ is conjugate to a likelihood $p(D \mid \theta)$ if the posterior $p(\theta \mid D)$ has the same functional form as the prior.
Example:
- Prior: Beta distribution
- Likelihood: Binomial
- Posterior: Beta distribution (with updated parameters)
Why is this magical? Conjugate priors give us analytical solutions without any need for numerical integration. Updates are closed-form and simple—often just adding counts. The parameters have clear interpretations as pseudo-observations, making the prior intuitive to specify. And the computation is efficient, remaining fast even for large datasets.
5.2 Common Conjugate Pairs in Physics and Astronomy
| Likelihood | Parameter | Prior | Posterior | Common Use Case |
|---|---|---|---|---|
| Binomial | $\theta \in [0,1]$ | Beta($\alpha, \beta$) | Beta($\alpha + k, \beta + n - k$) | Detection/non-detection, transit yes/no, source classification |
| Poisson | $\lambda > 0$ | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum y_i, \beta + n$) | Photon counts, GRBs, source counts per field |
| Normal (known $\sigma^2$) | $\mu \in \mathbb{R}$ | Normal($\mu_0, \tau^2$) | Normal($\mu_n, \tau_n^2$) | Combining distance measurements, averaging magnitudes |
| Normal (known $\mu$) | $\sigma^2 > 0$ | Inverse-Gamma($\alpha, \beta$) | Inverse-Gamma($\alpha', \beta'$) | Estimating scatter, error modeling |
We’ll work through the first three in detail.
6. The Beta-Binomial Model: Success/Failure Data
6.1 The Binomial Likelihood
Setup: $n$ independent binary trials (success/failure), each with probability $\theta$.
Data: $k$ successes out of $n$ trials.
Likelihood:
$$P(k \mid \theta, n) = \binom{n}{k} \theta^k (1-\theta)^{n-k}$$
The binomial coefficient $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ counts the number of ways to arrange $k$ successes among $n$ trials.
For Bayesian inference, we can ignore constants that don’t depend on $\theta$:
$$P(k \mid \theta, n) \propto \theta^k (1-\theta)^{n-k}$$
6.2 The Beta Prior
The Beta distribution is a continuous distribution on $[0, 1]$, parameterized by two positive numbers $\alpha, \beta > 0$:
$$p(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1}$$
where $\Gamma(\cdot)$ is the gamma function (generalization of factorial: $\Gamma(n) = (n-1)!$ for positive integers $n$).
Ignore the normalization constant (we’ll add it back when needed):
$$p(\theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}$$
The Beta distribution has several key properties that make it ideal for modeling probabilities. Its support is $\theta \in [0, 1]$, which is perfect for probabilities. The mean is $\mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta}$, the mode (for $\alpha, \beta > 1$) is $\frac{\alpha - 1}{\alpha + \beta - 2}$, and the variance is $\text{Var}[\theta] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$.
The parameters $\alpha$ and $\beta$ can be interpreted as pseudo-observations. Think of $\alpha$ as prior successes, $\beta$ as prior failures, and $\alpha + \beta$ as the total prior “sample size”. This interpretation makes prior specification intuitive.
Several special cases illustrate the Beta distribution’s flexibility. $\text{Beta}(1, 1)$ gives a uniform distribution on $[0, 1]$, representing complete ignorance. $\text{Beta}(2, 2)$ is mildly peaked at 0.5, expressing weak preference for $\theta \approx 0.5$. $\text{Beta}(10, 10)$ is strongly peaked at 0.5, representing confidence that $\theta \approx 0.5$. $\text{Beta}(2, 8)$ has mean $2/(2+8) = 0.2$, encoding an expectation of low success rate. Finally, $\text{Beta}(0.5, 0.5)$ creates a U-shape that concentrates near the extremes 0 and 1—this is the Jeffreys prior for the binomial model.

Figure 2: Grid of Beta distributions with different (α, β) parameters, showing the flexibility of the Beta family.
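The special cases above can be checked directly with `scipy.stats.beta` (a quick sketch):

```python
from scipy.stats import beta

# Mean is alpha / (alpha + beta) for each of the special cases listed above
print(beta(1, 1).mean())   # 0.5: uniform on [0, 1]
print(beta(2, 8).mean())   # 0.2: low expected success rate
print(beta(10, 10).std() < beta(2, 2).std())  # more pseudo-counts -> narrower
```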
6.3 The Beta-Binomial Update (The Magic Happens)
Prior: $\text{Beta}(\alpha, \beta)$ → $p(\theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}$
Likelihood: Binomial with $k$ successes, $n$ trials → $P(k \mid \theta) \propto \theta^k (1-\theta)^{n-k}$
Posterior:
$$p(\theta \mid k, n) \propto P(k \mid \theta) \cdot p(\theta)$$
$$p(\theta \mid k, n) \propto \theta^k (1-\theta)^{n-k} \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1}$$
$$p(\theta \mid k, n) \propto \theta^{(k + \alpha - 1)} (1-\theta)^{(n - k + \beta - 1)}$$
This is another Beta distribution!
$$\theta \mid k, n \sim \text{Beta}(\alpha + k, \beta + n - k)$$
The update rule is beautifully simple. The new $\alpha' = \alpha + k$ (prior successes plus observed successes), and the new $\beta' = \beta + (n - k)$ (prior failures plus observed failures). That’s it. No integrals, no numerical approximations. Just add the counts.

Figure 3: Animation of Beta-Binomial updating showing how the prior evolves into the posterior as data arrives.
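If you want to convince yourself of the conjugacy claim numerically, the sketch below (with illustrative values for $\alpha$, $\beta$, $n$, $k$) compares a brute-force grid posterior against $\text{Beta}(\alpha + k, \beta + n - k)$:

```python
import numpy as np
from scipy.stats import beta, binom

alpha, b0, n, k = 2.0, 2.0, 10, 7   # illustrative prior and data
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

# Grid posterior: prior x likelihood, renormalized numerically
unnorm = beta.pdf(theta, alpha, b0) * binom.pmf(k, n, theta)
grid_post = unnorm / (unnorm.sum() * dtheta)

# Conjugate posterior: Beta(alpha + k, beta + n - k)
conj_post = beta.pdf(theta, alpha + k, b0 + n - k)

# The two curves agree to numerical precision
print(np.max(np.abs(grid_post - conj_post)))
```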
Why the Beta-Binomial is the workhorse for binary data:
The Beta-Binomial pairing is the workhorse for binary data for several reasons. Conjugacy gives us analytical updates with no computation needed. The parameters have a natural interpretation as pseudo-counts. The distribution is flexible enough to encode weak to strong priors. And the intuition is crystal clear: “We observed 15 successes and 5 failures. Our prior was like having seen 2 successes and 2 failures. So the posterior is like having seen 17 successes and 7 failures.”
This is as simple as Bayesian inference gets. And it’s rigorous—the math is exact.
6.4 Astronomy Example: X-ray Source Detection
Background: You’re searching for X-ray emission from candidate sources identified in optical surveys. Not all optical sources have X-ray counterparts.
Question: What fraction $\theta$ of your candidate sources emit X-rays?
Prior: Based on previous surveys, you expect $\theta \approx 0.5$ (about half have X-ray emission), but you’re quite uncertain. Use $\text{Beta}(5, 5)$ (mean = 0.5, moderate uncertainty).
Data: You observe 30 candidates. X-ray detections: 18. Non-detections: 12.
Posterior:
$$\theta \mid \text{data} \sim \text{Beta}(5 + 18, 5 + 12) = \text{Beta}(23, 17)$$
Posterior mean:
$$\mathbb{E}[\theta \mid \text{data}] = \frac{23}{23 + 17} = \frac{23}{40} = 0.575$$
Posterior mode (peak of distribution):
$$\text{Mode} = \frac{23 - 1}{23 + 17 - 2} = \frac{22}{38} \approx 0.579$$
95% credible interval: Use the quantile function for Beta(23, 17):
```python
from scipy.stats import beta

alpha_post, beta_post = 23, 17
lower = beta.ppf(0.025, alpha_post, beta_post)
upper = beta.ppf(0.975, alpha_post, beta_post)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
# Output: [0.424, 0.717]
```

We’re 95% confident that between 42% and 72% of optical sources have X-ray emission. Our best estimate is 58%. It’s instructive to compare different estimates. The data alone (maximum likelihood estimate) gives $18/30 = 0.60$, which ignores the prior. The prior mean is $5/(5+5) = 0.50$. The posterior mean of $23/40 = 0.575$ is a weighted average of prior and data. The posterior is pulled slightly toward the prior (from 0.60 to 0.575), reflecting the prior’s contribution.

Figure 4: Comparison of prior, likelihood, and posterior showing how Bayesian inference combines both sources of information.
7. The Gamma-Poisson Model: Count Data
7.1 The Poisson Likelihood
Setup: Counting events in a fixed interval (time, area, volume). Events occur independently with constant rate $\lambda$.
Data: Observed count $y$.
Likelihood:
$$P(y \mid \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}$$
For Bayesian inference:
$$P(y \mid \lambda) \propto \lambda^y e^{-\lambda}$$
Examples in physics and astronomy:
- Photon counts in a detector
- Number of supernovae per year in a galaxy
- Number of gamma-ray bursts detected
- Number of sources in a survey field
7.2 The Gamma Prior
The Gamma distribution is a continuous distribution on $(0, \infty)$, parameterized by shape $\alpha > 0$ and rate $\beta > 0$:
$$p(\lambda \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta \lambda}$$
Proportional form:
$$p(\lambda) \propto \lambda^{\alpha-1} e^{-\beta \lambda}$$
The Gamma distribution has support $\lambda \in (0, \infty)$, which is perfect for modeling rates. Its mean is $\mathbb{E}[\lambda] = \frac{\alpha}{\beta}$, its mode (for $\alpha \geq 1$) is $\frac{\alpha - 1}{\beta}$, and its variance is $\text{Var}[\lambda] = \frac{\alpha}{\beta^2}$.
The parameters have a natural interpretation: think of $\alpha$ as “prior total counts,” $\beta$ as “prior number of intervals,” and the ratio $\alpha/\beta$ as the prior mean rate. A special case is $\text{Gamma}(1, \beta) = \text{Exponential}(\beta)$, which is memoryless and represents the maximum entropy distribution for positive reals with known mean.
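The $\text{Gamma}(1, \beta) = \text{Exponential}(\beta)$ special case is easy to verify with SciPy (a quick check, not part of the derivation):

```python
import numpy as np
from scipy.stats import gamma, expon

rate = 2.0
x = np.linspace(0.01, 5, 100)

# Gamma(alpha=1, rate=beta) has pdf beta * exp(-beta * x): the Exponential
g = gamma.pdf(x, a=1, scale=1/rate)
e = expon.pdf(x, scale=1/rate)
print(np.allclose(g, e))  # True
```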
7.3 The Gamma-Poisson Update
Prior: $\text{Gamma}(\alpha, \beta)$ → $p(\lambda) \propto \lambda^{\alpha-1} e^{-\beta \lambda}$
Likelihood for $n$ observations $\{y_1, \ldots, y_n\}$:
$$P(\{y_i\} \mid \lambda) = \prod_{i=1}^n \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \propto \lambda^{\sum y_i} e^{-n\lambda}$$
Posterior:
$$p(\lambda \mid \{y_i\}) \propto \lambda^{\sum y_i} e^{-n\lambda} \cdot \lambda^{\alpha-1} e^{-\beta \lambda}$$
$$p(\lambda \mid \{y_i\}) \propto \lambda^{(\alpha + \sum y_i - 1)} e^{-(\beta + n)\lambda}$$
This is another Gamma distribution!
$$\lambda \mid \{y_i\} \sim \text{Gamma}\left(\alpha + \sum y_i, \beta + n\right)$$
The update rule is again simple. The new $\alpha' = \alpha + \sum y_i$ (prior counts plus observed counts), and the new $\beta' = \beta + n$ (prior intervals plus observed intervals).
The posterior mean is:
$$\mathbb{E}[\lambda \mid \{y_i\}] = \frac{\alpha + \sum y_i}{\beta + n}$$
This is a weighted average:
$$\frac{\alpha + \sum y_i}{\beta + n} = \frac{\beta}{\beta + n} \cdot \frac{\alpha}{\beta} + \frac{n}{\beta + n} \cdot \frac{\sum y_i}{n}$$
(Prior mean weighted by prior intervals + data mean weighted by data intervals.)
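The weighted-average identity can be checked numerically; the prior and counts below are illustrative:

```python
import numpy as np

alpha, beta_ = 20.0, 2.0          # illustrative prior: mean rate 10
y = np.array([8, 12, 7, 10, 13])  # illustrative counts in n = 5 intervals
n = len(y)

# Direct posterior mean
post_mean = (alpha + y.sum()) / (beta_ + n)

# Weighted-average form: prior mean and data mean, weighted by intervals
weighted = (beta_ / (beta_ + n)) * (alpha / beta_) + (n / (beta_ + n)) * y.mean()

print(post_mean, weighted)  # identical
```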
7.4 Astronomy Example: Gamma-Ray Burst Rate
Background: You’re estimating the rate $\lambda$ of long-duration gamma-ray bursts (LGRBs) detected by Swift in a certain redshift range.
Prior: Based on previous surveys, you expect $\lambda \approx 10$ LGRBs/year. Use $\text{Gamma}(20, 2)$ (mean = 10, moderate confidence—equivalent to 20 prior counts in 2 prior years).
Data: Over 5 years, Swift detects: Year 1: 8 LGRBs, Year 2: 12, Year 3: 7, Year 4: 10, Year 5: 13. Total: 50 LGRBs in 5 years.
Posterior:
$$\lambda \mid \text{data} \sim \text{Gamma}(20 + 50, 2 + 5) = \text{Gamma}(70, 7)$$
Posterior mean:
$$\mathbb{E}[\lambda \mid \text{data}] = \frac{70}{7} = 10 \text{ LGRBs/year}$$
Posterior standard deviation:
$$\text{SD}[\lambda \mid \text{data}] = \sqrt{\frac{70}{7^2}} = \sqrt{\frac{70}{49}} \approx 1.19 \text{ LGRBs/year}$$
95% credible interval:
```python
from scipy.stats import gamma

alpha_post, beta_post = 70, 7
lower = gamma.ppf(0.025, alpha_post, scale=1/beta_post)
upper = gamma.ppf(0.975, alpha_post, scale=1/beta_post)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
# Output: [7.84, 12.39] LGRBs/year
```

Interpretation: We’re 95% confident the true rate is between 7.8 and 12.4 LGRBs/year. Our best estimate is 10 LGRBs/year.
The posterior mean (10) equals both the prior mean (10) and the data mean (50/5 = 10). This happens because the observed rate matches the prior expectation—the data confirms the prior, increasing confidence by narrowing the posterior distribution.

Figure 5: Animation of Gamma-Poisson updating for GRB rates, showing how the distribution narrows as data accumulates.
When the data contradicts the prior:
In the GRB example, the data happened to match the prior (both suggest λ = 10). But what if the data suggested a different rate? Suppose the 5-year total was 60 LGRBs (12/year average), not 50.
The posterior would be Gamma(20 + 60, 2 + 5) = Gamma(80, 7), with a posterior mean of 80/7 ≈ 11.4 LGRBs/year. Compare this to the prior mean of 10 and the data mean of 12. The posterior (11.4) is pulled toward the data (12), away from the prior (10), but not all the way. The strength of the prior (20 pseudo-counts) tempers the data (60 real counts).
The lesson is that priors matter most when data is weak. With more data, the posterior converges to the data-driven estimate, regardless of prior.
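A short sketch (with the same illustrative prior and an assumed data rate of 12/year) shows the posterior mean converging to the data mean as the years of data grow:

```python
alpha, beta_ = 20.0, 2.0  # prior mean 10 (illustrative, as in the text)
data_rate = 12.0          # observed counts average 12 per year

# Posterior mean after n years of data at 12 counts/year
for n in [1, 5, 50, 500]:
    post_mean = (alpha + data_rate * n) / (beta_ + n)
    print(n, round(post_mean, 3))  # posterior mean approaches the data mean of 12
```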
8. The Normal-Normal Model: Combining Measurements
8.1 The Normal Likelihood (Known Variance)
Setup: We observe $n$ measurements $\{y_1, \ldots, y_n\}$ of some quantity with true value $\mu$. Measurement errors are Gaussian with known variance $\sigma^2$.
Likelihood for one measurement:
$$p(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_i - \mu)^2}{2\sigma^2}\right]$$
For $n$ independent measurements:
$$p(\{y_i\} \mid \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_i - \mu)^2}{2\sigma^2}\right]$$
Proportional form (ignoring constants not depending on $\mu$):
$$p(\{y_i\} \mid \mu) \propto \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2\right]$$
Expanding the sum:
$$\sum (y_i - \mu)^2 = \sum y_i^2 - 2\mu \sum y_i + n\mu^2$$
The $\sum y_i^2$ term doesn’t depend on $\mu$, and $-2\mu \sum y_i + n\mu^2 = n(\mu - \bar{y})^2 - n\bar{y}^2$, where $n\bar{y}^2$ is also constant in $\mu$, so:
$$p(\{y_i\} \mid \mu) \propto \exp\left[-\frac{n}{2\sigma^2}(\mu - \bar{y})^2\right]$$
where $\bar{y} = \frac{1}{n}\sum y_i$ is the sample mean.
This is a Gaussian in $\mu$ with mean $\bar{y}$ and variance $\sigma^2/n$.
8.2 The Normal Prior
Prior on $\mu$:
$$p(\mu \mid \mu_0, \tau^2) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left[-\frac{(\mu - \mu_0)^2}{2\tau^2}\right]$$
Proportional form:
$$p(\mu) \propto \exp\left[-\frac{(\mu - \mu_0)^2}{2\tau^2}\right]$$
8.3 The Normal-Normal Update
Posterior:
$$p(\mu \mid \{y_i\}) \propto p(\{y_i\} \mid \mu) \cdot p(\mu)$$
$$p(\mu \mid \{y_i\}) \propto \exp\left[-\frac{n}{2\sigma^2}(\mu - \bar{y})^2\right] \cdot \exp\left[-\frac{(\mu - \mu_0)^2}{2\tau^2}\right]$$
Combine exponents (sum in exponent):
$$p(\mu \mid \{y_i\}) \propto \exp\left[-\frac{1}{2}\left(\frac{n}{\sigma^2}(\mu - \bar{y})^2 + \frac{1}{\tau^2}(\mu - \mu_0)^2\right)\right]$$
Complete the square (algebra omitted—standard trick):
$$p(\mu \mid \{y_i\}) \propto \exp\left[-\frac{1}{2\tau_n^2}(\mu - \mu_n)^2\right]$$
where:
$$\frac{1}{\tau_n^2} = \frac{1}{\tau^2} + \frac{n}{\sigma^2}$$
$$\mu_n = \frac{\frac{\mu_0}{\tau^2} + \frac{n\bar{y}}{\sigma^2}}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}}$$
This is another Gaussian!
$$\mu \mid \{y_i\} \sim \mathcal{N}(\mu_n, \tau_n^2)$$
The posterior precision (inverse variance) is $1/\tau_n^2 = 1/\tau^2 + n/\sigma^2$. This means that precisions add: prior precision plus data precision. More precise estimates receive more weight in the final answer.
The posterior mean $\mu_n$ is a precision-weighted average of the prior mean $\mu_0$ and the data mean $\bar{y}$:
$$\mu_n = w_{\text{prior}} \cdot \mu_0 + w_{\text{data}} \cdot \bar{y}$$
where weights are:
$$w_{\text{prior}} = \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2}, \quad w_{\text{data}} = \frac{n/\sigma^2}{1/\tau^2 + n/\sigma^2}$$
There are three instructive limiting cases. With a weak prior ($\tau \to \infty$), the posterior approaches the data mean: $\mu_n \to \bar{y}$. With a strong prior ($\tau \to 0$), the posterior stays at the prior mean: $\mu_n \to \mu_0$. With lots of data ($n \to \infty$), the data overwhelms the prior and $\mu_n \to \bar{y}$.
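The update formulas and their limiting cases can be packaged in a small helper (a sketch; `normal_update` and the numbers are illustrative):

```python
import numpy as np

def normal_update(mu0, tau, ybar, sigma, n):
    """Posterior (mean, sd) for mu with a Normal prior and known-sigma data."""
    prec = 1/tau**2 + n/sigma**2                   # precisions add
    mu_n = (mu0/tau**2 + n*ybar/sigma**2) / prec   # precision-weighted average
    return mu_n, np.sqrt(1/prec)

# Limiting cases from above (illustrative numbers)
print(normal_update(0.0, 1e6, 5.0, 1.0, 10))    # weak prior: mean -> ybar = 5
print(normal_update(0.0, 1e-6, 5.0, 1.0, 10))   # strong prior: mean -> mu0 = 0
print(normal_update(0.0, 1.0, 5.0, 1.0, 10**6)) # lots of data: mean -> ybar
```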
8.4 Astronomy Example: Combining Distance Measurements
Background: You’re measuring the distance to a nearby galaxy using three different methods.
Measurements:
- Cepheids: $d_1 = 3.2 \pm 0.4$ Mpc
- TRGB (Tip of the Red Giant Branch): $d_2 = 3.6 \pm 0.6$ Mpc
- SBF (Surface Brightness Fluctuations): $d_3 = 3.4 \pm 0.3$ Mpc
Assume:
- Gaussian errors (stated uncertainties)
- Independent measurements
- Flat prior on distance (or very weak prior)
For simplicity: Treat as three independent Normal likelihoods with a flat (uninformative) prior. The posterior is then just the product of likelihoods.
Precision weighting:
Each measurement contributes with weight $w_i = 1/\sigma_i^2$ (precision):
| Method | Distance $d_i$ | $\sigma_i$ | Precision $w_i = 1/\sigma_i^2$ |
|---|---|---|---|
| Cepheids | 3.2 Mpc | 0.4 Mpc | $1/0.16 = 6.25$ |
| TRGB | 3.6 Mpc | 0.6 Mpc | $1/0.36 \approx 2.78$ |
| SBF | 3.4 Mpc | 0.3 Mpc | $1/0.09 \approx 11.11$ |
Posterior mean (precision-weighted average):
$$d_{\text{post}} = \frac{w_1 d_1 + w_2 d_2 + w_3 d_3}{w_1 + w_2 + w_3}$$
$$d_{\text{post}} = \frac{6.25 \times 3.2 + 2.78 \times 3.6 + 11.11 \times 3.4}{6.25 + 2.78 + 11.11}$$
$$d_{\text{post}} = \frac{20.0 + 10.0 + 37.8}{20.14} = \frac{67.8}{20.14} \approx 3.37 \text{ Mpc}$$
Posterior variance:
$$\frac{1}{\tau_{\text{post}}^2} = w_1 + w_2 + w_3 = 20.14$$
$$\tau_{\text{post}} = \frac{1}{\sqrt{20.14}} \approx 0.223 \text{ Mpc}$$
The final result is $d = 3.37 \pm 0.22$ Mpc (mean ± 1σ).
Several observations are worth noting. The posterior mean (3.37) is pulled most strongly toward the SBF measurement (3.4), which has the smallest error and therefore the highest precision. The combined uncertainty (0.22) is smaller than any individual measurement—combining independent measurements reduces uncertainty. The posterior is not simply the average of the three measurements (that would be 3.4)—precision weighting matters!

Figure 6: Combining distance measurements using Normal-Normal updating with precision weighting.
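The precision-weighted combination above takes only a few lines (a sketch of the same arithmetic):

```python
import numpy as np

# Cepheid, TRGB, SBF distances (Mpc) and 1-sigma errors from the table
d = np.array([3.2, 3.6, 3.4])
sig = np.array([0.4, 0.6, 0.3])

# Precision weighting: w_i = 1 / sigma_i^2
w = 1 / sig**2
d_post = np.sum(w * d) / np.sum(w)
sigma_post = 1 / np.sqrt(np.sum(w))

print(f"d = {d_post:.2f} +/- {sigma_post:.2f} Mpc")  # d = 3.37 +/- 0.22 Mpc
```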
Precision thinking vs. variance thinking:
In Bayesian updating with Gaussians, precisions add (inverse variances):
$$\frac{1}{\tau_{\text{post}}^2} = \frac{1}{\tau_{\text{prior}}^2} + \frac{1}{\sigma_{\text{data}}^2}$$
This is much cleaner than trying to combine variances directly. Always work with precisions when doing Normal-Normal updates. The intuition is that each piece of information contributes precision (information content), and total precision is the sum of individual precisions. This is the Bayesian version of “information adds.”
9. When Conjugacy Isn’t Available: Grid Approximation
Not all likelihood-prior combinations are conjugate. What do we do then? We have three options. First, we can find a conjugate prior that’s close enough—this often works well even if it’s only an approximation. Second, we can use grid approximation, numerically evaluating the posterior on a discrete grid. Third, we can use MCMC or other sampling methods, which we’ll cover in later lectures. Today we’ll focus on grid approximation, which is simple and transparent, though limited to low dimensions.
9.1 The Grid Approximation Algorithm
The grid approximation algorithm follows five simple steps. First, define a grid of parameter values $\{\theta_1, \theta_2, \ldots, \theta_M\}$. Second, evaluate the prior at each grid point to get $p(\theta_i)$. Third, evaluate the likelihood at each grid point to get $p(D \mid \theta_i)$. Fourth, compute the unnormalized posterior $p(\theta_i) \cdot p(D \mid \theta_i)$ for each $i$. Finally, normalize by dividing by the sum $\sum_i p(\theta_i) \cdot p(D \mid \theta_i) \cdot \Delta\theta$, where $\Delta\theta$ is the grid spacing for continuous grids. The result is a discrete approximation to $p(\theta \mid D)$.
Grid approximation has several advantages. It’s simple to code, transparent (you see every step), and works for any likelihood-prior combination. However, it suffers from the curse of dimensionality. With $k$ parameters and $M$ grid points each, you need $M^k$ evaluations, which grows exponentially. For 1D problems with 100 points, you need only 100 evaluations (easy). For 2D, you need 100×100 = 10,000 evaluations (still fine). But for 5D, you’d need 100^5 = 10 billion evaluations (impossible on a laptop). Grid approximation also wastes computation on low-posterior regions.
Use grid approximation for problems with 1-2 parameters, non-conjugate or complex likelihoods, and exploratory analysis.
9.2 Example: Stellar Temperature from Color Index
Background: You observe a star’s $B - V$ color index and want to infer its effective temperature $T_{\text{eff}}$.
Model: Empirical color-temperature relation (simplified):
$$B - V \approx 0.92 - 0.00016 \times (T_{\text{eff}} - 4000) \quad \text{(approximation for main-sequence stars)}$$
Data: $(B - V)_{\text{obs}} = 0.65 \pm 0.05$
Parameter: $T_{\text{eff}} \in [4000, 8000]$ K (main-sequence range)
Prior: Uniform on $T_{\text{eff}}$ (no strong preference)
$$p(T_{\text{eff}}) = \frac{1}{8000 - 4000} = \frac{1}{4000} \quad \text{for } T_{\text{eff}} \in [4000, 8000]$$
Likelihood: Gaussian measurement error on $B - V$:
$$(B - V)_{\text{obs}} \mid T_{\text{eff}} \sim \mathcal{N}\left(0.92 - 0.00016 \times (T_{\text{eff}} - 4000),\; 0.05^2\right)$$
This is non-conjugate! (Gaussian likelihood with a nonlinear function of the parameter, and uniform prior—no simple closed form for posterior.)
Solution: Grid approximation.
Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.integrate import trapezoid, cumulative_trapezoid
from scipy.interpolate import interp1d
# Grid
T_grid = np.linspace(4000, 8000, 1001)
# Prior (uniform)
prior = np.ones_like(T_grid) / (8000 - 4000)
# Observed B-V
BV_obs = 0.65
sigma_BV = 0.05
# Model: B-V as a function of T_eff
BV_model = 0.92 - 0.00016 * (T_grid - 4000)
# Likelihood: Gaussian in B-V
likelihood = norm.pdf(BV_obs, loc=BV_model, scale=sigma_BV)
# Posterior (unnormalized)
posterior_unnorm = prior * likelihood
# Normalize (integrate using the trapezoidal rule)
evidence = trapezoid(posterior_unnorm, T_grid)
posterior = posterior_unnorm / evidence
# Posterior mode
T_mode = T_grid[np.argmax(posterior)]
print(f"Posterior mode: {T_mode:.0f} K")
# Posterior mean
T_mean = trapezoid(T_grid * posterior, T_grid)
print(f"Posterior mean: {T_mean:.1f} K")
# 95% credible interval (invert the numerical CDF)
cdf = cumulative_trapezoid(posterior, T_grid, initial=0)
quantile_func = interp1d(cdf, T_grid, fill_value="extrapolate")
T_lower = float(quantile_func(0.025))
T_upper = float(quantile_func(0.975))
print(f"95% CI: [{T_lower:.0f}, {T_upper:.0f}] K")
# Plot
plt.figure(figsize=(10, 4))
plt.plot(T_grid, posterior, 'b-', lw=2, label='Posterior')
plt.axvline(T_mode, color='r', linestyle='--', lw=1.5, label=f'Mode = {T_mode:.0f} K')
plt.axvline(T_lower, color='gray', linestyle=':', lw=1, label=f'95% CI [{T_lower:.0f}, {T_upper:.0f}]')
plt.axvline(T_upper, color='gray', linestyle=':', lw=1)
plt.xlabel('T_eff (K)')
plt.ylabel('Posterior density')
plt.title('Posterior for Effective Temperature from B-V Color')
plt.legend()
plt.show()
Output:
Posterior mode: 5688 K
Posterior mean: 5687.5 K
95% CI: [5075, 6300] K
The star's effective temperature is most likely around 5690 K (consistent with a G-type main-sequence star like the Sun!). We're 95% confident it's between about 5075 and 6300 K.

Figure 7: Grid approximation for stellar temperature, showing the discrete numerical evaluation on a grid of parameter values.
Grid approximation limitations:
The curse of dimensionality means that for $k$ parameters, grid size grows as $M^k$. Consider stellar parameters $(T_{\text{eff}}, \log g, [\text{Fe/H}])$ as an example—three dimensions. With 100 points per dimension, you need $100^3 = 1$ million evaluations (doable). With 200 points, you need $200^3 = 8$ million (slow). For 10 parameters at 100 points each, you’d need $100^{10} = 10^{20}$ evaluations (impossible!)
The solution for high dimensions is Markov Chain Monte Carlo (MCMC), which we’ll cover from Lecture 7 onward. MCMC explores the posterior adaptively, spending time where posterior mass is concentrated and avoiding wasted computation in low-probability regions.
Grid approximation is sufficient for problems with 1-2 dimensions, non-standard likelihoods, quick exploratory analysis, and pedagogical examples.
10. Prior Elicitation: Where Do Priors Come From?
One of the most common questions about Bayesian inference: “How do I choose a prior?”
10.1 Types of Priors
Priors fall into three broad categories. Informative priors incorporate substantial domain knowledge. For example, you might use $\text{Beta}(80, 20)$ for quasar detection efficiency based on extensive simulations and previous surveys. Use informative priors when you have strong prior information from previous experiments, physical theory, or expert consensus.
Weakly informative priors gently regularize without imposing strong beliefs. For instance, you might use $\text{Normal}(0, 10)$ for a regression slope when you expect effects to be modest but aren’t sure of their magnitude. Weakly informative priors are the default for most scientific applications. They rule out absurdities (negative distances, velocities faster than light) while staying modest.
Uninformative or flat priors assign equal probability everywhere, or over a wide range. An example is $\text{Uniform}(0, 1)$ for an unknown probability. However, these can be dangerous—they may be improper (infinite integral) or implicitly strong (giving equal weight to absurd and reasonable values). Use them sparingly, as “uninformative” priors are often more informative than you think.
10.2 Methods for Prior Elicitation
The first method relies on domain expertise. Ask yourself (or a domain expert) three key questions: “What’s a typical value for this parameter?” “What range would be surprising but not impossible?” and “What values are physically impossible?” Then translate these answers to a distribution. The typical value becomes the mean or mode. The plausible range determines the standard deviation or quantiles. Impossible values define the support (such as $\theta \geq 0$ for a rate parameter).
For example, consider galaxy stellar mass. Expert knowledge tells us that most galaxies have $\log(M_\star/M_\odot) \in [9, 11.5]$, with typical values around 10.5 (Milky Way-like). This translates to a prior $\log(M_\star/M_\odot) \sim \mathcal{N}(10.5, 1^2)$, where the mean of 10.5 represents typical galaxies and the standard deviation of 1 dex covers the range from about 8.5 to 12.5 at ±2σ, which is reasonable.
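A quick sanity check that this prior covers the intended range (using SciPy, with the numbers above):

```python
from scipy.stats import norm

prior = norm(loc=10.5, scale=1.0)        # prior on log10(M*/Msun)
lo, hi = prior.interval(0.95)            # central 95% interval
print(lo, hi)                            # about (8.54, 12.46)
# How much prior mass falls in the quoted range [9, 11.5]?
print(prior.cdf(11.5) - prior.cdf(9.0))  # about 0.77
```

Roughly three quarters of the prior mass sits in the "most galaxies" range, with the rest allowing for dwarfs and giants.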
The second method uses previous studies through empirical Bayes. You take posteriors from previous analyses as priors for new analyses. For instance, if a previous study (SH0ES 2019) found $H_0 = 74.0 \pm 1.4$ km/s/Mpc, you could use $H_0 \sim \mathcal{N}(74.0, 1.4^2)$ as your prior for a new analysis. Use this approach with caution—it assumes the previous analysis was unbiased and errors were correctly estimated.
The third method uses maximum entropy to choose the least informative prior given constraints. For example, with no constraints, a uniform distribution has maximum entropy on a finite interval. With known mean and support on $[0, \infty)$, the exponential distribution is maximum entropy. With known mean and variance on $(-\infty, \infty)$, the Gaussian is maximum entropy. This is a principled approach, though it requires specifying constraints, which is itself a choice.
The fourth method uses prior predictive simulation. Simulate data from your prior and check if the simulations are plausible. For example, suppose you’re considering a prior on SN absolute magnitude $M \sim \mathcal{N}(-19, 2^2)$. For a prior predictive check, simulate 1000 SNe and compute apparent magnitudes at various redshifts. Do the simulated magnitudes cover the range you expect? If you’re frequently getting $m > 30$ (too faint to detect) or $m < 10$ (brighter than Venus), your prior is too broad. You might narrow it to $M \sim \mathcal{N}(-19, 0.5^2)$ and re-check.
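A prior predictive check for the broad SN magnitude prior might look like the sketch below. The survey redshift range and $H_0 = 70$ km/s/Mpc are assumptions for illustration, and low-redshift Hubble-law distances $d = cz/H_0$ are used:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Candidate (deliberately broad) prior on SN Ia absolute magnitude
M = rng.normal(-19.0, 2.0, size=n)
# Assumed survey redshifts; low-z Hubble-law distances d = cz / H0
z = rng.uniform(0.01, 0.1, size=n)
d_pc = (3e5 * z / 70.0) * 1e6          # c in km/s, H0 = 70 km/s/Mpc, d in pc
m = M + 5 * np.log10(d_pc / 10.0)      # distance modulus -> apparent magnitude
# Flag implausible draws: much brighter than m = 10 or fainter than m = 24
print("too bright:", np.mean(m < 10), " too faint:", np.mean(m > 24))
```

If the flagged fractions are large, tighten the prior (e.g., to $\mathcal{N}(-19, 0.5^2)$) and rerun the check.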
10.3 Prior Sensitivity Analysis
The best practice is to try multiple reasonable priors and see how the posterior changes. Follow this procedure: First, choose 3-5 different priors spanning weak to strong, optimistic to pessimistic. Second, compute the posterior for each prior using the same data. Third, compare the posteriors. If they’re similar, the data dominates and prior choice doesn’t matter much—you have a robust conclusion. If they’re very different, the prior matters and your conclusion is sensitive to prior choice—be transparent and report the range of results.
Example: GRB rate with different priors
Data: 50 GRBs in 5 years (rate = 10/year).
Priors:
- Weak: $\text{Gamma}(2, 1)$ (mean = 2, very diffuse—almost no prior information)
- Medium: $\text{Gamma}(20, 2)$ (mean = 10, moderate confidence)
- Strong: $\text{Gamma}(50, 5)$ (mean = 10, strong confidence)
The posteriors are: Weak prior gives $\text{Gamma}(52, 6)$ with mean = 8.67 and SD = 1.20. Medium prior gives $\text{Gamma}(70, 7)$ with mean = 10.0 and SD = 1.20. Strong prior gives $\text{Gamma}(100, 10)$ with mean = 10.0 and SD = 1.00.
Several observations emerge. The weak prior pulls the posterior below the data mean because the prior was pessimistic with mean = 2. The medium prior gives a posterior matching the data, showing the prior was well-calibrated. The strong prior gives a narrow posterior where high prior confidence dominates. With 50 observations, the prior still has noticeable influence, especially if strong. For a robust estimate, prefer the medium prior—informed but not dominating.
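These updates are one-liners under Gamma-Poisson conjugacy: add the 50 counts to the shape and the 5 years of exposure to the rate (note that SciPy parameterizes Gamma by scale = 1/rate):

```python
from scipy.stats import gamma

counts, years = 50, 5
priors = {"weak": (2, 1), "medium": (20, 2), "strong": (50, 5)}
for name, (a0, b0) in priors.items():
    a, b = a0 + counts, b0 + years      # conjugate Gamma-Poisson update
    post = gamma(a, scale=1.0 / b)
    print(f"{name}: Gamma({a}, {b}), mean = {post.mean():.2f}, SD = {post.std():.2f}")
```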
Reporting priors in scientific papers:
When reporting Bayesian analyses, do state your priors explicitly with functional form and parameters. Justify priors with domain knowledge or previous studies. Perform prior sensitivity analysis, at least in supplementary material. Show prior predictive distributions, especially for complex models.
Don’t use priors without justification like “we chose Beta(1,1) because it’s uniform.” Don’t hide priors in the methods section with no discussion. Don’t claim priors are “uninformative” without demonstrating it via prior predictive checks. Don’t use very strong priors unless you can defend them robustly.
Transparency is key. Bayesian inference makes assumptions explicit—embrace this, don’t hide it.
11. Credible Intervals: The Bayesian Confidence Interval
11.1 Definition
A 95% credible interval for parameter $\theta$ is an interval $[L, U]$ such that:
$$P(L \leq \theta \leq U \mid D) = 0.95$$
Interpretation: “Given the data, there’s a 95% probability that $\theta$ lies in $[L, U]$.”
This is a probability statement about the parameter $\theta$, conditioned on the observed data.
11.2 Contrast with Frequentist Confidence Intervals
Frequentist 95% confidence interval: An interval $[L, U]$ constructed such that, if we repeated the experiment infinitely many times, 95% of the intervals would contain the true $\theta$.
The key difference lies in what the probability statement refers to. A frequentist confidence interval is a probability statement about the procedure (long-run frequency over hypothetical repetitions). It means “95% of such intervals will contain the true value.” You cannot say “there’s a 95% probability $\theta$ is in this interval” because in the frequentist framework, $\theta$ is fixed, not random—either it’s in the interval or it isn’t.
A Bayesian credible interval is a probability statement about the parameter given the data you actually observed. It means “there’s a 95% probability $\theta$ is in this interval.” This is the interpretation most people naturally want!
For simple problems with weak priors, credible intervals and confidence intervals are often numerically similar. But their interpretations are fundamentally different.

Figure 8: Illustrating the conceptual difference between Bayesian credible intervals and frequentist confidence intervals.
Common misinterpretation of frequentist CIs:
The wrong interpretation is “There’s a 95% probability the true value is in this interval.” The right interpretation is “If we repeated this procedure many times, 95% of the resulting intervals would contain the true value.”
Most people want the Bayesian interpretation—they care about this data and this parameter estimate, not hypothetical repetitions. Bayesian credible intervals provide exactly that interpretation.
In practice, for simple models with Gaussian errors and large samples, Bayesian credible intervals and frequentist confidence intervals are numerically very close. But if you want the intuitive interpretation, use Bayesian inference.
11.3 Types of Credible Intervals
The most common type is the equal-tailed interval, where you place 2.5% in each tail:
$$P(\theta < L \mid D) = 0.025, \quad P(\theta > U \mid D) = 0.025$$
Code (for Beta posterior):
from scipy.stats import beta
alpha, beta_param = 23, 17
L = beta.ppf(0.025, alpha, beta_param)
U = beta.ppf(0.975, alpha, beta_param)
The alternative is the Highest Posterior Density (HPD) interval, which is the shortest interval containing 95% of the posterior mass. For symmetric unimodal distributions like the Gaussian, HPD equals the equal-tailed interval. For skewed distributions, HPD can be shorter and more informative.
Consider a posterior $\text{Beta}(2, 8)$, which is right-skewed (its long tail points toward larger values). The equal-tailed 95% CI is approximately [0.03, 0.48], while the HPD 95% CI is approximately [0.01, 0.43], slightly shorter and shifted toward the peak at 0.125.
In practice, equal-tailed intervals are simpler and standard. HPD is more efficient (shorter) for skewed distributions but requires numerical optimization.
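A brute-force comparison for the Beta(2, 8) example (the scan-over-quantiles trick works for any unimodal posterior):

```python
import numpy as np
from scipy.stats import beta

post = beta(2, 8)                           # right-skewed posterior
et = post.ppf([0.025, 0.975])               # equal-tailed 95% interval
# HPD: among all intervals containing 95% mass, take the shortest
p = np.linspace(0, 0.05, 5001)
lo, hi = post.ppf(p), post.ppf(p + 0.95)
i = np.argmin(hi - lo)
print("equal-tailed:", np.round(et, 3))
print("HPD:         ", round(lo[i], 3), round(hi[i], 3))
```

The HPD interval comes out shorter and shifted toward the mode, as claimed above.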
12. Sequential Learning Example: Monitoring a Supernova Rate
Let’s tie everything together with a realistic astronomy example involving sequential updating.
12.1 Scenario
You’re estimating the rate of Type Ia supernovae in a particular galaxy cluster. You monitor the cluster over multiple years, observing how many SNe occur each year.
Parameter: $\lambda$ = true SN rate (SNe per year).
Prior (Year 0): Based on cluster stellar mass and SN rates in similar clusters, you expect $\lambda \approx 3$ SNe/year. Use $\text{Gamma}(12, 4)$ (mean = 3, moderate confidence—equivalent to 12 prior SNe over 4 prior years).
Data (sequential observations):
- Year 1: 2 SNe observed
- Year 2: 5 SNe observed
- Year 3: 3 SNe observed
- Year 4: 4 SNe observed
Goal: Update the posterior after each year, watching beliefs evolve.
12.2 Updates
After Year 1:
- Prior: $\text{Gamma}(12, 4)$
- Data: 2 SNe
- Posterior: $\text{Gamma}(12 + 2, 4 + 1) = \text{Gamma}(14, 5)$
- Posterior mean: $14/5 = 2.8$ SNe/year
- 95% CI: [1.5, 4.4] SNe/year
After Year 2:
- Prior: $\text{Gamma}(14, 5)$
- Data: 5 SNe
- Posterior: $\text{Gamma}(14 + 5, 5 + 1) = \text{Gamma}(19, 6)$
- Posterior mean: $19/6 \approx 3.17$ SNe/year
- 95% CI: [1.9, 4.7] SNe/year
After Year 3:
- Prior: $\text{Gamma}(19, 6)$
- Data: 3 SNe
- Posterior: $\text{Gamma}(19 + 3, 6 + 1) = \text{Gamma}(22, 7)$
- Posterior mean: $22/7 \approx 3.14$ SNe/year
- 95% CI: [2.0, 4.5] SNe/year
After Year 4:
- Prior: $\text{Gamma}(22, 7)$
- Data: 4 SNe
- Posterior: $\text{Gamma}(22 + 4, 7 + 1) = \text{Gamma}(26, 8)$
- Posterior mean: $26/8 = 3.25$ SNe/year
- 95% CI: [2.1, 4.6] SNe/year
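The whole four-year sequence is a few lines of parameter arithmetic (again using SciPy's scale = 1/rate convention for the Gamma):

```python
from scipy.stats import gamma

a, b = 12, 4                       # prior: ~12 SNe over 4 prior years
for year, k in enumerate([2, 5, 3, 4], start=1):
    a, b = a + k, b + 1            # yesterday's posterior -> today's prior
    post = gamma(a, scale=1.0 / b)
    lo, hi = post.interval(0.95)
    print(f"Year {year}: Gamma({a}, {b}), mean = {post.mean():.2f}, "
          f"95% CI = [{lo:.1f}, {hi:.1f}]")
```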
12.3 Observations
The evolution of beliefs shows how Bayesian updating concentrates our knowledge over time. At Year 0, the prior has mean = 3.0 and SD = $\sqrt{12}/4 = 0.87$. By Year 4, the final posterior has mean = 3.25 and SD = $\sqrt{26}/8 \approx 0.64$.
What happened during this process? The mean shifted slightly from 3.0 to 3.25 as data arrived, with the total observed rate being 14 SNe over 4 years, or 3.5 SNe/year. Uncertainty decreased from SD = 0.87 to 0.64 as more data accumulated. Credible intervals narrowed, reflecting increased confidence. With more years of data, the posterior will continue to concentrate around the true rate, with uncertainty shrinking as $1/\sqrt{n}$, which is typical for well-behaved likelihoods.
The power of sequential updating:
Sequential updating enables real-time learning where you don’t need to wait for all data—just update as observations come in. If the posterior becomes sufficiently narrow, you can stop collecting data using Bayesian sequential design principles. The approach is adaptive: if early data contradicts the prior, the posterior adjusts quickly. It’s also transparent, allowing stakeholders to see beliefs evolving over time.
For decision-making, consider what would happen if the cluster rate were much higher than expected—say, 10 SNe in Year 1. The posterior would shift dramatically, signaling that our prior was wrong or something unusual is happening, which would be worth investigating.
13. Summary
We’ve covered a lot of ground in this lecture. Let’s consolidate the key ideas.
Bayesian updating is fundamentally mechanical: Posterior ∝ Likelihood × Prior. For discrete or gridded problems, it’s just arithmetic—multiply and normalize. For conjugate pairs, it’s even simpler: just update parameters.
Conjugate priors are elegant and powerful. Beta-Binomial handles binary outcomes like detections, transits, and classifications. Gamma-Poisson handles count data including photons, events, and sources per field. Normal-Normal combines measurements with Gaussian errors. All offer analytical solutions with intuitive interpretations as pseudo-observations and efficient computation.
Sequential updating lets yesterday’s posterior become today’s prior. Order doesn’t matter for independent observations, and beliefs evolve incrementally with new data.
Priors matter most when data is weak. With strong data, the posterior converges to the data-driven estimate regardless of prior. Always perform sensitivity analysis by trying multiple reasonable priors.
Credible intervals provide the intuitive interpretation people want: “95% probability the parameter is in this interval, given the data.” This is what Bayesian inference naturally provides.
Grid approximation works well for problems with 1-2 dimensions when conjugacy isn’t available. Beyond that, we need MCMC, which we’ll cover in later lectures.
By working through this lecture, you should now be able to perform discrete Bayesian updates by hand, taking prior through likelihood to posterior. You can apply the conjugate updates for Beta-Binomial, Gamma-Poisson, and Normal-Normal models. You understand how to interpret prior parameters as pseudo-observations and pseudo-counts. You know how to combine multiple measurements using precision weighting. You can perform prior sensitivity analysis to check robustness. You can compute and interpret credible intervals correctly. You understand sequential updating where beliefs evolve as new data arrives. And you can use grid approximation for non-conjugate problems when working in low dimensions.
The mathematics of Bayesian updating is now familiar. Next, we learn the art of building good models.
Or the Fully Marginalized Likelihood (FML) as David W. Hogg (Flatiron) puts it, because that is how you will feel when you have to compute it for real problems. ↩︎