Why This Course?
You’ve almost graduated with a degree in physics or astronomy.
You can solve differential equations, claim to understand quantum mechanics, and perhaps derive the Friedmann equations. So why do you need a course on data analysis?
Most of what you will do as a researcher is solve inverse problems – inferring unknowns from noisy, incomplete data. And most physics curricula don’t teach you how to do this rigorously.
You’ve probably done one or more of the following:
- Fit lines to data with `np.polyfit` or `curve_fit`
- Calculated error bars by “propagating uncertainties”
- Looked at a p-value and never really admitted to yourself that you don’t understand what it means
- Been confused about when to use $\chi^2$ or likelihoods
- Wondered what this Bayesian religion is that people keep talking about
This course will change how you think about data. And it will make you rich. Rich beyond your wildest dreams! Either because you’ll use what you learn in this class in your research (now or in the future), or because you’ll use what you learn here when you go into industry. Or in the worst possible scenario, you’ll be rich with knowledge. The point is: don’t feel upset if you have no plans to use rigorous data analysis methods in the future. This course will still be extremely valuable to you.
Generative Models
In the physical sciences, the right way to think about data is to think about how they were generated. That is to say: what set of inputs gave rise to the measured values. Here is a more formal definition:
A generative model is a parametrised, quantitative description of a statistical procedure that could reasonably have generated the data set you have. 1
Now consider that you have been given some data set consisting of $x$ and $y$ values. You are asked to provide a model that describes these data. Each data point has uncertainties in the $x$- and $y$-directions, and perhaps there is even some covariance between the uncertainties in the two directions.

We will assume that the data can be modelled by a straight line, which is the simplest model we can think of. We can imagine something more exotic later, but it is always the right idea to start simple. Note that we almost never know if the model we have adopted is the true model or not, but incorrect models are still useful!
Here our model – or more specifically, our generative model – is one that can describe how the measured data points were generated. When you have a generative model there is very little arbitrariness: if a choice of model or method is inconsistent with the data, that will become apparent. You can imagine a generative model as one where, if you “turned the crank”, it would generate more data points for you (perhaps given some inputs $x$).
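To make this concrete, here is a minimal sketch of such a “crank-turning” model in Python, assuming a straight line $y = mx + b$ with Gaussian uncertainties in $y$ only; the parameter values and noise levels below are arbitrary placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up "true" parameters, chosen purely for illustration
m_true, b_true = 2.0, 1.0

def generate_data(x, m, b, sigma_y, rng):
    """Turn the crank: given inputs x and parameters (m, b), produce
    noisy measurements y with known Gaussian uncertainties sigma_y."""
    y_true = m * x + b                  # noise-free model prediction
    return rng.normal(y_true, sigma_y)  # add Gaussian measurement noise

x = np.sort(rng.uniform(0, 10, size=20))      # inputs
sigma_y = rng.uniform(0.5, 1.5, size=x.size)  # per-point uncertainties
y = generate_data(x, m_true, b_true, sigma_y, rng)
```

Running this function again (with the same $x$ or new ones) produces another plausible data set, which is exactly what we mean by a model that could have generated the data you have.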
Now, you might think that if there is a problem in physics or astronomy that is so simple that it can be fit with just a straight line, then it cannot be an interesting problem.
The Hubble Constant
The Hubble Constant describes the current rate of the Universe’s expansion. It is usually expressed in units of kilometers per second per megaparsec. If you want to measure the expansion rate of the Universe today, the recipe seems relatively straightforward (a naive version is sketched in code after this list):
- Measure distances to galaxies: $d_i \pm \sigma_{d,i}$
- Measure recession velocities: $v_i \pm \sigma_{v,i}$
- Fit: $v = H_0 \cdot d$
- Report: $H_0 \pm \sigma_{H_0}$
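Here is a minimal sketch of that naive recipe, assuming Gaussian velocity uncertainties only and using fabricated example numbers (these are not real measurements):

```python
import numpy as np

# Made-up example data: distances (Mpc), velocities (km/s), velocity errors
d = np.array([10., 25., 40., 60., 85.])
v = np.array([720., 1800., 2750., 4300., 6000.])
sigma_v = np.array([150., 200., 250., 300., 400.])

# Weighted least squares for v = H0 * d (a straight line through the origin)
w = 1.0 / sigma_v**2
H0_hat = np.sum(w * d * v) / np.sum(w * d**2)
H0_err = np.sqrt(1.0 / np.sum(w * d**2))

print(f"H0 = {H0_hat:.1f} +/- {H0_err:.1f} km/s/Mpc")
```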
But reality is messy:
- Distance measurements have asymmetric errors (not Gaussian!)
- Unknown systematic uncertainties (Cepheid calibration, SN Ia standardization)
- Selection effects: You can only measure galaxies above a brightness threshold
- Outliers: Some galaxies have peculiar velocities
- Model uncertainty: Is the relation exactly linear? Should we include a redshift correction?
- Prior information: Previous measurements (Planck, HST) disagree—how do we combine them?
Traditional approaches struggle:
- χ² minimization assumes Gaussian errors (not true)
- Error propagation formulas assume linearity (not true for systematics)
- p-values don’t tell you “How confident should I be in $H_0 = 70$ km/s/Mpc?”
- No principled way to incorporate prior knowledge or constraints
Bayesian inference handles all of this naturally. We will introduce Bayesian probability in the next lesson, but if you haven’t heard of it already, it is simply a perspective on how to reason with uncertain, noisy data.

What You’ll Learn to Do
By the end of this course, you’ll be able to:
1. Fit Complex Models to Messy Data
- Handle non-Gaussian errors, outliers, censoring, missing data
- Fit hierarchical models (e.g., galaxy populations with group-level properties)
- Incorporate systematic uncertainties properly
2. Quantify Uncertainty Correctly
- Understand what credible intervals actually mean
- Propagate uncertainties through complex calculations
- Distinguish between measurement error and model uncertainty
3. Make Predictions
- Forecast future observations and make decisions
- Generate posterior predictive distributions
4. Compare Models
- Is a power-law or exponential better for this galaxy profile?
- Does adding another parameter improve the model, or is it just overfitting?
- Use information criteria and Bayes factors
5. Think Causally
- Avoid spurious correlations (selection effects, confounders)
- Understand when you can infer causation rather than mere association
- Design analyses that answer scientific questions (not just “what correlates with what?”)
6. Validate Your Results
- Prior predictive checks: Does my model make sense before seeing data?
- Posterior predictive checks: Does my fitted model reproduce the data?
- Sensitivity analysis: How much do my conclusions depend on assumptions?
What Makes This Course Different?
The philosophy of this course is that you’ll learn to think like a Bayesian, not just apply Bayesian formulas.
Many stats courses teach:
- Formulas (t-tests, ANOVA, regression)
- p-values and null hypothesis testing
- “Cookbook” approaches
This course teaches:
- Generative modeling: Build a model of how your data were generated
- Probability as a tool for inference: What do I believe given what I observed?
- Computational methods: MCMC, variational inference
- Model checking: Does my model make sense? How do I know?
- Causal thinking: What questions can I actually answer with my data?
Why this course matters:
- Astronomy is fundamentally about inference from noisy, incomplete data
- Traditional methods (χ², p-values) are often inadequate for real problems
- Bayesian inference provides a principled, flexible framework
- You’ll use these tools in your research career
What you’ll gain:
- Ability to fit complex models to messy data
- Proper uncertainty quantification
- Skills to validate and criticize your models
- Computational tools (MCMC, VI) used across science
The Workflow You’ll Master
For every data analysis problem, you should start with the simplest model you can possibly think of. For example, you might know that the data have uncertainties in the $x$-direction, but simply ignore them in your first model. Ignore everything complicated and just start simple. You’ll know you’re getting the hang of data analysis when you find yourself feeling slightly gross about using such a simple model for the problem, but that is the place to start. Once you have a simple model that you know works, you can add complexity and compare your more complex models to your simple baseline. This is an extraordinary superpower for building complex models and understanding where and why things went wrong.
If you start with the most complex model, you will likely find that it doesn’t work, and it takes a long time to run! When something takes a long time to run and there are a dozen different things that could be causing the problem, you are hosed. Start simple and add complexity one bit at a time. I cannot possibly stress this enough.
Each time you are building a model, you should be following this cycle:
- Ask a scientific question: causal or predictive
- Build a generative model: how were the data produced?
- Encode domain knowledge: priors, constraints
- Check the model before fitting: prior predictive simulation (sketched below)
- Fit the model: analytical when possible, optimization, VI, MCMC
- Diagnose convergence: did the fitting fail? did the sampler work?
- Check the fit: posterior predictive—does it reproduce the data?
- Interpret & communicate: what did we learn?
- Iterate: revise model and compare
This workflow prevents most common mistakes. The rubric of your project-based work is structured to require this workflow, so you should practice it as early and often as you can.
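As a concrete example of the “check the model before fitting” step, here is a minimal prior predictive simulation for the straight-line model from earlier; the priors below are arbitrary placeholders, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)   # the inputs you actually observed

# Draw parameters from the prior, then "turn the crank" to simulate data sets
n_draws = 200
m_prior = rng.normal(0.0, 5.0, size=n_draws)   # placeholder prior on slope
b_prior = rng.normal(0.0, 10.0, size=n_draws)  # placeholder prior on intercept
sigma = 1.0                                    # assumed known noise level

y_sim = rng.normal(m_prior[:, None] * x + b_prior[:, None], sigma)

# Eyeball check: do the simulated data sets span physically sensible values?
print("5th-95th percentile of simulated y:", np.percentile(y_sim, [5, 95]))
```

If the simulated data sets look absurd (wrong sign, wrong order of magnitude), your priors or your model need revisiting before you ever touch the real data.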
Prerequisites and Expectations
What you need:
- Comfort with calculus and linear algebra
- Basic programming (e.g., Python)
- Willingness to think probabilistically
What you don’t need:
- Prior statistics knowledge (we assume none!)
- Machine learning background
- Measure theory or advanced probability (nice to have, not required)
Course structure:
- Lectures: Concepts, theory, worked examples
- Hands-on coding: You’ll implement methods from scratch
- Projects: Apply to real astronomy problems
A Glimpse of What’s Coming
Week 1: Foundations
- Motivation
- Probability as a tool for reasoning under uncertainty
- Bayesian updating in practice: conjugate priors, sequential learning
Week 2: Workflow & Model Building
- Prior predictive checks and model validation
- Posterior inference: MCMC fundamentals
- Linear models and why everything is a linear model
Week 3: Fitting models to data
- Fitting a line to data, the simple way
- Fitting a line to data, the correct way
- Optimisation
Week 4: Inference
- Fundamentals of Markov-chain Monte Carlo
- Markov-chain Monte Carlo in practice
- Model comparison and selection
Week 5: Advanced Models
- Gaussian Processes I
- Gaussian Processes II
- Hierarchical models
Week 6: Missing Data and Latent Variables
- How missing data and selection effects ruin everything
- Latent variable models
- Time series analysis
By the end: You’ll be able to tackle real research problems with confidence.
Assessments
The assessment structure for this unit is:
Problem Set 1
- Released Friday of Week 1
- Due Friday of Week 2
- 20% of final grade
Problem Set 2
- Released Friday of Week 2
- Due Friday of Week 3
- 20% of final grade
Problem Set 3
- Released Friday of Week 3
- Due Friday of Week 4
- 20% of final grade
Project
- Released Friday of Week 3
- Due Thursday of Week 6
- 40% of final grade
Approach to generative artificial intelligence (AI) and large language models (LLMs)
You are here to learn. LLMs can be useful for searching topics, summarising information, and clarifying understanding. But supplying your problem to an LLM and directly copying the output into your assessment does not help you learn. It builds a cognitive dissonance between you and the problem, and gives you a false feeling that ‘you did stuff’ or ‘you know how to do things’. It can sometimes be difficult to understand where the grey line sits between ‘never use the LLM’ and ‘always use the LLM’. Here are some guiding principles for this class:
- If you used the LLM to help you with a problem, and you could not later solve that problem yourself without calling on an LLM again, you have used it too much.
- You should consider treating LLMs like a peer colleague or student2, in that it is reasonable to ask an LLM or a colleague to look over what you have done and highlight anything that may be incorrect. But it would be unreasonable to ask another student or colleague to simply do all your work for you, or to write up a report.
- If you cannot explain what you have submitted, or provide justified reasoning for it, then you are not the one who should be awarded the grade (the LLM should be!)
With that context in mind, you are allowed to use generative artificial intelligence for work in this class under the following conditions:
You must provide the complete transcripts of your interactions with any LLMs or generative AI used as part of this work. If it is determined that generative AI or LLMs were likely used more than what the provided transcripts suggest, or if generative AI or LLMs were likely used and no transcripts are provided, then the student will be reported for academic misconduct. Please don’t do this. Dealing with academic misconduct is the worst part of this job.
All students will participate in a mandatory viva for all submitted work. Your grade will be moderated by your capability to explain what you have done, and to justify your answers.
Hogg, D. W., Bovy, J., and Lang, D., 2010. Data analysis recipes: Fitting a model to data. ↩︎
This useful perspective was provided by Adrian Price-Whelan (Flatiron Institute). ↩︎