Why This Course?
You’ve almost graduated with a degree in physics or astronomy.
You can solve differential equations, claim to understand quantum mechanics, and perhaps derive the Friedmann equations. So why do you need a course on data analysis?
Most of what you will do as a researcher is solve inverse problems – inferring unknowns from noisy, incomplete data. And most physics curricula don’t teach you how to do this rigorously.
You’ve probably done one or more of the following:
- Fit lines to data with `np.polyfit` or `curve_fit`
- Calculated error bars by “propagating uncertainties”
- Looked at a p-value and never really admitted to yourself that you don’t understand what it means
- Been confused about when to use $\chi^2$ or likelihoods
- Wondered what this Bayesian religion is that people keep talking about
This course will change how you think about data. And it will make you rich. Rich beyond your wildest dreams! Either because you’ll use what you learn in this class in your research (now or in the future), or because you’ll use what you learn here when you go into industry. Or in the worst possible scenario, you’ll be rich with knowledge. The point is: don’t feel upset if you have no plans to use rigorous data analysis methods in the future. This course will still be extremely valuable to you.
Generative Models
In the physical sciences, the right way to think about data is to think about how they were generated. That is to say: what set of inputs gave rise to the measured values. Here is a more formal definition:
A generative model is a parametrised, quantitative description of a statistical procedure that could reasonably have generated the data set you have. 1
Now consider that you have been given some data set consisting of $x$ and $y$ values. You are asked to provide a model that describes these data. Each data point has uncertainties in the $x$- and $y$-directions, and perhaps there is even some covariance between the uncertainties in the two directions.

We will assume that the data can be modelled by a straight line, which is the simplest model we can think of. We can imagine something more exotic later, but it is always the right idea to start simple. Note that we almost never know if the model we have adopted is the true model or not, but incorrect models are still useful!
Here our model – or more specifically, our generative model – is one that can describe how the measured data points were generated. When you have a generative model there is very little arbitrariness: if a choice of model or method is inconsistent with the data, that will become apparent. You can imagine a generative model as one where, if you “turned the crank”, it would generate more data points for you (perhaps given some inputs $x$).
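To make this concrete, here is a minimal sketch of such a “crank-turning” model in Python, assuming a straight line $y = mx + b$ with Gaussian uncertainties in $y$ only; the parameter values and noise levels below are arbitrary placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up "true" parameters, chosen purely for illustration
m_true, b_true = 2.0, 1.0

def generate_data(x, m, b, sigma_y, rng):
    """Turn the crank: given inputs x and parameters (m, b), produce
    noisy measurements y with known Gaussian uncertainties sigma_y."""
    y_true = m * x + b                  # noise-free model prediction
    return rng.normal(y_true, sigma_y)  # add Gaussian measurement noise

x = np.sort(rng.uniform(0, 10, size=20))      # inputs
sigma_y = rng.uniform(0.5, 1.5, size=x.size)  # per-point uncertainties
y = generate_data(x, m_true, b_true, sigma_y, rng)
```

Running this function again (with the same $x$ or new ones) produces another plausible data set, which is exactly what we mean by a model that could have generated the data you have.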
Now, you might think that if there is a problem in physics or astronomy that is so simple that it can be fit with just a straight line, then it cannot be an interesting problem.
The Hubble Constant
The Hubble Constant describes the current rate of the Universe’s expansion. It is usually expressed in units of kilometers per second per megaparsec. If you want to measure the expansion rate of the Universe today, the recipe seems relatively straightforward (a naive version is sketched in code after this list):
- Measure distances to galaxies: $d_i \pm \sigma_{d,i}$
- Measure recession velocities: $v_i \pm \sigma_{v,i}$
- Fit: $v = H_0 \cdot d$
- Report: $H_0 \pm \sigma_{H_0}$
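Here is a minimal sketch of that naive recipe, assuming Gaussian velocity uncertainties only and using fabricated example numbers (these are not real measurements):

```python
import numpy as np

# Made-up example data: distances (Mpc), velocities (km/s), velocity errors
d = np.array([10., 25., 40., 60., 85.])
v = np.array([720., 1800., 2750., 4300., 6000.])
sigma_v = np.array([150., 200., 250., 300., 400.])

# Weighted least squares for v = H0 * d (a straight line through the origin)
w = 1.0 / sigma_v**2
H0_hat = np.sum(w * d * v) / np.sum(w * d**2)
H0_err = np.sqrt(1.0 / np.sum(w * d**2))

print(f"H0 = {H0_hat:.1f} +/- {H0_err:.1f} km/s/Mpc")
```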
But reality is messy:
- Distance measurements have asymmetric errors (not Gaussian!)
- Unknown systematic uncertainties (Cepheid calibration, SN Ia standardization)
- Selection effects: You can only measure galaxies above a brightness threshold
- Outliers: Some galaxies have peculiar velocities
- Model uncertainty: Is the relation exactly linear? Should we include a redshift correction?
- Prior information: Previous measurements (Planck, HST) disagree—how do we combine them?
Traditional approaches struggle:
- χ² minimization assumes Gaussian errors (not true)
- Error propagation formulas assume linearity (not true for systematics)
- p-values don’t tell you “How confident should I be in $H_0 = 70$ km/s/Mpc?”
- No principled way to incorporate prior knowledge or constraints
Bayesian inference handles all of this naturally. We will introduce Bayesian probability in the next lesson, but if you haven’t heard of it already, it is simply a perspective on how to reason with uncertain, noisy data.

What You’ll Learn to Do
By the end of this course, you’ll be able to:
1. Fit Complex Models to Messy Data
- Handle non-Gaussian errors, outliers, censoring, missing data
- Fit hierarchical models (e.g., galaxy populations with group-level properties)
- Incorporate systematic uncertainties properly
2. Quantify Uncertainty Correctly
- Understand what credible intervals actually mean
- Propagate uncertainties through complex calculations
- Distinguish between measurement error and model uncertainty
3. Make Predictions
- Forecast future observations and make decisions
- Generate posterior predictive distributions
4. Compare Models
- Is a power-law or exponential better for this galaxy profile?
- Does adding another parameter improve the model, or is it just overfitting?
- Use information criteria and Bayes factors
5. Think Causally
- Avoid spurious correlations (selection effects, confounders)
- Understand when you can infer causation rather than mere association
- Design analyses that answer scientific questions (not just “what correlates with what?”)
6. Validate Your Results
- Prior predictive checks: Does my model make sense before seeing data?
- Posterior predictive checks: Does my fitted model reproduce the data?
- Sensitivity analysis: How much do my conclusions depend on assumptions?
What Makes This Course Different?
The philosophy of this course is that you’ll learn to think like a Bayesian, not just apply Bayesian formulas.
Many stats courses teach:
- Formulas (t-tests, ANOVA, regression)
- p-values and null hypothesis testing
- “Cookbook” approaches
This course teaches:
- Generative modeling: Build a model of how your data were generated
- Probability as a tool for inference: What do I believe given what I observed?
- Computational methods: MCMC, variational inference
- Model checking: Does my model make sense? How do I know?
- Causal thinking: What questions can I actually answer with my data?
Why this course matters:
- Astronomy is fundamentally about inference from noisy, incomplete data
- Traditional methods (χ², p-values) are often inadequate for real problems
- Bayesian inference provides a principled, flexible framework
- You’ll use these tools in your research career
What you’ll gain:
- Ability to fit complex models to messy data
- Proper uncertainty quantification
- Skills to validate and criticize your models
- Computational tools (MCMC, VI) used across science
The Workflow You’ll Master
For every data analysis problem, you should start with the simplest model you can possibly think of. For example, you might know that the data have uncertainties in the $x$-direction, but simply ignore them in your first model. Ignore everything complicated and just start simple. You’ll know you’re getting the hang of data analysis when you find yourself feeling slightly gross about using such a simple model for the problem, but that is the place to start. Once you have a simple model that you know works, you can add complexity and compare your more complex models to your simple baseline. This is an extraordinary superpower for building complex models and understanding where and why things went wrong.
If you start with the most complex model, you will likely find that it doesn’t work, and it takes a long time to run! When something takes a long time to run and there are a dozen different things that could be causing the problem, you are hosed. Start simple and add complexity one bit at a time. I cannot possibly stress this enough.
Each time you are building a model, you should be following this cycle:
- Ask a scientific question: causal or predictive
- Build a generative model: how were the data produced?
- Encode domain knowledge: priors, constraints
- Check the model before fitting: prior predictive simulation (sketched below)
- Fit the model: analytical when possible, optimization, VI, MCMC
- Diagnose convergence: did the fitting fail? did the sampler work?
- Check the fit: posterior predictive—does it reproduce the data?
- Interpret & communicate: what did we learn?
- Iterate: revise model and compare
This workflow prevents most common mistakes. The rubric of your project-based work is structured to require this workflow, so you should practice it as early and often as you can.
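As a concrete example of the “check the model before fitting” step, here is a minimal prior predictive simulation for the straight-line model from earlier; the priors below are arbitrary placeholders, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)   # the inputs you actually observed

# Draw parameters from the prior, then "turn the crank" to simulate data sets
n_draws = 200
m_prior = rng.normal(0.0, 5.0, size=n_draws)   # placeholder prior on slope
b_prior = rng.normal(0.0, 10.0, size=n_draws)  # placeholder prior on intercept
sigma = 1.0                                    # assumed known noise level

y_sim = rng.normal(m_prior[:, None] * x + b_prior[:, None], sigma)

# Eyeball check: do the simulated data sets span physically sensible values?
print("5th-95th percentile of simulated y:", np.percentile(y_sim, [5, 95]))
```

If the simulated data sets look absurd (wrong sign, wrong order of magnitude), your priors or your model need revisiting before you ever touch the real data.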
Prerequisites and Expectations
What you need:
- Comfort with calculus and linear algebra
- Basic programming (e.g., Python)
- Willingness to think probabilistically
What you don’t need:
- Prior statistics knowledge (we assume none!)
- Machine learning background
- Measure theory or advanced probability (nice to have, not required)
Course structure:
- Lectures: Concepts, theory, worked examples
- Hands-on coding: You’ll implement methods from scratch
- Projects: Apply to real astronomy problems
A Glimpse of What’s Coming
Week 1: Foundations
- Motivation
- Probability as a tool for reasoning under uncertainty
- Bayesian updating in practice: conjugate priors, sequential learning
Week 2: Workflow & Model Building
- Prior predictive checks and model validation
- Posterior inference: MCMC fundamentals
- Linear models and why everything is a linear model
Week 3: Fitting models to data
- Fitting a line to data, the simple way
- Fitting a line to data, the correct way
- Optimisation
Week 4: Inference
- Fundamentals of Markov-chain Monte Carlo
- Markov-chain Monte Carlo in practice
- Model comparison and selection
Week 5: Advanced Models
- Gaussian Processes I
- Gaussian Processes II
- Hierarchical models
Week 6: Missing Data and Latent Variables
- How missing data and selection effects ruin everything
- Latent variable models
- Time series analysis
By the end: You’ll be able to tackle real research problems with confidence.
Assessments
The assessment structure for this unit is:
Problem Set 1
- Released Friday of Week 1
- Due Friday of Week 2
- 20% of final grade
Problem Set 2
- Released Friday of Week 2
- Due Friday of Week 3
- 20% of final grade
Problem Set 3
- Released Friday of Week 3
- Due Friday of Week 4
- 20% of final grade
Project
- Released Friday of Week 3
- Due Thursday of Week 6
- 40% of final grade
Approach to generative artificial intelligence (AI) and large language models (LLMs)
You are here to learn. LLMs can be useful for searching topics, summarising information, and clarifying understanding. But supplying your problem to an LLM and directly copying the output into your assessment does not help you learn. It builds a cognitive dissonance between you and the problem, and gives you a false feeling that ‘you did stuff’ or ‘you know how to do things’. It can sometimes be difficult to understand where the grey line sits between ‘never use the LLM’ and ‘always use the LLM’. Here are some guiding principles for this class:
- If you used the LLM to help you with a problem, and you could not later solve that problem yourself without calling on an LLM again, you have used it too much.
- You should consider treating LLMs like a peer colleague or student2, in that it is reasonable to ask an LLM or a colleague to look over what you have done and highlight anything that may be incorrect. But it would be unreasonable to ask another student or colleague to simply do all your work for you, or to write up a report.
- If you cannot explain what you have submitted, or provide justified reasoning for it, then you are not the one who should be awarded the grade (the LLM should be!)
With that context in mind, you are allowed to use generative artificial intelligence for work in this class under the following conditions:
You must provide the complete transcripts of your interactions with any LLMs or generative AI used as part of this work. If it is determined that generative AI or LLMs were likely used more than what the provided transcripts suggest, or if generative AI or LLMs were likely used and no transcripts are provided, then the student will be reported for academic misconduct. Please don’t do this. Dealing with academic misconduct is the worst part of this job.
All students will participate in a mandatory viva for all submitted work. Your grade will be moderated by your capability to explain what you have done, and to justify your answers.
Hogg, D. W., Bovy, J., and Lang, D., 2010. Data analysis recipes: Fitting a model to data. ↩︎
This useful perspective was provided by Adrian Price-Whelan (Flatiron Institute). ↩︎