Fitting a Line to Data I: Least Squares
Fitting a model to data means finding the parameters that best describe the observed values—but "best" requires a precise definition, and the choice of definition encodes physical assumptions that have real consequences for the result.
This lecture introduces the machinery of building models for the simplest case: fitting a straight line $y = mx + b$ to data with Gaussian errors. We derive the connection between likelihood maximisation and chi-squared minimisation from first principles, develop the matrix formulation that makes the solution exact and computationally efficient, show why the way you visualise the results matters for interpreting parameter uncertainty, and extend the model to handle measurement uncertainty in the $x$-direction.
Even if you know your model is going to be extremely complex, starting with a simplified version will give you a baseline for comparison, and it lets you add complexity without getting into a situation of "nothing is working and I don't know which of the 7 things I added at the same time is causing issues".
This lecture covers steps 1–3 of the principled Bayesian workflow from lecture 4: specifying the data and parameters, writing down the generative model, and deriving the likelihood. Priors (step 4) and posterior sampling via MCMC (step 5) are deliberately deferred to lecture 6, where we revisit this same dataset with the full machinery. Posterior predictive checks (step 6) appear in nascent form here as residual diagnostics in Section 7.
1. The Dataset
To make this concrete, we work throughout this lecture with the 20-point dataset from Hogg, Bovy & Lang (2010), a canonical example in the astronomical data analysis literature. Each data point has:
- An $x$ coordinate with uncertainty $\sigma_x$
- A $y$ coordinate with uncertainty $\sigma_y$
- A correlation coefficient $\rho_{xy}$ between the $x$ and $y$ measurement errors
| $x$ | $y$ | $\sigma_y$ | $\sigma_x$ | $\rho_{xy}$ |
|---|---|---|---|---|
| 201 | 592 | 61 | 9 | −0.84 |
| 244 | 401 | 25 | 4 | +0.31 |
| 47 | 583 | 38 | 11 | +0.64 |
| 287 | 402 | 15 | 7 | −0.27 |
| 203 | 495 | 21 | 5 | −0.33 |
| 58 | 173 | 15 | 9 | +0.67 |
| 210 | 479 | 27 | 4 | −0.02 |
| 202 | 504 | 14 | 4 | −0.05 |
| 198 | 510 | 30 | 11 | −0.84 |
| 158 | 416 | 16 | 7 | −0.69 |
| 165 | 393 | 14 | 5 | +0.30 |
| 201 | 442 | 25 | 5 | −0.46 |
| 157 | 317 | 52 | 5 | −0.03 |
| 131 | 311 | 16 | 6 | +0.50 |
| 166 | 400 | 34 | 6 | +0.73 |
| 160 | 337 | 31 | 5 | −0.52 |
| 186 | 423 | 42 | 9 | +0.90 |
| 125 | 334 | 26 | 8 | +0.40 |
| 218 | 533 | 16 | 6 | −0.78 |
| 146 | 344 | 22 | 5 | −0.56 |
Figure 1 shows the data with three levels of uncertainty representation. Our first model uses only the $y$ uncertainties (top panel); the extensions in Section 9 and the next lecture address the richer structure visible in the lower two panels.
In workflow terms: the data are $\mathcal{D} = \\{(x_i, y_i, \sigma_{y_i})\\}_{i=1}^{20}$ (treating $x_i$ as exact for now) and the parameters are $\boldsymbol{\theta} = (m, b)$—the slope and intercept of the line.

Figure 1: The 20-point Hogg et al. (2010) dataset shown with three levels of uncertainty representation. Top: $y$ uncertainties only—the simplification used through Section 7. Middle: independent $x$ and $y$ error bars ($\sigma_x$ and $\sigma_y$, $\rho_{xy}$ ignored). Bottom: full $2\times2$ covariance ellipses incorporating $\sigma_x$, $\sigma_y$, and the correlation $\rho_{xy}$. The distinction between the middle and bottom panels becomes important in the next lecture.
2. The Likelihood Function
For a straight line with Gaussian measurement errors in $y$, the generative model is:
$$y_i = m x_i + b + \varepsilon_i \qquad \text{where} \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_{y_i}^2)$$

This is the data-generating process: the true line $mx_i + b$ sets the expected value, and Gaussian noise with known variance $\sigma_{y_i}^2$ is added to produce the observed $y_i$. In a DAG, the causal structure is simply $\\{m, b\\} \to \mu_i = mx_i + b \to y_i \leftarrow \sigma_{y_i}$.
The probability of observing $y_i$ given the parameters and the true $x_i$ is therefore:
$$p(y_i \mid x_i, \sigma_{y_i}, m, b) = \frac{1}{\sqrt{2\pi\sigma_{y_i}^2}} \exp\left(-\frac{[y_i - m x_i - b]^2}{2\sigma_{y_i}^2}\right)$$

The likelihood is this probability viewed as a function of the parameters $(m, b)$ for fixed observed data. Assuming the observations are independent:
$$\mathcal{L}(m, b) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_{y_i}^2}} \exp\left(-\frac{[y_i - m x_i - b]^2}{2\sigma_{y_i}^2}\right)$$

Maximum likelihood estimation finds the parameter values $(\hat{m}, \hat{b})$ that maximise $\mathcal{L}(m, b)$.
2.1 The Log-Likelihood
Products of many small probabilities underflow to numerical zero for any realistic dataset, whereas sums of log probabilities remain comfortably within floating-point range. We therefore always work with the log-likelihood:
$$\log \mathcal{L}(m, b) = -\frac{1}{2}\sum_{i=1}^{N} \frac{[y_i - m x_i - b]^2}{\sigma_{y_i}^2} - \sum_{i=1}^{N} \log \sigma_{y_i} - \frac{N}{2}\log(2\pi)$$

The second and third terms do not depend on the parameters $(m, b)$ when the $\sigma_{y_i}$ are known and fixed. Dropping them:
$$\log \mathcal{L}(m, b) = -\frac{1}{2} \sum_{i=1}^{N} \frac{[y_i - m x_i - b]^2}{\sigma_{y_i}^2} + \text{const}$$

3. From Likelihood to Chi-Squared
The sum in the log-likelihood is precisely the chi-squared statistic:
$$\chi^2 \equiv \sum_{i=1}^{N} \frac{[y_i - f(x_i)]^2}{\sigma_{y_i}^2}$$

where $f(x_i) = m x_i + b$ is the model prediction. Therefore:
$$\log \mathcal{L}(m, b) = -\frac{1}{2}\chi^2 + K$$where $K$ is a constant independent of $(m, b)$.
The consequences:

- Maximising the likelihood is equivalent to minimising chi-squared—under the assumptions stated above.
- Minimising the negative log-likelihood is equivalent to minimising chi-squared. In optimisation code you will often see a function called `nll` (negative log-likelihood) or `nlp` (negative log-posterior) that is minimised—this is equivalent to maximising the likelihood (or posterior, respectively).
- The likelihood is proportional to $\exp(-\frac{1}{2}\chi^2)$—not to $\chi^2$ itself. This becomes critical when visualising parameter space (Section 6).
The connection breaks down when the assumptions change. If the errors are not Gaussian, or if $\sigma_{y_i}$ contains unknown parameters, the minimum-$\chi^2$ solution is no longer the maximum likelihood solution.
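The equivalence is easy to check numerically. Below is a minimal sketch on synthetic data (the slope, intercept, and noise level are invented for the check—this is not the Hogg dataset): the full log-likelihood and $-\frac{1}{2}\chi^2$ differ by the same constant $K$ at every point in parameter space.

```python
import numpy as np

# Synthetic data, invented purely for this check
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
sigma_y = np.full(x.size, 2.0)
y = 1.5 * x + 3.0 + rng.normal(0.0, sigma_y)

def log_likelihood(m, b):
    """Full Gaussian log-likelihood, constant terms included."""
    resid = y - (m * x + b)
    return (-0.5 * np.sum(resid**2 / sigma_y**2)
            - np.sum(np.log(sigma_y))
            - 0.5 * x.size * np.log(2 * np.pi))

def chi2(m, b):
    resid = y - (m * x + b)
    return np.sum(resid**2 / sigma_y**2)

# log L + chi^2/2 equals the same constant K at any two points (m, b)
K1 = log_likelihood(1.5, 3.0) + 0.5 * chi2(1.5, 3.0)
K2 = log_likelihood(0.2, -7.0) + 0.5 * chi2(0.2, -7.0)
print(np.isclose(K1, K2))  # True: the surfaces differ only by a constant
```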
4. The Matrix Formulation
Most problems in data analysis do not have a nice analytic solution: they require good initial guesses of parameters, complex optimisation schedules, and a robust sampler to approximate the posterior. However, for a linear model, the maximum likelihood solution has an exact, closed-form expression. Setting up the problem in matrix form reveals this structure cleanly.
4.1 The Four Matrices
The data are a column vector $\mathbf{Y}$ of $N$ data points:
$$\mathbf{Y} = \begin{bmatrix} y_1 \\\\ y_2 \\\\ \vdots \\\\ y_N \end{bmatrix}$$

The data covariance matrix $\mathbf{C}$ is an $N \times N$ diagonal matrix (for independent observations):

$$\mathbf{C} = \begin{bmatrix} \sigma_{y_1}^2 & 0 & \cdots & 0 \\\\ 0 & \sigma_{y_2}^2 & \cdots & 0 \\\\ \vdots & & \ddots & \vdots \\\\ 0 & 0 & \cdots & \sigma_{y_N}^2 \end{bmatrix}$$

The design matrix $\mathbf{A}$ is an $N \times P$ matrix, where $P$ is the number of parameters. For a straight line ($P = 2$), each row encodes the model's dependence on the parameters for one data point. The first column of ones corresponds to the intercept $b$; the second column of $x$ values corresponds to the slope $m$:

$$\mathbf{A} = \begin{bmatrix} 1 & x_1 \\\\ 1 & x_2 \\\\ \vdots & \vdots \\\\ 1 & x_N \end{bmatrix}$$

Lastly, the parameter vector $\mathbf{X}$ contains the parameters we are fitting:

$$\mathbf{X} = \begin{bmatrix} b \\\\ m \end{bmatrix}$$

4.2 Chi-Squared in Matrix Form
With these definitions, the chi-squared statistic is:
$$\chi^2 = [\mathbf{Y} - \mathbf{A}\mathbf{X}]^\top \mathbf{C}^{-1} [\mathbf{Y} - \mathbf{A}\mathbf{X}]$$

4.3 The Best-Fit Solution
Differentiating with respect to $\mathbf{X}$ and setting equal to zero yields the normal equations:
$$\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\hat{\mathbf{X}} = \mathbf{A}^\top \mathbf{C}^{-1} \mathbf{Y}$$

Solving:
$$\boxed{\hat{\mathbf{X}} = \left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\right]^{-1} \left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{Y}\right]}$$

This is the weighted least squares solution. For equal weights ($\mathbf{C} = \sigma^2 \mathbf{I}$), it reduces to ordinary least squares.
```python
import numpy as np

x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146])
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344])
y_err = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                  14, 25, 52, 16, 34, 31, 42, 26, 16, 22])

A = np.column_stack([np.ones(len(x)), x])     # design matrix (N × 2)
C_inv = np.diag(1.0 / y_err**2)               # inverse covariance (N × N)
ATCinvA_inv = np.linalg.inv(A.T @ C_inv @ A)  # parameter covariance (2 × 2)
X_hat = ATCinvA_inv @ (A.T @ C_inv @ y)       # best-fit parameters [b, m]
b_fit, m_fit = X_hat

print(f"m = {m_fit:.4f} ± {np.sqrt(ATCinvA_inv[1,1]):.4f}")
print(f"b = {b_fit:.2f} ± {np.sqrt(ATCinvA_inv[0,0]):.2f}")
```

For this dataset: $\hat{m} \approx 1.08$, $\hat{b} \approx 213$.
Don't use `np.linalg.inv` in production. For ill-conditioned systems it fails silently—returning a numerically wrong answer without raising an error. Prefer:

```python
X_hat = np.linalg.lstsq(A.T @ C_inv @ A, A.T @ C_inv @ y, rcond=None)[0]
```

which solves the system directly and can be regularised for ill-conditioned cases.
5. Parameter Uncertainties
The best-fit parameters $\hat{\mathbf{X}}$ are point estimates. The covariance matrix of the parameters is:
$$\text{Cov}(\hat{\mathbf{X}}) = \left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\right]^{-1}$$

For a straight line, the elements are:
$$\left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\right]^{-1} \equiv \begin{bmatrix} \sigma_b^2 & \sigma_{mb} \\\\ \sigma_{mb} & \sigma_m^2 \end{bmatrix}$$

where $\sigma_b^2$ is the variance on the intercept, $\sigma_m^2$ is the variance on the slope, and $\sigma_{mb}$ is the covariance between them. This matrix $\text{Cov}(\hat{\mathbf{X}})$ is already computed as part of the solution above.
```python
sigma_b = np.sqrt(ATCinvA_inv[0, 0])
sigma_m = np.sqrt(ATCinvA_inv[1, 1])
print(f"σ_m = {sigma_m:.4f}, σ_b = {sigma_b:.2f}")
```

Because the likelihood is Gaussian in $(m, b)$ for a linear model, the parameter uncertainty is completely described by $\hat{\mathbf{X}}$ and $\text{Cov}(\hat{\mathbf{X}})$. You can draw samples from this distribution to propagate uncertainty into predictions:
```python
# Draw 1000 parameter samples from p(m, b | data)
samples = np.random.multivariate_normal(X_hat, ATCinvA_inv, size=1000)
b_samples, m_samples = samples.T
```

Under a uniform prior, `np.random.multivariate_normal` here draws from the parameter posterior $p(m, b \mid \mathcal{D})$—each sample is a line consistent with the observations, and pushing the samples through the model propagates the parameter uncertainty into predictions. When we move to non-linear models in lecture 6, MCMC replaces the analytic Gaussian, but the sampling step is identical in concept. Figure 2 shows the best-fit line and 200 draws from the parameter distribution overlaid on the data.

Figure 2: The 20-point dataset with best-fit line (solid) and 200 parameter samples (faint lines). The spread of faint lines represents the uncertainty in the fitted parameters.
6. Visualising Parameter Space
To understand where parameter uncertainty comes from, it helps to visualise the objective function in the $(m, b)$ plane.
6.1 Two Ways to Plot the Same Thing
For each point $(m, b)$ on a grid, compute $\chi^2$ using the full dataset. Then you have two options:
- Plot $\chi^2$ directly — shows a valley, with smaller $\chi^2$ at the best-fit
- Plot $\exp(-\frac{1}{2}\chi^2)$ — shows the likelihood surface, with a peak at the best-fit
These contain identical information but look very different, and the visual impression of uncertainty is very different.
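Concretely, both surfaces can be evaluated on a grid. The sketch below re-enters the data and recomputes the weighted least-squares solution from Section 4 so that it runs on its own; the $\pm 4\sigma$ grid extent is an arbitrary choice for display.

```python
import numpy as np

# Hogg et al. (2010) data from Section 1
x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146])
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344])
y_err = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                  14, 25, 52, 16, 34, 31, 42, 26, 16, 22])

A = np.column_stack([np.ones(len(x)), x])
C_inv = np.diag(1.0 / y_err**2)
cov = np.linalg.inv(A.T @ C_inv @ A)
b_hat, m_hat = cov @ (A.T @ C_inv @ y)

# Grid centred on the best fit, spanning ±4 standard deviations
m_grid = m_hat + np.linspace(-4, 4, 201) * np.sqrt(cov[1, 1])
b_grid = b_hat + np.linspace(-4, 4, 201) * np.sqrt(cov[0, 0])
M, B = np.meshgrid(m_grid, b_grid)

resid = y - (M[..., None] * x + B[..., None])   # shape (201, 201, 20)
chi2 = np.sum(resid**2 / y_err**2, axis=-1)     # the valley

# Subtracting the minimum before exponentiating only rescales by a
# constant, but avoids floating-point underflow when chi^2 is large
likelihood = np.exp(-0.5 * (chi2 - chi2.min()))  # the peak

i, j = np.unravel_index(np.argmin(chi2), chi2.shape)
print(f"grid minimum at m = {M[i, j]:.3f}, b = {B[i, j]:.1f}")
```

Plotting `chi2` gives the valley view and `likelihood` the peak view; the grid minimum coincides with the analytic best fit.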
6.2 Why $\exp(-\frac{1}{2}\chi^2)$ Is Correct
The $n\sigma$ uncertainty ellipse is defined by:
$$\Delta\chi^2 = \chi^2 - \chi^2_\text{min} = \Delta_\text{conf}$$

For two parameters, the 1$\sigma$ ($68.3\%$), 2$\sigma$ ($95.4\%$), and 3$\sigma$ ($99.7\%$) contours correspond to $\Delta\chi^2 = 2.30$, $6.17$, and $11.83$ respectively. This $\Delta\chi^2$ boundary aligns with the uncertainty ellipse from $\text{Cov}(\hat{\mathbf{X}})$—but only on the $\exp(-\frac{1}{2}\chi^2)$ surface, not on the raw $\chi^2$ surface.
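These thresholds follow from the fact that, for two fitted parameters, $\Delta\chi^2$ follows a chi-squared distribution with two degrees of freedom, whose CDF has the closed form $1 - e^{-\Delta/2}$. A quick standard-library check (the enclosed probabilities are rounded to four figures):

```python
import math

# For 2 degrees of freedom, P(Delta chi^2 <= t) = 1 - exp(-t/2),
# so the threshold enclosing probability p is t = -2 ln(1 - p)
thresholds = {}
for n_sigma, p in [(1, 0.6827), (2, 0.9545), (3, 0.9973)]:
    thresholds[n_sigma] = -2.0 * math.log(1.0 - p)
    print(f"{n_sigma} sigma: Delta chi^2 = {thresholds[n_sigma]:.2f}")
```

This reproduces 2.30 and 11.83; the middle threshold evaluates to 6.18, with the commonly tabulated 6.17 differing in the last digit through rounding.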
When you plot raw $\chi^2$, the visual shape of the valley does not directly correspond to probability. When you plot $\exp(-\frac{1}{2}\chi^2)$, the shape of the peak is the probability distribution over parameters.

Figure 3: Left: the raw $\chi^2$ surface over the $(m, b)$ plane. Right: $\exp(-\frac{1}{2}\chi^2) \propto \mathcal{L}$. Both show the same information, but only the right panel correctly represents parameter probability: the 3$\sigma$ covariance ellipse (white) aligns with the $\exp(-\frac{1}{2}\chi^2)$ contour at the appropriate level.
7. Residuals and Model Checking
After fitting, always examine the residuals. The standardised residual for each point is:
$$r_i = \frac{y_i - \hat y_i}{\sigma_{y_i}}$$

where $\hat{y}_i = \hat{m} x_i + \hat{b}$. If the model is correct and the errors are accurately specified, these should look like independent draws from $\mathcal{N}(0, 1)$.
```python
y_fit = m_fit * x + b_fit
residuals = (y - y_fit) / y_err
chi2_dof = np.sum(residuals**2) / (len(x) - 2)
print(f"χ²/dof = {chi2_dof:.2f}")
```
Figure 4: Top: the dataset with the best-fit line. Bottom: standardised residuals. Dashed lines mark $\pm 1\sigma$, dotted lines $\pm 2\sigma$. Points beyond $\pm 2\sigma$ are highlighted.
The reduced chi-squared $\chi^2_\nu = \chi^2 / (N - P)$, where $P$ is the number of fitted parameters, provides an overall check:
- $\chi^2_\nu \approx 1$: model fits data consistent with reported errors
- $\chi^2_\nu \gg 1$: model is too simple, or errors are underestimated
- $\chi^2_\nu \ll 1$: errors are overestimated, or model is over-fitted
Do not interpret reduced chi-squared without also examining residual plots—a value near 1 can still hide systematic structure.
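To make that caveat concrete, here is a deliberately contrived synthetic example: the data follow a parabola, a straight line is fitted, and the quoted errors are tuned so that $\chi^2/\mathrm{dof}$ comes out exactly 1—yet the residuals form long coherent runs that any residual plot would expose.

```python
import numpy as np

# Contrived example: the truth is a parabola, not a line
x = np.linspace(0, 100, 21)
y = 0.01 * (x - 50)**2

# Ordinary least-squares straight-line fit
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Tune the quoted errors so chi^2/dof is exactly 1
sigma = np.sqrt(np.sum(resid**2) / (x.size - 2))
chi2_dof = np.sum((resid / sigma)**2) / (x.size - 2)
print(f"chi2/dof = {chi2_dof:.2f}")  # 1.00, yet the model is badly wrong

# Residuals change sign only twice: long runs, not independent noise
sign_changes = int(np.sum(np.diff(np.sign(resid)) != 0))
print(f"residual sign changes: {sign_changes}")  # independent noise: ~10
```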
8. Communicating What You've Done
Getting the answer is only half the job. You also have to explain it—differently depending on who is listening.
8.1 For a Non-Technical Audience
Lead with the scientific conclusion in plain language, give the number with its uncertainty, and be honest about what the model cannot do.
We found that $y$ increases with $x$ at a rate of $m = 1.08 \pm 0.08$ — meaning each unit increase in $x$ is associated with a 1.08-unit increase in $y$. The relationship appears linear across the range of our data. However, our model is simplified: it ignores the measurement uncertainties in $x$ and assumes every data point follows the same relationship. A more careful analysis would likely change the slope slightly and produce a more realistic uncertainty.
The key discipline here is to not hide the limitations. A result presented without caveats is not more persuasive — it is less trustworthy.
8.2 For a Technical Audience
A technical reader does not need the derivation — they need to know exactly what you did so they can reproduce it, critique it, or build on it. Be succinct and complete.
We fitted a linear model to the 20-point Hogg et al. (2010) dataset using weighted least squares with diagonal covariance (y-errors only). We obtain $m = 1.08 \pm 0.08$ and $b = 213 \pm 14$ ($1\sigma$). The reduced chi-squared is $\chi^2_\nu \approx 16$, far above unity: residual inspection suggests a combination of underestimated uncertainties, intrinsic scatter, and outlier contamination. We assumed a linear relationship, Gaussian and independent measurement errors in $y$ only, with known and correctly reported $\sigma_{y_i}$. The $x$ measurement uncertainties and inter-point correlations present in the data were not propagated; doing so would widen the posterior on $m$ and shift the slope slightly.
8.3 The Assumptions Behind This Result
Every result is conditional on a set of modelling assumptions. Stating them explicitly — not burying them — is what separates a reproducible scientific result from a number.
| Assumption | What it means | Common violation |
|---|---|---|
| Linear relationship | $f(x_i) = mx_i + b$ is the true model | Curvature, saturation, threshold effects |
| No $x$ uncertainty | $x_i$ are known exactly | Parallax distances, photometric redshifts |
| Gaussian $y$ errors | $\varepsilon_i \sim \mathcal{N}(0, \sigma_{y_i}^2)$ | Poisson-dominated data, outlier contamination |
| Known $\sigma_{y_i}$ | Reported errors are correct | Underestimated systematics, calibration errors |
| Independent errors | $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$ | Shared calibration, correlated sky background |
| Correct model | No missing parameters | Intrinsic scatter, hidden covariates, outliers |
9. Uncertainties in the X Direction
The dataset contains $\sigma_x$ and $\rho_{xy}$ that our model so far completely ignores. This section introduces the theoretical framework for handling measurement uncertainty in $x$; the full numerical implementation follows in the next lecture.
9.1 The Problem
When $x$ values have measurement uncertainty, the true $x$ position $\hat{x}_i$ is unknown. Ignoring $\sigma_x$ biases the slope estimate toward zero—a phenomenon called regression dilution or attenuation bias. When $\sigma_x \sim \sigma_y / m$, this bias is significant.

Figure 5: The same dataset shown with both $x$ and $y$ error bars. Ignoring $\sigma_x$ when fitting introduces a systematic bias in the slope.
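Regression dilution is easy to demonstrate with a small simulation. Everything below is synthetic and invented for the demonstration; $y$ is left noiseless to isolate the effect of the $x$ errors.

```python
import numpy as np

rng = np.random.default_rng(0)
m_true, b_true = 2.0, 10.0
x_true = rng.uniform(0, 10, 1000)
y = m_true * x_true + b_true                 # noiseless y on purpose

sigma_x = 2.0
x_obs = x_true + rng.normal(0.0, sigma_x, x_true.size)

# Naive fit against the noisy observed x
slope = np.polyfit(x_obs, y, 1)[0]

# Classical attenuation factor: Var(x) / (Var(x) + sigma_x^2)
predicted = m_true * x_true.var() / (x_true.var() + sigma_x**2)
print(f"true slope {m_true}, fitted {slope:.2f}, predicted {predicted:.2f}")
```

The fitted slope lands well below the true value of 2, close to the classical attenuation prediction of roughly 1.35 for these numbers.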
9.2 Marginalising Over Unknown X Positions
The correct treatment is to marginalise over the unknown true $x$ position for each data point. If the true position $\hat x_i$ is distributed as $\mathcal{N}(x_i, \sigma_{x_i}^2)$, then:
$$p(y_i, x_i \mid \sigma_{x_i}, \sigma_{y_i}, m, b) = \int p(y_i \mid \hat x_i, \sigma_{y_i}, m, b) p(\hat x_i \mid x_i, \sigma_{x_i}) \mathrm{d}\hat x_i$$

The integrand is a Gaussian in $\hat x_i$ multiplied by another Gaussian in $\hat x_i$—and the convolution of two Gaussians is a Gaussian. After carrying out the integral analytically:
$$p(y_i, x_i \mid \sigma_{x_i}, \sigma_{y_i}, m, b) \propto \frac{1}{\sqrt{\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2}} \exp\left(-\frac{[y_i - m x_i - b]^2}{2(\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2)}\right)$$

The $x$ uncertainty contributes to the effective variance as $m^2 \sigma_{x_i}^2$—the $x$ uncertainty is projected onto the $y$ axis through the slope.
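The closed form can be verified by doing the $\hat x_i$ integral numerically for a single illustrative data point (all the numbers below are invented for the check):

```python
import numpy as np

m, b = 1.2, 200.0
x_i, y_i = 150.0, 380.0
sigma_x, sigma_y = 8.0, 20.0

def gauss(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma)**2) / (np.sqrt(2 * np.pi) * sigma)

# Marginalise numerically over the unknown true position x_hat
x_hat = np.linspace(x_i - 10 * sigma_x, x_i + 10 * sigma_x, 20001)
integrand = gauss(y_i, m * x_hat + b, sigma_y) * gauss(x_hat, x_i, sigma_x)
numeric = np.sum(integrand) * (x_hat[1] - x_hat[0])

# Closed form: a Gaussian with effective variance sigma_y^2 + m^2 sigma_x^2
s_eff = np.sqrt(sigma_y**2 + m**2 * sigma_x**2)
analytic = gauss(y_i, m * x_i + b, s_eff)

print(np.isclose(numeric, analytic, rtol=1e-6))  # True
```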
9.3 The Extended Log-Likelihood
The log-likelihood extended to include $x$ uncertainties is:
$$\log \mathcal{L}(m, b) = -\frac{1}{2}\sum_{i=1}^{N} \frac{[y_i - m x_i - b]^2}{\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2} - \frac{1}{2}\sum_{i=1}^{N}\log(\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2) + \text{const}$$

Two important consequences follow:
The $\log(\sigma_{y_i}^2 + m^2\sigma_{x_i}^2)$ terms cannot be dropped. Unlike the $y$-only case where the normalisation was a constant, here the effective variance depends on $m$—so its logarithm is part of the objective function.
The solution is no longer linear in the parameters. The $m^2$ dependence in the denominator means the normal equations no longer apply. We must resort to numerical optimisation, which is the topic of the next lecture.
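As a preview of that numerical approach, the extended objective itself is short to write down. The sketch below applies it to the Hogg data from Section 1, with a crude grid search standing in for the proper optimisers of the next lecture (the grid ranges are an arbitrary choice):

```python
import numpy as np

x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146])
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344])
y_err = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                  14, 25, 52, 16, 34, 31, 42, 26, 16, 22])
x_err = np.array([9, 4, 11, 7, 5, 9, 4, 4, 11, 7,
                  5, 5, 5, 6, 6, 5, 9, 8, 6, 5])

def neg_log_likelihood(m, b):
    """Extended objective: the log(var) term cannot be dropped."""
    var_eff = y_err**2 + m**2 * x_err**2   # effective variance depends on m
    resid = y - (m * x + b)
    return 0.5 * np.sum(resid**2 / var_eff + np.log(var_eff))

# Crude grid search; lecture 6 replaces this with MAP optimisation and MCMC
m_grid = np.linspace(0.5, 2.5, 101)
b_grid = np.linspace(50.0, 350.0, 151)
nll = np.array([[neg_log_likelihood(m, b) for b in b_grid] for m in m_grid])
i, j = np.unravel_index(np.argmin(nll), nll.shape)
print(f"m ≈ {m_grid[i]:.2f}, b ≈ {b_grid[j]:.2f}")
```

For this dataset the minimum stays close to the $y$-only solution, because the $m^2\sigma_{x_i}^2$ terms are small compared with $\sigma_{y_i}^2$—but the objective is now genuinely non-linear in $m$.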
10. Summary
The likelihood $\mathcal{L}(m, b) = \prod_i p(y_i \mid x_i, \sigma_{y_i}, m, b)$ is the objective function. For Gaussian errors with known $\sigma_{y_i}$, maximising likelihood is exactly equivalent to minimising chi-squared.
The matrix formulation gives the exact solution for the $y$-only case:
$$\hat{\mathbf{X}} = \left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\right]^{-1} \mathbf{A}^\top \mathbf{C}^{-1} \mathbf{Y}$$

with parameter covariance $\text{Cov}(\hat{\mathbf{X}}) = [\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}]^{-1}$.
Visualising parameter space correctly requires plotting $\exp(-\frac{1}{2}\chi^2)$, not raw $\chi^2$. The covariance ellipse from $\text{Cov}(\hat{\mathbf{X}})$ aligns with the likelihood contours, not the chi-squared contours.
Residuals are the diagnostic tool. Examine them for systematic structure after fitting. The reduced chi-squared $\chi^2_\nu$ provides an overall check but is not a substitute for visual residual inspection.
X uncertainties break the analytic solution. After marginalising over unknown true $x$ positions, the effective variance becomes $\sigma_{y_i}^2 + m^2\sigma_{x_i}^2$, and the log-normalisation term can no longer be dropped. The next lecture covers numerical methods (MAP estimation and MCMC) for solving the resulting non-linear problem, along with further extensions.
Communicating results requires matching the level of detail to the audience. For a non-technical reader: one sentence on the finding, one number with uncertainty, one honest caveat. For a technical reader: likelihood function, inference method, parameter estimates, model check, and a table of assumptions.
Further Reading
- Hogg, D. W., Bovy, J., & Lang, D. (2010). Data analysis recipes: Fitting a model to data. arXiv:1008.4686. [The canonical reference for this lecture—work through it in full]
- Press, W. H., et al. (2007). Numerical Recipes, Chapter 15. [Classical treatment of least-squares fitting and chi-squared]
- Andrae, R., Schulze-Hartung, T., & Melchior, P. (2010). Dos and don'ts of reduced chi-squared. arXiv:1012.3754.