Fitting a Line to Data I: Least Squares
Fitting a model to data means finding the parameters that best describe the observed values—but "best" requires a precise definition, and the choice of definition encodes physical assumptions that have real consequences for the result.
This lecture introduces the machinery of building models for the simplest case: fitting a straight line $y = mx + b$ to data with Gaussian errors. We derive the connection between likelihood maximisation and chi-squared minimisation from first principles, develop the matrix formulation that makes the solution exact and computationally efficient, show why the way you visualise the results matters for interpreting parameter uncertainty, and extend the model to handle measurement uncertainty in the $x$-direction.
Even if you know your model is going to be extremely complex, starting with a simplified version will give you a baseline for comparison, and it lets you add complexity without getting into a situation of "nothing is working and I don't know which of the 7 things I added at the same time is causing issues".
This lecture covers steps 1–3 of the principled Bayesian workflow from lecture 4: specifying the data and parameters, writing down the generative model, and deriving the likelihood. Priors (step 4) and posterior sampling via MCMC (step 5) are deliberately deferred to lecture 6, where we revisit this same dataset with the full machinery. Posterior predictive checks (step 6) appear in nascent form here as residual diagnostics in Section 7.
1. The Dataset
To make this concrete, we work throughout this lecture with the 20-point dataset from Hogg, Bovy & Lang (2010), a canonical example in the astronomical data analysis literature. Each data point has:
- An $x$ coordinate with uncertainty $\sigma_x$
- A $y$ coordinate with uncertainty $\sigma_y$
- A correlation coefficient $\rho_{xy}$ between the $x$ and $y$ measurement errors
| $x$ | $y$ | $\sigma_y$ | $\sigma_x$ | $\rho_{xy}$ |
|---|---|---|---|---|
| 201 | 592 | 61 | 9 | −0.84 |
| 244 | 401 | 25 | 4 | +0.31 |
| 47 | 583 | 38 | 11 | +0.64 |
| 287 | 402 | 15 | 7 | −0.27 |
| 203 | 495 | 21 | 5 | −0.33 |
| 58 | 173 | 15 | 9 | +0.67 |
| 210 | 479 | 27 | 4 | −0.02 |
| 202 | 504 | 14 | 4 | −0.05 |
| 198 | 510 | 30 | 11 | −0.84 |
| 158 | 416 | 16 | 7 | −0.69 |
| 165 | 393 | 14 | 5 | +0.30 |
| 201 | 442 | 25 | 5 | −0.46 |
| 157 | 317 | 52 | 5 | −0.03 |
| 131 | 311 | 16 | 6 | +0.50 |
| 166 | 400 | 34 | 6 | +0.73 |
| 160 | 337 | 31 | 5 | −0.52 |
| 186 | 423 | 42 | 9 | +0.90 |
| 125 | 334 | 26 | 8 | +0.40 |
| 218 | 533 | 16 | 6 | −0.78 |
| 146 | 344 | 22 | 5 | −0.56 |
Figure 1 shows the data with three levels of uncertainty representation. Our first model uses only the $y$ uncertainties (top panel); the extensions in Section 9 and the next lecture address the richer structure visible in the lower two panels.
In workflow terms: the data are $\mathcal{D} = \\{(x_i, y_i, \sigma_{y_i})\\}_{i=1}^{20}$ (treating $x_i$ as exact for now) and the parameters are $\boldsymbol{\theta} = (m, b)$—the slope and intercept of the line.

Figure 1: The 20-point Hogg et al. (2010) dataset shown with three levels of uncertainty representation. Top: $y$ uncertainties only—the simplification used through Section 7. Middle: independent $x$ and $y$ error bars ($\sigma_x$ and $\sigma_y$, $\rho_{xy}$ ignored). Bottom: full $2\times2$ covariance ellipses incorporating $\sigma_x$, $\sigma_y$, and the correlation $\rho_{xy}$. The distinction between the middle and bottom panels becomes important in the next lecture.
2. The Likelihood Function
For a straight line with Gaussian measurement errors in $y$, the generative model is:
$$y_i = m x_i + b + \varepsilon_i \qquad \text{where} \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_{y_i}^2)$$

This is the data-generating process: the true line $mx_i + b$ sets the expected value, and Gaussian noise with known variance $\sigma_{y_i}^2$ is added to produce the observed $y_i$. In a DAG, the causal structure is simply $\\{m, b\\} \to \mu_i = mx_i + b \to y_i \leftarrow \sigma_{y_i}$.
The probability of observing $y_i$ given the parameters and the true $x_i$ is therefore:
$$p(y_i \mid x_i, \sigma_{y_i}, m, b) = \frac{1}{\sqrt{2\pi\sigma_{y_i}^2}} \exp\left(-\frac{[y_i - m x_i - b]^2}{2\sigma_{y_i}^2}\right)$$

The likelihood is this probability viewed as a function of the parameters $(m, b)$ for fixed observed data. Assuming the observations are independent:
$$\mathcal{L}(m, b) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_{y_i}^2}} \exp\left(-\frac{[y_i - m x_i - b]^2}{2\sigma_{y_i}^2}\right)$$

Maximum likelihood estimation finds the parameter values $(\hat{m}, \hat{b})$ that maximise $\mathcal{L}(m, b)$.
2.1 The Log-Likelihood
Products of many small probabilities underflow to numerical zero for any realistic dataset, whereas sums of log probabilities remain comfortably within floating-point range. We therefore always work with the log-likelihood:
$$\log \mathcal{L}(m, b) = -\frac{1}{2}\sum_{i=1}^{N} \frac{[y_i - m x_i - b]^2}{\sigma_{y_i}^2} - \sum_{i=1}^{N} \log \sigma_{y_i} - \frac{N}{2}\log(2\pi)$$

The second and third terms do not depend on the parameters $(m, b)$ when the $\sigma_{y_i}$ are known and fixed. Dropping them:
$$\log \mathcal{L}(m, b) = -\frac{1}{2} \sum_{i=1}^{N} \frac{[y_i - m x_i - b]^2}{\sigma_{y_i}^2} + \text{const}$$

3. From Likelihood to Chi-Squared
The sum in the log-likelihood is precisely the chi-squared statistic:
$$\chi^2 \equiv \sum_{i=1}^{N} \frac{[y_i - f(x_i)]^2}{\sigma_{y_i}^2}$$

where $f(x_i) = m x_i + b$ is the model prediction. Therefore:
$$\log \mathcal{L}(m, b) = -\frac{1}{2}\chi^2 + K$$where $K$ is a constant independent of $(m, b)$.
The consequences:

- Maximising the likelihood is equivalent to minimising chi-squared—under the assumptions stated above.
- Minimising the negative log-likelihood is equivalent to minimising chi-squared. In optimisation code you will often see a function called `nll` (negative log-likelihood) or `nlp` (negative log-posterior) that is minimised—this is equivalent to maximising the likelihood (or posterior, respectively).
- The likelihood is proportional to $\exp(-\frac{1}{2}\chi^2)$—not to $\chi^2$ itself. This becomes critical when visualising parameter space (Section 6).
The connection breaks down when the assumptions change. If the errors are not Gaussian, or if $\sigma_{y_i}$ contains unknown parameters, the minimum-$\chi^2$ solution is no longer the maximum likelihood solution.
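The equivalence is easy to check numerically. Below is a minimal sketch on synthetic data (the slope, intercept, and noise level are invented for the check—this is not the Hogg dataset): the full log-likelihood and $-\frac{1}{2}\chi^2$ differ by the same constant $K$ at every point in parameter space.

```python
import numpy as np

# Synthetic data, invented purely for this check
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
sigma_y = np.full(x.size, 2.0)
y = 1.5 * x + 3.0 + rng.normal(0.0, sigma_y)

def log_likelihood(m, b):
    """Full Gaussian log-likelihood, constant terms included."""
    resid = y - (m * x + b)
    return (-0.5 * np.sum(resid**2 / sigma_y**2)
            - np.sum(np.log(sigma_y))
            - 0.5 * x.size * np.log(2 * np.pi))

def chi2(m, b):
    resid = y - (m * x + b)
    return np.sum(resid**2 / sigma_y**2)

# log L + chi^2/2 equals the same constant K at any two points (m, b)
K1 = log_likelihood(1.5, 3.0) + 0.5 * chi2(1.5, 3.0)
K2 = log_likelihood(0.2, -7.0) + 0.5 * chi2(0.2, -7.0)
print(np.isclose(K1, K2))  # True: the surfaces differ only by a constant
```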
4. The Matrix Formulation
Most problems in data analysis do not have a nice analytic solution: they require good initial guesses of parameters, complex optimisation schedules, and a robust sampler to approximate the posterior. However, for a linear model, the maximum likelihood solution has an exact, closed-form expression. Setting up the problem in matrix form reveals this structure cleanly.
4.1 The Four Matrices
The data are a column vector $\mathbf{Y}$ of $N$ data points:
$$\mathbf{Y} = \begin{bmatrix} y_1 \\\\ y_2 \\\\ \vdots \\\\ y_N \end{bmatrix}$$

The data covariance matrix $\mathbf{C}$ is an $N \times N$ diagonal matrix (for independent observations):

$$\mathbf{C} = \begin{bmatrix} \sigma_{y_1}^2 & 0 & \cdots & 0 \\\\ 0 & \sigma_{y_2}^2 & \cdots & 0 \\\\ \vdots & & \ddots & \vdots \\\\ 0 & 0 & \cdots & \sigma_{y_N}^2 \end{bmatrix}$$

The design matrix $\mathbf{A}$ is an $N \times P$ matrix, where $P$ is the number of parameters. For a straight line ($P = 2$), each row encodes the model's dependence on the parameters for one data point. The first column of ones corresponds to the intercept $b$; the second column of $x$ values corresponds to the slope $m$:

$$\mathbf{A} = \begin{bmatrix} 1 & x_1 \\\\ 1 & x_2 \\\\ \vdots & \vdots \\\\ 1 & x_N \end{bmatrix}$$

Lastly, the parameter vector $\mathbf{X}$ contains the parameters we are fitting:

$$\mathbf{X} = \begin{bmatrix} b \\\\ m \end{bmatrix}$$

4.2 Chi-Squared in Matrix Form
With these definitions, the chi-squared statistic is:
$$\chi^2 = [\mathbf{Y} - \mathbf{A}\mathbf{X}]^\top \mathbf{C}^{-1} [\mathbf{Y} - \mathbf{A}\mathbf{X}]$$

4.3 The Best-Fit Solution
Differentiating with respect to $\mathbf{X}$ and setting equal to zero yields the normal equations:
$$\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\hat{\mathbf{X}} = \mathbf{A}^\top \mathbf{C}^{-1} \mathbf{Y}$$

Solving:
$$\boxed{\hat{\mathbf{X}} = \left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\right]^{-1} \left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{Y}\right]}$$

This is the weighted least squares solution. For equal weights ($\mathbf{C} = \sigma^2 \mathbf{I}$), it reduces to ordinary least squares.
```python
import numpy as np

x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146])
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344])
y_err = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                  14, 25, 52, 16, 34, 31, 42, 26, 16, 22])

A = np.column_stack([np.ones(len(x)), x])     # design matrix (N × 2)
C_inv = np.diag(1.0 / y_err**2)               # inverse covariance (N × N)
ATCinvA_inv = np.linalg.inv(A.T @ C_inv @ A)  # parameter covariance (2 × 2)
X_hat = ATCinvA_inv @ (A.T @ C_inv @ y)       # best-fit parameters [b, m]
b_fit, m_fit = X_hat

print(f"m = {m_fit:.4f} ± {np.sqrt(ATCinvA_inv[1,1]):.4f}")
print(f"b = {b_fit:.2f} ± {np.sqrt(ATCinvA_inv[0,0]):.2f}")
```

For this dataset: $\hat{m} \approx 1.08$, $\hat{b} \approx 213$.
Don't use `np.linalg.inv` in production. For ill-conditioned systems it fails silently—returning a numerically wrong answer without raising an error. Prefer:

```python
X_hat = np.linalg.lstsq(A.T @ C_inv @ A, A.T @ C_inv @ y, rcond=None)[0]
```

which solves the system directly and can be regularised for ill-conditioned cases.
5. Parameter Uncertainties
The best-fit parameters $\hat{\mathbf{X}}$ are point estimates. The covariance matrix of the parameters is:
$$\text{Cov}(\hat{\mathbf{X}}) = \left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\right]^{-1}$$

For a straight line, the elements are:
$$\left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\right]^{-1} \equiv \begin{bmatrix} \sigma_b^2 & \sigma_{mb} \\\\ \sigma_{mb} & \sigma_m^2 \end{bmatrix}$$

where $\sigma_b^2$ is the variance on the intercept, $\sigma_m^2$ is the variance on the slope, and $\sigma_{mb}$ is the covariance between them. This matrix $\text{Cov}(\hat{\mathbf{X}})$ is already computed as part of the solution above.
```python
sigma_b = np.sqrt(ATCinvA_inv[0, 0])
sigma_m = np.sqrt(ATCinvA_inv[1, 1])
print(f"σ_m = {sigma_m:.4f}, σ_b = {sigma_b:.2f}")
```

Because the likelihood is Gaussian in $(m, b)$ for a linear model, the parameter uncertainty is completely described by $\hat{\mathbf{X}}$ and $\text{Cov}(\hat{\mathbf{X}})$. You can draw samples from this distribution to propagate uncertainty into predictions:
```python
# Draw 1000 parameter samples from p(m, b | data)
samples = np.random.multivariate_normal(X_hat, ATCinvA_inv, size=1000)
b_samples, m_samples = samples.T
```

Under a uniform prior, `np.random.multivariate_normal` here draws from the parameter posterior $p(m, b \mid \mathcal{D})$—each sample is a line consistent with the observations, and pushing the samples through the model propagates the parameter uncertainty into predictions. When we move to non-linear models in lecture 6, MCMC replaces the analytic Gaussian, but the sampling step is identical in concept. Figure 2 shows the best-fit line and 200 draws from the parameter distribution overlaid on the data.

Figure 2: The 20-point dataset with best-fit line (solid) and 200 parameter samples (faint lines). The spread of faint lines represents the uncertainty in the fitted parameters.
6. Visualising Parameter Space
To understand where parameter uncertainty comes from, it helps to visualise the objective function in the $(m, b)$ plane.
6.1 Two Ways to Plot the Same Thing
For each point $(m, b)$ on a grid, compute $\chi^2$ using the full dataset. Then you have two options:
- Plot $\chi^2$ directly — shows a valley, with smaller $\chi^2$ at the best-fit
- Plot $\exp(-\frac{1}{2}\chi^2)$ — shows the likelihood surface, with a peak at the best-fit
These contain identical information but look very different, and the visual impression of uncertainty is very different.
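Concretely, both surfaces can be evaluated on a grid. The sketch below re-enters the data and recomputes the weighted least-squares solution from Section 4 so that it runs on its own; the $\pm 4\sigma$ grid extent is an arbitrary choice for display.

```python
import numpy as np

# Hogg et al. (2010) data from Section 1
x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146])
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344])
y_err = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                  14, 25, 52, 16, 34, 31, 42, 26, 16, 22])

A = np.column_stack([np.ones(len(x)), x])
C_inv = np.diag(1.0 / y_err**2)
cov = np.linalg.inv(A.T @ C_inv @ A)
b_hat, m_hat = cov @ (A.T @ C_inv @ y)

# Grid centred on the best fit, spanning ±4 standard deviations
m_grid = m_hat + np.linspace(-4, 4, 201) * np.sqrt(cov[1, 1])
b_grid = b_hat + np.linspace(-4, 4, 201) * np.sqrt(cov[0, 0])
M, B = np.meshgrid(m_grid, b_grid)

resid = y - (M[..., None] * x + B[..., None])   # shape (201, 201, 20)
chi2 = np.sum(resid**2 / y_err**2, axis=-1)     # the valley

# Subtracting the minimum before exponentiating only rescales by a
# constant, but avoids floating-point underflow when chi^2 is large
likelihood = np.exp(-0.5 * (chi2 - chi2.min()))  # the peak

i, j = np.unravel_index(np.argmin(chi2), chi2.shape)
print(f"grid minimum at m = {M[i, j]:.3f}, b = {B[i, j]:.1f}")
```

Plotting `chi2` gives the valley view and `likelihood` the peak view; the grid minimum coincides with the analytic best fit.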
6.2 Why $\exp(-\frac{1}{2}\chi^2)$ Is Correct
The $n\sigma$ uncertainty ellipse is defined by:
$$\Delta\chi^2 = \chi^2 - \chi^2_\text{min} = \Delta_\text{conf}$$

For two parameters, the 1$\sigma$ ($68.3\%$), 2$\sigma$ ($95.4\%$), and 3$\sigma$ ($99.7\%$) contours correspond to $\Delta\chi^2 = 2.30$, $6.17$, and $11.83$ respectively. This $\Delta\chi^2$ boundary aligns with the uncertainty ellipse from $\text{Cov}(\hat{\mathbf{X}})$—but only on the $\exp(-\frac{1}{2}\chi^2)$ surface, not on the raw $\chi^2$ surface.
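These thresholds follow from the fact that, for two fitted parameters, $\Delta\chi^2$ follows a chi-squared distribution with two degrees of freedom, whose CDF has the closed form $1 - e^{-\Delta/2}$. A quick standard-library check (the enclosed probabilities are rounded to four figures):

```python
import math

# For 2 degrees of freedom, P(Delta chi^2 <= t) = 1 - exp(-t/2),
# so the threshold enclosing probability p is t = -2 ln(1 - p)
thresholds = {}
for n_sigma, p in [(1, 0.6827), (2, 0.9545), (3, 0.9973)]:
    thresholds[n_sigma] = -2.0 * math.log(1.0 - p)
    print(f"{n_sigma} sigma: Delta chi^2 = {thresholds[n_sigma]:.2f}")
```

This reproduces 2.30 and 11.83; the middle threshold evaluates to 6.18, with the commonly tabulated 6.17 differing in the last digit through rounding.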
When you plot raw $\chi^2$, the visual shape of the valley does not directly correspond to probability. When you plot $\exp(-\frac{1}{2}\chi^2)$, the shape of the peak is the probability distribution over parameters.

Figure 3: Left: the raw $\chi^2$ surface over the $(m, b)$ plane. Right: $\exp(-\frac{1}{2}\chi^2) \propto \mathcal{L}$. Both show the same information, but only the right panel correctly represents parameter probability: the 3$\sigma$ covariance ellipse (white) aligns with the $\exp(-\frac{1}{2}\chi^2)$ contour at the appropriate level.
7. Residuals and Model Checking
After fitting, always examine the residuals. The standardised residual for each point is:
$$r_i = \frac{y_i - \hat y_i}{\sigma_{y_i}}$$

where $\hat{y}_i = \hat{m} x_i + \hat{b}$. If the model is correct and the errors are accurately specified, these should look like independent draws from $\mathcal{N}(0, 1)$.
```python
y_fit = m_fit * x + b_fit
residuals = (y - y_fit) / y_err
chi2_dof = np.sum(residuals**2) / (len(x) - 2)
print(f"χ²/dof = {chi2_dof:.2f}")
```
Figure 4: Top: the dataset with the best-fit line. Bottom: standardised residuals. Dashed lines mark $\pm 1\sigma$, dotted lines $\pm 2\sigma$. Points beyond $\pm 2\sigma$ are highlighted.
The reduced chi-squared $\chi^2_\nu = \chi^2 / (N - P)$, where $P$ is the number of fitted parameters, provides an overall check:
- $\chi^2_\nu \approx 1$: model fits data consistent with reported errors
- $\chi^2_\nu \gg 1$: model is too simple, or errors are underestimated
- $\chi^2_\nu \ll 1$: errors are overestimated, or model is over-fitted
Do not interpret reduced chi-squared without also examining residual plots—a value near 1 can still hide systematic structure.
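To make that caveat concrete, here is a deliberately contrived synthetic example: the data follow a parabola, a straight line is fitted, and the quoted errors are tuned so that $\chi^2/\mathrm{dof}$ comes out exactly 1—yet the residuals form long coherent runs that any residual plot would expose.

```python
import numpy as np

# Contrived example: the truth is a parabola, not a line
x = np.linspace(0, 100, 21)
y = 0.01 * (x - 50)**2

# Ordinary least-squares straight-line fit
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Tune the quoted errors so chi^2/dof is exactly 1
sigma = np.sqrt(np.sum(resid**2) / (x.size - 2))
chi2_dof = np.sum((resid / sigma)**2) / (x.size - 2)
print(f"chi2/dof = {chi2_dof:.2f}")  # 1.00, yet the model is badly wrong

# Residuals change sign only twice: long runs, not independent noise
sign_changes = int(np.sum(np.diff(np.sign(resid)) != 0))
print(f"residual sign changes: {sign_changes}")  # independent noise: ~10
```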
8. Communicating What You've Done
Getting the answer is only half the job. You also have to explain it—differently depending on who is listening.
8.1 For a Non-Technical Audience
Lead with the scientific conclusion in plain language, give the number with its uncertainty, and be honest about what the model cannot do.
We found that $y$ increases with $x$ at a rate of $m = 1.08 \pm 0.08$ — meaning each unit increase in $x$ is associated with a 1.08-unit increase in $y$. The relationship appears linear across the range of our data. However, our model is simplified: it ignores the measurement uncertainties in $x$ and assumes every data point follows the same relationship. A more careful analysis would likely change the slope slightly and produce a more realistic uncertainty.
The key discipline here is to not hide the limitations. A result presented without caveats is not more persuasive — it is less trustworthy.
8.2 For a Technical Audience
A technical reader does not need the derivation — they need to know exactly what you did so they can reproduce it, critique it, or build on it. Be succinct and complete.
We fitted a linear model to the 20-point Hogg et al. (2010) dataset using weighted least squares with diagonal covariance (y-errors only). We obtain $m = 1.08 \pm 0.08$ and $b = 213 \pm 14$ ($1\sigma$). The reduced chi-squared is $\chi^2_\nu \approx 16$, far above unity: residual inspection suggests a combination of underestimated uncertainties, intrinsic scatter, and outlier contamination. We assumed a linear relationship, Gaussian and independent measurement errors in $y$ only, with known and correctly reported $\sigma_{y_i}$. The $x$ measurement uncertainties and inter-point correlations present in the data were not propagated; doing so would widen the posterior on $m$ and shift the slope slightly.
8.3 The Assumptions Behind This Result
Every result is conditional on a set of modelling assumptions. Stating them explicitly — not burying them — is what separates a reproducible scientific result from a number.
| Assumption | What it means | Common violation |
|---|---|---|
| Linear relationship | $f(x_i) = mx_i + b$ is the true model | Curvature, saturation, threshold effects |
| No $x$ uncertainty | $x_i$ are known exactly | Parallax distances, photometric redshifts |
| Gaussian $y$ errors | $\varepsilon_i \sim \mathcal{N}(0, \sigma_{y_i}^2)$ | Poisson-dominated data, outlier contamination |
| Known $\sigma_{y_i}$ | Reported errors are correct | Underestimated systematics, calibration errors |
| Independent errors | $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$ | Shared calibration, correlated sky background |
| Correct model | No missing parameters | Intrinsic scatter, hidden covariates, outliers |
9. Uncertainties in the X Direction
The dataset contains $\sigma_x$ and $\rho_{xy}$ that our model so far completely ignores. This section introduces the theoretical framework for handling measurement uncertainty in $x$; the full numerical implementation follows in the next lecture.
9.1 The Problem
When $x$ values have measurement uncertainty, the true $x$ position $\hat{x}_i$ is unknown. Ignoring $\sigma_x$ biases the slope estimate toward zero—a phenomenon called regression dilution or attenuation bias. When $\sigma_x \sim \sigma_y / m$, this bias is significant.

Figure 5: The same dataset shown with both $x$ and $y$ error bars. Ignoring $\sigma_x$ when fitting introduces a systematic bias in the slope.
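Regression dilution is easy to demonstrate with a small simulation. Everything below is synthetic and invented for the demonstration; $y$ is left noiseless to isolate the effect of the $x$ errors.

```python
import numpy as np

rng = np.random.default_rng(0)
m_true, b_true = 2.0, 10.0
x_true = rng.uniform(0, 10, 1000)
y = m_true * x_true + b_true                 # noiseless y on purpose

sigma_x = 2.0
x_obs = x_true + rng.normal(0.0, sigma_x, x_true.size)

# Naive fit against the noisy observed x
slope = np.polyfit(x_obs, y, 1)[0]

# Classical attenuation factor: Var(x) / (Var(x) + sigma_x^2)
predicted = m_true * x_true.var() / (x_true.var() + sigma_x**2)
print(f"true slope {m_true}, fitted {slope:.2f}, predicted {predicted:.2f}")
```

The fitted slope lands well below the true value of 2, close to the classical attenuation prediction of roughly 1.35 for these numbers.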
9.2 Marginalising Over Unknown X Positions
The correct treatment is to marginalise over the unknown true $x$ position for each data point. If the true position $\hat x_i$ is distributed as $\mathcal{N}(x_i, \sigma_{x_i}^2)$, then:
$$p(y_i, x_i \mid \sigma_{x_i}, \sigma_{y_i}, m, b) = \int p(y_i \mid \hat x_i, \sigma_{y_i}, m, b) p(\hat x_i \mid x_i, \sigma_{x_i}) \mathrm{d}\hat x_i$$

The integrand is a Gaussian in $\hat x_i$ multiplied by another Gaussian in $\hat x_i$—and the convolution of two Gaussians is a Gaussian. After carrying out the integral analytically:
$$p(y_i, x_i \mid \sigma_{x_i}, \sigma_{y_i}, m, b) \propto \frac{1}{\sqrt{\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2}} \exp\left(-\frac{[y_i - m x_i - b]^2}{2(\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2)}\right)$$

The $x$ uncertainty contributes to the effective variance as $m^2 \sigma_{x_i}^2$—the $x$ uncertainty is projected onto the $y$ axis through the slope.
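The closed form can be verified by doing the $\hat x_i$ integral numerically for a single illustrative data point (all the numbers below are invented for the check):

```python
import numpy as np

m, b = 1.2, 200.0
x_i, y_i = 150.0, 380.0
sigma_x, sigma_y = 8.0, 20.0

def gauss(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma)**2) / (np.sqrt(2 * np.pi) * sigma)

# Marginalise numerically over the unknown true position x_hat
x_hat = np.linspace(x_i - 10 * sigma_x, x_i + 10 * sigma_x, 20001)
integrand = gauss(y_i, m * x_hat + b, sigma_y) * gauss(x_hat, x_i, sigma_x)
numeric = np.sum(integrand) * (x_hat[1] - x_hat[0])

# Closed form: a Gaussian with effective variance sigma_y^2 + m^2 sigma_x^2
s_eff = np.sqrt(sigma_y**2 + m**2 * sigma_x**2)
analytic = gauss(y_i, m * x_i + b, s_eff)

print(np.isclose(numeric, analytic, rtol=1e-6))  # True
```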
9.3 The Extended Log-Likelihood
The log-likelihood extended to include $x$ uncertainties is:
$$\log \mathcal{L}(m, b) = -\frac{1}{2}\sum_{i=1}^{N} \frac{[y_i - m x_i - b]^2}{\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2} - \frac{1}{2}\sum_{i=1}^{N}\log(\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2) + \text{const}$$

Two important consequences follow:
The $\log(\sigma_{y_i}^2 + m^2\sigma_{x_i}^2)$ terms cannot be dropped. Unlike the $y$-only case where the normalisation was a constant, here the effective variance depends on $m$—so its logarithm is part of the objective function.
The solution is no longer linear in the parameters. The $m^2$ dependence in the denominator means the normal equations no longer apply. We must resort to numerical optimisation, which is the topic of the next lecture.
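As a preview of that numerical approach, the extended objective itself is short to write down. The sketch below applies it to the Hogg data from Section 1, with a crude grid search standing in for the proper optimisers of the next lecture (the grid ranges are an arbitrary choice):

```python
import numpy as np

x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146])
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344])
y_err = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                  14, 25, 52, 16, 34, 31, 42, 26, 16, 22])
x_err = np.array([9, 4, 11, 7, 5, 9, 4, 4, 11, 7,
                  5, 5, 5, 6, 6, 5, 9, 8, 6, 5])

def neg_log_likelihood(m, b):
    """Extended objective: the log(var) term cannot be dropped."""
    var_eff = y_err**2 + m**2 * x_err**2   # effective variance depends on m
    resid = y - (m * x + b)
    return 0.5 * np.sum(resid**2 / var_eff + np.log(var_eff))

# Crude grid search; lecture 6 replaces this with MAP optimisation and MCMC
m_grid = np.linspace(0.5, 2.5, 101)
b_grid = np.linspace(50.0, 350.0, 151)
nll = np.array([[neg_log_likelihood(m, b) for b in b_grid] for m in m_grid])
i, j = np.unravel_index(np.argmin(nll), nll.shape)
print(f"m ≈ {m_grid[i]:.2f}, b ≈ {b_grid[j]:.2f}")
```

For this dataset the minimum stays close to the $y$-only solution, because the $m^2\sigma_{x_i}^2$ terms are small compared with $\sigma_{y_i}^2$—but the objective is now genuinely non-linear in $m$.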
10. Summary
The likelihood $\mathcal{L}(m, b) = \prod_i p(y_i \mid x_i, \sigma_{y_i}, m, b)$ is the objective function. For Gaussian errors with known $\sigma_{y_i}$, maximising likelihood is exactly equivalent to minimising chi-squared.
The matrix formulation gives the exact solution for the $y$-only case:
$$\hat{\mathbf{X}} = \left[\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}\right]^{-1} \mathbf{A}^\top \mathbf{C}^{-1} \mathbf{Y}$$

with parameter covariance $\text{Cov}(\hat{\mathbf{X}}) = [\mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}]^{-1}$.
Visualising parameter space correctly requires plotting $\exp(-\frac{1}{2}\chi^2)$, not raw $\chi^2$. The covariance ellipse from $\text{Cov}(\hat{\mathbf{X}})$ aligns with the likelihood contours, not the chi-squared contours.
Residuals are the diagnostic tool. Examine them for systematic structure after fitting. The reduced chi-squared $\chi^2_\nu$ provides an overall check but is not a substitute for visual residual inspection.
X uncertainties break the analytic solution. After marginalising over unknown true $x$ positions, the effective variance becomes $\sigma_{y_i}^2 + m^2\sigma_{x_i}^2$, and the log-normalisation term can no longer be dropped. The next lecture covers numerical methods (MAP estimation and MCMC) for solving the resulting non-linear problem, along with further extensions.
Communicating results requires matching the level of detail to the audience. For a non-technical reader: one sentence on the finding, one number with uncertainty, one honest caveat. For a technical reader: likelihood function, inference method, parameter estimates, model check, and a table of assumptions.
Further Reading
- Hogg, D. W., Bovy, J., & Lang, D. (2010). Data analysis recipes: Fitting a model to data. arXiv:1008.4686. [The canonical reference for this lecture—work through it in full]
- Press, W. H., et al. (2007). Numerical Recipes, Chapter 15. [Classical treatment of least-squares fitting and chi-squared]
- Andrae, R., Schulze-Hartung, T., & Melchior, P. (2010). Dos and don'ts of reduced chi-squared. arXiv:1012.3754.