Linear Models: Beyond Straight Lines

A “linear model” does not mean a straight line. It means the model is linear in its parameters – and that single condition unlocks exact closed-form solutions, automatic uncertainty propagation, and fast computation for an enormous class of problems.

You already know how to fit a straight line $y = mx + b$ to data using weighted least squares. This lecture generalises that machinery. We will see that polynomials, trigonometric series, radial basis function expansions, and principal component models are all linear models – they all share the same matrix equation, the same normal equations, and the same closed-form solution. The only thing that changes is the design matrix.


1. What “Linear” Really Means

A model is linear if the predicted values are a linear combination of the parameters. Concretely:

$$\mathbf{y} = \mathbf{A}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$$

where $\mathbf{y}$ is the $N \times 1$ data vector, $\mathbf{A}$ is the $N \times P$ design matrix, $\boldsymbol{\theta}$ is the $P \times 1$ parameter vector, and $\boldsymbol{\varepsilon}$ is noise. The key requirement is that $\mathbf{A}$ does not depend on $\boldsymbol{\theta}$ – it can be any function of the independent variables $x_i$, but the parameters $\boldsymbol{\theta}$ enter only as a linear combination.

This means the following are all linear models:

  • Polynomial: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_d x^d$ (a degree-$d$ polynomial has $P = d+1$ parameters)
  • Trigonometric: $f(x) = \theta_0 + \theta_1 \sin(\omega x) + \theta_2 \cos(\omega x) + \cdots$
  • Radial basis functions: $f(x) = \theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \cdots$ where $\phi_j(x) = \exp\left(-\frac{|x - c_j|^2}{2\ell^2}\right)$
  • Spherical harmonics: $f(\vartheta, \varphi) = \sum_{\ell,m} \theta_{\ell m} Y_\ell^m(\vartheta, \varphi)$ — for data on a sphere (e.g., CMB maps, stellar surfaces, gravitational fields)
  • B-splines: piecewise polynomials joined smoothly at knot points — extremely common for smooth interpolation
  • Principal components: $f(\lambda) = \theta_1 \text{PC}_1(\lambda) + \theta_2 \text{PC}_2(\lambda) + \cdots$

None of these are “straight lines”, but they are all linear in $\boldsymbol{\theta}$.

⚠️
Not all models are linear. The model $f(x) = A\exp(-x/\tau)$ is not linear in $\tau$, even though it is linear in $A$. The model $f(x) = A\sin(\omega x + \phi)$ is not linear in $\omega$ or $\phi$, but if you fix $\omega$ and expand using the identity $A\sin(\omega x + \phi) = a\sin(\omega x) + b\cos(\omega x)$, it becomes linear in $a$ and $b$. The distinction matters: linear models have exact solutions; nonlinear models require iterative optimisation.
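
To see the linearisation trick in code, here is a minimal sketch: $\omega$ is assumed known, all numbers are illustrative, and the nonlinear amplitude and phase are recovered from the linear coefficients afterwards.

```python
import numpy as np

# Sketch: fit A*sin(omega*x + phi) with omega fixed, by linearising to
# a*sin(omega*x) + b*cos(omega*x). All numbers here are illustrative.
rng = np.random.default_rng(0)
omega = 2 * np.pi
x = np.linspace(0, 1, 100)
A_true, phi_true = 1.5, 0.7
y = A_true * np.sin(omega * x + phi_true) + rng.normal(0, 0.05, x.size)

# Linear design matrix in the new parameters (a, b)
M = np.column_stack([np.sin(omega * x), np.cos(omega * x)])
(a, b), *_ = np.linalg.lstsq(M, y, rcond=None)

# Recover the nonlinear parameters from a = A*cos(phi), b = A*sin(phi)
A_hat = np.hypot(a, b)
phi_hat = np.arctan2(b, a)
```

Note that this only works because $\omega$ was fixed; if $\omega$ were also free, the problem would be genuinely nonlinear.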

Figure 1

Figure 1: Three models that are all “linear models” despite not being straight lines. (a) A cubic polynomial, (b) a sum of sines and cosines, (c) a sum of radial basis functions. In each case, the model is a linear combination of known basis functions – only the coefficients are unknown.


2. The Design Matrix

The design matrix $\mathbf{A}$ is the central object. Each row corresponds to one data point, and each column corresponds to one basis function evaluated at all data points. The model prediction is always $\hat{\mathbf{y}} = \mathbf{A}\boldsymbol{\theta}$.

2.1 Polynomial Basis

For a polynomial of degree $d$ evaluated at $N$ data points:

$$\mathbf{A}_\text{poly} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^d \end{bmatrix}$$

This is a Vandermonde matrix with $P = d + 1$ columns. The parameter vector $\boldsymbol{\theta} = [\theta_0, \theta_1, \ldots, \theta_d]^\top$ gives the polynomial coefficients.

import numpy as np

def design_polynomial(x, degree):
    """Build the Vandermonde design matrix for a polynomial of given degree."""
    return np.column_stack([x**k for k in range(degree + 1)])

# Example: cubic polynomial (4 columns: 1, x, x^2, x^3)
x = np.linspace(0, 1, 20)
A = design_polynomial(x, degree=3)   # shape (20, 4)

2.2 Fourier (Trigonometric) Basis

For a truncated Fourier series with $K$ harmonics:

$$\mathbf{A}_\text{Fourier} = \begin{bmatrix} 1 & \sin(\omega x_1) & \cos(\omega x_1) & \sin(2\omega x_1) & \cos(2\omega x_1) & \cdots \\ 1 & \sin(\omega x_2) & \cos(\omega x_2) & \sin(2\omega x_2) & \cos(2\omega x_2) & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

This has $P = 2K + 1$ columns (one constant, plus a sine and cosine for each harmonic).

def design_fourier(x, n_harmonics, omega=2*np.pi):
    """Build the design matrix for a truncated Fourier series."""
    columns = [np.ones_like(x)]
    for k in range(1, n_harmonics + 1):
        columns.append(np.sin(k * omega * x))
        columns.append(np.cos(k * omega * x))
    return np.column_stack(columns)

2.3 Radial Basis Functions (RBFs)

RBFs place Gaussian bumps at fixed centres $c_j$ with a fixed width $\ell$:

$$\phi_j(x) = \exp\left(-\frac{(x - c_j)^2}{2\ell^2}\right)$$

The design matrix has one column per centre:

def design_rbf(x, centres, length_scale):
    """Build the design matrix for Gaussian radial basis functions."""
    return np.column_stack([
        np.exp(-0.5 * ((x - c) / length_scale)**2) for c in centres
    ])

2.4 Spherical Harmonics

For data defined on a sphere (sky maps, stellar surface models, gravitational potential fields), the natural basis functions are the spherical harmonics $Y_\ell^m(\vartheta, \varphi)$, where $\ell$ is the degree and $m$ is the order ($-\ell \le m \le \ell$). A model truncated at maximum degree $L$ has $P = (L+1)^2$ parameters:

$$f(\vartheta, \varphi) = \sum_{\ell=0}^{L} \sum_{m=-\ell}^{\ell} \theta_{\ell m} Y_\ell^m(\vartheta, \varphi)$$

This is a linear model — the $Y_\ell^m$ are known functions and only the $\theta_{\ell m}$ are free. The spherical harmonics are orthogonal over the sphere, so they share the same computational advantages as the Fourier basis: on a full-sky uniform grid, $\mathbf{A}^\top\mathbf{A}$ is approximately diagonal.

Spherical harmonics are ubiquitous in physics: the CMB power spectrum is defined in terms of spherical harmonic coefficients, gravitational multipole expansions use them, and stellar pulsation modes are described by them.

from scipy.special import sph_harm

def design_spherical_harmonics(theta, phi, L_max):
    """Build design matrix for real spherical harmonics up to degree L_max.

    Here theta is the colatitude and phi the azimuth. Note that SciPy's
    sph_harm takes its angles in the order (m, ell, azimuth, colatitude).
    """
    columns = []
    for ell in range(L_max + 1):
        for m in range(-ell, ell + 1):
            Y = sph_harm(abs(m), ell, phi, theta)
            if m < 0:
                columns.append(np.sqrt(2) * Y.imag)
            elif m == 0:
                columns.append(Y.real)
            else:
                columns.append(np.sqrt(2) * Y.real)
    return np.column_stack(columns)

2.5 B-Splines

B-splines are piecewise polynomials joined smoothly at specified knot points. A B-spline basis of degree $k$ consists of local, compactly supported functions — each one is nonzero over only $k+1$ knot intervals. This makes the design matrix $\mathbf{A}$ sparse (mostly zeros), which is excellent for large datasets.

B-splines are particularly useful for modelling smooth, non-periodic functions where you want local control — changing a coefficient only affects the fit near that knot, not globally. They are widely used in spectral continuum fitting and instrument calibration.

from scipy.interpolate import BSpline

def design_bspline(x, knots, degree=3):
    """Build design matrix for B-spline basis."""
    n_basis = len(knots) - degree - 1
    columns = []
    for i in range(n_basis):
        c = np.zeros(n_basis)
        c[i] = 1.0
        spline = BSpline(knots, c, degree)
        columns.append(spline(x))
    return np.column_stack(columns)

2.6 Reduced-Order Models / Principal Components

In spectroscopy, you may have a set of template spectra (e.g., from PCA of a training set). Each observed spectrum is modelled as a linear combination of these templates:

$$f(\lambda) = \sum_{j=1}^{P} \theta_j \text{PC}_j(\lambda)$$

The design matrix has one column per principal component, evaluated at the observed wavelengths. This is widely used in photometric redshift estimation and spectral classification.
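
A minimal sketch of the idea, using random stand-in data in place of a real training set (all array names and sizes here are illustrative):

```python
import numpy as np

# Sketch: build a PCA design matrix from a training set of spectra
# (rows = spectra, columns = wavelength pixels). The data here are
# random stand-ins; real templates would come from observed spectra.
rng = np.random.default_rng(1)
n_train, n_pix, n_components = 200, 500, 5
training_spectra = rng.normal(size=(n_train, n_pix))

mean_spec = training_spectra.mean(axis=0)
_, _, Vt = np.linalg.svd(training_spectra - mean_spec, full_matrices=False)
A_pca = Vt[:n_components].T      # design matrix, shape (n_pix, n_components)

# Fit a new (mean-subtracted) spectrum as a linear combination of PCs
new_spec = rng.normal(size=n_pix)
theta, *_ = np.linalg.lstsq(A_pca, new_spec - mean_spec, rcond=None)
```

Once the templates are fixed, fitting each new spectrum is an ordinary linear least-squares solve.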

Figure 2

Figure 2: Visualisation of design matrices. Left: polynomial basis (degree 5, so $P = 6$ columns). Right: Fourier basis ($K = 5$ harmonics, so $P = 11$ columns). Each column represents a basis function evaluated at the data points. The structure of $\mathbf{A}$ determines the flexibility and properties of the model.


3. The Normal Equations

In lecture 5, we derived the weighted least-squares solution for a straight line. The derivation generalises immediately to any linear model.

3.1 The Objective Function

The chi-squared statistic for a linear model with data covariance $\mathbf{C}$ is:

$$\chi^2(\boldsymbol{\theta}) = (\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^\top \mathbf{C}^{-1} (\mathbf{y} - \mathbf{A}\boldsymbol{\theta})$$

This is a quadratic function of $\boldsymbol{\theta}$.

3.2 Derivation

To find the minimum, expand and differentiate with respect to $\boldsymbol{\theta}$:

$$\chi^2 = \mathbf{y}^\top\mathbf{C}^{-1}\mathbf{y} - 2\boldsymbol{\theta}^\top\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{y} + \boldsymbol{\theta}^\top\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}\boldsymbol{\theta}$$

Setting $\frac{\partial \chi^2}{\partial \boldsymbol{\theta}} = 0$:

$$-2\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{y} + 2\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}\boldsymbol{\theta} = 0$$

These are the normal equations:

$$\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}\hat{\boldsymbol{\theta}} = \mathbf{A}^\top\mathbf{C}^{-1}\mathbf{y}$$

Solving:

$$\boxed{\hat{\boldsymbol{\theta}} = \left(\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}\right)^{-1}\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{y}}$$

The parameter covariance matrix is:

$$\boxed{\boldsymbol{\Sigma}_\theta = \left(\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}\right)^{-1}}$$

ℹ️
This is simultaneously the maximum likelihood solution AND the posterior mean under a flat prior. Because the log-likelihood is quadratic in $\boldsymbol{\theta}$, the posterior $p(\boldsymbol{\theta} \mid \mathbf{y}) \propto \exp(-\frac{1}{2}\chi^2)$ is a multivariate Gaussian with mean $\hat{\boldsymbol{\theta}}$ and covariance $\boldsymbol{\Sigma}_\theta$. The mode, mean, and median of a Gaussian are all the same point. The solution is exact – no iteration, no initialisation, no convergence checks.

3.3 Python Implementation

import numpy as np

def fit_linear_model(A, y, C_inv):
    """
    Solve the normal equations for a linear model.

    Parameters
    ----------
    A : (N, P) design matrix
    y : (N,) data vector
    C_inv : (N, N) inverse data covariance matrix

    Returns
    -------
    theta_hat : (P,) best-fit parameters
    Sigma_theta : (P, P) parameter covariance
    """
    # Normal equations: (A^T C^{-1} A) theta = A^T C^{-1} y
    M = A.T @ C_inv @ A             # (P, P)
    rhs = A.T @ C_inv @ y           # (P,)
    Sigma_theta = np.linalg.inv(M)  # parameter covariance
    theta_hat = Sigma_theta @ rhs   # best-fit parameters
    return theta_hat, Sigma_theta

# Example: fit a cubic polynomial to noisy data
np.random.seed(42)
x = np.sort(np.random.uniform(0, 5, 25))
y_true = 2.0 - 0.5*x + 0.3*x**2 - 0.02*x**3
y = y_true + np.random.normal(0, 0.5, len(x))
y_err = 0.5 * np.ones(len(x))

A = design_polynomial(x, degree=3)
C_inv = np.diag(1.0 / y_err**2)
theta_hat, Sigma_theta = fit_linear_model(A, y, C_inv)

for k, (val, err) in enumerate(zip(theta_hat, np.sqrt(np.diag(Sigma_theta)))):
    print(f"theta_{k} = {val:.4f} +/- {err:.4f}")
⚠️
Use np.linalg.lstsq or np.linalg.solve instead of np.linalg.inv in production. Matrix inversion is numerically unstable for ill-conditioned systems. The implementation above is pedagogically clear but should be replaced with:

theta_hat = np.linalg.solve(M, rhs)

which solves the linear system directly without forming the inverse.


4. Orthogonality and Why It Matters

4.1 The Ideal Case

When the columns of $\mathbf{A}$ are orthogonal with respect to the data covariance (i.e., $\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}$ is diagonal), the normal equations decouple. Each parameter $\theta_j$ can be estimated independently of all the others:

$$\hat{\theta}_j = \frac{\mathbf{a}_j^\top \mathbf{C}^{-1} \mathbf{y}}{\mathbf{a}_j^\top \mathbf{C}^{-1} \mathbf{a}_j}$$

where $\mathbf{a}_j$ is the $j$-th column of $\mathbf{A}$.

This has three consequences:

  1. Speed: No matrix inversion needed – just $P$ scalar divisions.
  2. Interpretability: Changing or removing one basis function does not change the coefficients of the others.
  3. Numerical stability: Diagonal systems are perfectly conditioned.
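
A quick sketch of this decoupling (assuming $\mathbf{C} = \mathbf{I}$ and an evenly sampled Fourier basis, for which the discrete orthogonality is exact):

```python
import numpy as np

# Sketch: with C = I and an evenly sampled Fourier basis, the discrete
# orthogonality is exact, so per-column projections reproduce the full
# normal-equation solution without any matrix inversion.
N, K = 64, 3
x = np.arange(N) / N                 # evenly spaced over one period
columns = [np.ones(N)]
for k in range(1, K + 1):
    columns += [np.sin(2 * np.pi * k * x), np.cos(2 * np.pi * k * x)]
A = np.column_stack(columns)

rng = np.random.default_rng(2)
y = 1.0 + 0.5 * np.sin(2 * np.pi * x) + rng.normal(0, 0.1, N)

theta_full = np.linalg.solve(A.T @ A, A.T @ y)         # full solution
theta_proj = np.array([a @ y / (a @ a) for a in A.T])  # P scalar divisions
```

The two estimates agree to round-off, and the projection route never forms or inverts a matrix.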

4.2 When Does This Happen?

The Fourier basis is exactly orthogonal when the data are evenly spaced over one period. This is why the Fast Fourier Transform (FFT) is so powerful – it exploits this orthogonality to compute $N$ Fourier coefficients in $O(N \log N)$ time.

Polynomial bases, on the other hand, become severely ill-conditioned as the degree increases. The columns $[1, x, x^2, \ldots]$ become nearly linearly dependent, making $\mathbf{A}^\top\mathbf{A}$ nearly singular. This is one reason why high-degree polynomial fitting is numerically dangerous.

Figure 3

Figure 3: The matrix $\mathbf{A}^\top\mathbf{A}$ for two different bases. Left: Fourier basis on evenly-spaced data – the matrix is nearly diagonal, meaning the parameters are independent. Right: polynomial basis – the matrix has large off-diagonal elements, meaning the parameters are strongly correlated and the system is ill-conditioned.

ℹ️
Orthogonal polynomials. You can construct polynomial-like bases that are orthogonal over a given set of data points. Legendre, Chebyshev, and Hermite polynomials are classical examples. NumPy’s np.polynomial.legendre.legvander builds a Vandermonde matrix using Legendre polynomials, which is much better conditioned than the standard monomial basis.

5. Numerical Conditioning: Why Polynomials Go Wrong

5.1 The Vandermonde Disaster

The polynomial design matrix $\mathbf{A}_\text{poly}$ is a Vandermonde matrix, and Vandermonde matrices are notorious for catastrophic ill-conditioning. The problem worsens rapidly with polynomial degree.

Consider fitting a degree-15 polynomial to data spanning $x \in [0, 100]$. The design matrix has columns $[1, x, x^2, \ldots, x^{15}]$. The first column has entries of order $1$; the last column has entries of order $100^{15} = 10^{30}$. The ratio of the largest to the smallest singular value (the condition number $\kappa$) can exceed $10^{20}$.

When $\kappa$ is large, the solution of the normal equations loses approximately $\log_{10}(\kappa)$ digits of precision. With double-precision floating point giving ~16 significant digits, a condition number of $10^{16}$ or larger means the solution has zero reliable digits. The matrix equation is solved, but the answer is garbage — this is catastrophic cancellation.

# Demonstrate catastrophic conditioning
x = np.linspace(0, 100, 50)
for degree in [5, 10, 15, 20]:
    A = np.column_stack([x**k for k in range(degree + 1)])
    kappa = np.linalg.cond(A)
    print(f"Degree {degree:2d}: condition number = {kappa:.1e}")
# Degree  5: condition number ≈ 1e+10
# Degree 10: condition number ≈ 1e+21
# Degree 15: condition number ≈ 1e+32  ← beyond double precision!
# Degree 20: condition number ≈ 1e+43
⚠️
This is a silent failure. NumPy will happily return a “solution” even when the condition number is $10^{30}$ — it does not raise an error. The numbers you get back look reasonable but are dominated by round-off error. Always check np.linalg.cond(A) before trusting a polynomial fit. If $\kappa \gtrsim 10^{12}$, your results are suspect.

5.2 The Fix: Centre and Scale

The simplest remedy is to normalise the input data to a well-behaved range before constructing the design matrix. Map $x$ to the interval $[-1, 1]$:

$$\tilde{x}_i = \frac{2(x_i - x_\min)}{x_\max - x_\min} - 1$$

Now the columns $[\tilde{x}^0, \tilde{x}^1, \ldots, \tilde{x}^d]$ all have entries in $[-1, 1]$, and the condition number drops dramatically:

# Compare condition numbers: raw vs normalised
x = np.linspace(0, 100, 50)
x_tilde = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # map to [-1, 1]

for degree in [5, 10, 15, 20]:
    A_raw = np.column_stack([x**k for k in range(degree + 1)])
    A_norm = np.column_stack([x_tilde**k for k in range(degree + 1)])
    print(f"Degree {degree:2d}: raw κ = {np.linalg.cond(A_raw):.1e}, "
          f"normalised κ = {np.linalg.cond(A_norm):.1e}")
# Degree  5: raw κ ≈ 1e+10, normalised κ ≈ 1e+02
# Degree 10: raw κ ≈ 1e+21, normalised κ ≈ 1e+04
# Degree 15: raw κ ≈ 1e+32, normalised κ ≈ 1e+06
# Degree 20: raw κ ≈ 1e+43, normalised κ ≈ 1e+08

This is a massive improvement — but even with normalisation, the monomial basis still becomes ill-conditioned at high degree.

5.3 The Better Fix: Use Orthogonal Polynomials

The real solution is to replace the monomial basis $\{1, x, x^2, \ldots\}$ with an orthogonal polynomial basis. Chebyshev polynomials $T_k(\tilde{x})$ and Legendre polynomials $P_k(\tilde{x})$ are defined on $[-1, 1]$ and are orthogonal (Legendre under a uniform weight, Chebyshev under the weight $1/\sqrt{1 - \tilde{x}^2}$), so $\mathbf{A}^\top\mathbf{A}$ stays well conditioned for well-spread data. The condition number stays small even at high degree.

# Orthogonal polynomial basis using NumPy
from numpy.polynomial import chebyshev, legendre

x = np.linspace(0, 100, 50)
x_tilde = 2 * (x - x.min()) / (x.max() - x.min()) - 1

degree = 20
A_cheb = chebyshev.chebvander(x_tilde, degree)   # Chebyshev Vandermonde
A_leg = legendre.legvander(x_tilde, degree)       # Legendre Vandermonde

print(f"Monomial κ  = {np.linalg.cond(np.column_stack([x_tilde**k for k in range(degree+1)])):.1e}")
print(f"Chebyshev κ = {np.linalg.cond(A_cheb):.1e}")
print(f"Legendre κ  = {np.linalg.cond(A_leg):.1e}")

5.4 Worked Example: Watching It Break

The condition numbers above are abstract. Let us see what actually happens when you fit real data. We generate 40 data points from a smooth function on $x \in [0, 100]$ and attempt a degree-15 polynomial fit using three different approaches.

import numpy as np
import matplotlib.pyplot as plt
from numpy.polynomial import chebyshev, legendre

# -----------------------------------------------------------
# 1. Generate fake data
# -----------------------------------------------------------
np.random.seed(42)
N = 40
x = np.sort(np.random.uniform(0, 100, N))
y_true = np.sin(x / 15) + 0.3 * np.cos(x / 5) + 0.5    # smooth, wiggly truth
sigma = 0.15
y = y_true + np.random.normal(0, sigma, N)

degree = 15   # high enough to cause trouble with raw monomials

# -----------------------------------------------------------
# 2a. Raw monomials on the original x (catastrophic)
# -----------------------------------------------------------
A_raw = np.column_stack([x**k for k in range(degree + 1)])
print(f"Raw monomial condition number: {np.linalg.cond(A_raw):.1e}")

# Solve using lstsq (the most numerically careful solver available)
theta_raw, *_ = np.linalg.lstsq(A_raw, y, rcond=None)

# Evaluate on a fine grid
x_fine = np.linspace(0, 100, 500)
A_fine_raw = np.column_stack([x_fine**k for k in range(degree + 1)])
y_fit_raw = A_fine_raw @ theta_raw

# -----------------------------------------------------------
# 2b. Normalised monomials: map x -> [-1, 1]
# -----------------------------------------------------------
x_tilde = 2 * (x - x.min()) / (x.max() - x.min()) - 1
x_fine_tilde = 2 * (x_fine - x.min()) / (x.max() - x.min()) - 1

A_norm = np.column_stack([x_tilde**k for k in range(degree + 1)])
print(f"Normalised monomial condition number: {np.linalg.cond(A_norm):.1e}")

theta_norm, *_ = np.linalg.lstsq(A_norm, y, rcond=None)

A_fine_norm = np.column_stack([x_fine_tilde**k for k in range(degree + 1)])
y_fit_norm = A_fine_norm @ theta_norm

# -----------------------------------------------------------
# 2c. Chebyshev polynomials on [-1, 1]
# -----------------------------------------------------------
A_cheb = chebyshev.chebvander(x_tilde, degree)
print(f"Chebyshev condition number: {np.linalg.cond(A_cheb):.1e}")

theta_cheb, *_ = np.linalg.lstsq(A_cheb, y, rcond=None)

A_fine_cheb = chebyshev.chebvander(x_fine_tilde, degree)
y_fit_cheb = A_fine_cheb @ theta_cheb

# -----------------------------------------------------------
# 3. Plot the results
# -----------------------------------------------------------
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)

for ax, y_fit, label, kappa in zip(
    axes,
    [y_fit_raw, y_fit_norm, y_fit_cheb],
    ["Raw monomials", "Normalised monomials", "Chebyshev polynomials"],
    [np.linalg.cond(A_raw), np.linalg.cond(A_norm), np.linalg.cond(A_cheb)],
):
    ax.errorbar(x, y, yerr=sigma, fmt="o", ms=4, color="0.3", zorder=5)
    ax.plot(x_fine, np.sin(x_fine / 15) + 0.3 * np.cos(x_fine / 5) + 0.5,
            "k--", lw=1, label="truth")
    ax.plot(x_fine, y_fit, "C0-", lw=2, label="fit")
    ax.set_title(f"{label}\n$\\kappa = ${kappa:.1e}")
    ax.set_xlabel("$x$")
    ax.set_ylim(-1, 3)
    ax.legend(fontsize=8)

axes[0].set_ylabel("$y$")
fig.tight_layout()
plt.savefig("polynomial_conditioning_demo.png", dpi=150)
plt.show()

When you run this, you will see three very different outcomes from what should be the same fit:

  • Left panel (raw monomials): The condition number is $\sim 10^{32}$, far beyond double-precision limits. The “fit” oscillates wildly or produces nonsense — the coefficients are dominated by round-off error. NumPy does not warn you.
  • Middle panel (normalised monomials): Mapping $x$ to $[-1, 1]$ drops $\kappa$ to $\sim 10^{6}$, which is tractable. The fit tracks the true function well.
  • Right panel (Chebyshev polynomials): The condition number is $\sim 10^{1}$. The fit is virtually identical to the normalised case, but the numerical solution is even more reliable.

The normalised monomials and Chebyshev fits are visually indistinguishable — both recover the truth. The raw monomial fit is garbage. This is not a theoretical concern: it is exactly the kind of silent bug that corrupts results in real research.

Figure: polynomial conditioning demo

Figure: Degree-15 polynomial fit to 40 data points using three bases. Left: raw monomials on $x \in [0, 100]$ — $\kappa \approx 10^{32}$, catastrophic round-off error, the “fit” is nonsense. Middle: normalised monomials with $x$ mapped to $[-1, 1]$ — $\kappa \approx 10^{6}$, the fit recovers the truth. Right: Chebyshev polynomials — $\kappa \approx 10^{1}$, near-perfect conditioning. The normalised and Chebyshev results are visually identical; the raw monomial result is not.

ℹ️
Practical rule. If you are fitting polynomials: (1) always normalise $x$ to $[-1, 1]$, (2) prefer Chebyshev or Legendre polynomials over raw monomials, and (3) check the condition number. These steps cost nothing and prevent silent numerical disasters. The same principle applies to any basis: normalise inputs to a sensible range before constructing the design matrix.

6. Underdetermined Systems ($P > N$)

When the number of parameters $P$ exceeds the number of data points $N$, the matrix $\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}$ is singular (rank at most $N < P$). The normal equations have infinitely many solutions – the system is underdetermined.

Geometrically, this means there is a $(P - N)$-dimensional subspace of parameter vectors that all fit the data perfectly. The minimum-norm solution (via the pseudoinverse) picks the parameter vector with the smallest $\|\boldsymbol{\theta}\|$. For the unweighted case ($\mathbf{C} = \mathbf{I}$):

$$\hat{\boldsymbol{\theta}}_\text{min-norm} = \mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{y}$$

But using the minimum-norm solution without thinking is dangerous: it may not correspond to anything physically meaningful, and it is extremely sensitive to noise. What you really need in this situation is regularisation.
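
A short sketch of the minimum-norm solution via the pseudoinverse (random illustrative data), checking it against the closed-form expression above:

```python
import numpy as np

# Sketch: the minimum-norm solution of an underdetermined system
# (P > N), computed with the pseudoinverse. Data are illustrative.
rng = np.random.default_rng(3)
N, P = 10, 25
A = rng.normal(size=(N, P))
y = rng.normal(size=N)

theta_mn = np.linalg.pinv(A) @ y                    # minimum-norm fit
theta_formula = A.T @ np.linalg.solve(A @ A.T, y)   # A^T (A A^T)^{-1} y
# Both reproduce the data exactly: A @ theta_mn == y (to round-off)
```

Any vector in the null space of $\mathbf{A}$ could be added to `theta_mn` without changing the fit at all, which is exactly why a prior (regularisation) is needed to make the answer meaningful.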


7. Regularisation as a Prior

7.1 Ridge Regression = Gaussian Prior

If you have more basis functions than you strictly need, the unregularised fit will pass through every data point and oscillate wildly between them. This is overfitting.

Ridge regression adds a penalty on the size of the parameters:

$$\chi^2_\text{reg}(\boldsymbol{\theta}) = (\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^\top \mathbf{C}^{-1} (\mathbf{y} - \mathbf{A}\boldsymbol{\theta}) + \lambda \boldsymbol{\theta}^\top\boldsymbol{\theta}$$

The second term penalises large parameter values. The regularised normal equations are:

$$\boxed{\hat{\boldsymbol{\theta}}_\text{ridge} = \left(\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A} + \lambda\mathbf{I}\right)^{-1}\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{y}}$$
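
A minimal sketch of the regularised solve (illustrative data; `fit_ridge` is not a library function), showing the shrinkage of the parameter norm as $\lambda$ grows:

```python
import numpy as np

# Sketch: ridge solve, and the shrinkage of |theta| as lambda grows.
# fit_ridge is illustrative, not a library function.
def fit_ridge(A, y, C_inv, lam):
    M = A.T @ C_inv @ A + lam * np.eye(A.shape[1])
    return np.linalg.solve(M, A.T @ C_inv @ y)

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 30)
A = np.column_stack([x**k for k in range(10)])      # degree-9 monomials
y = np.sin(3 * x) + rng.normal(0, 0.1, x.size)
C_inv = np.eye(x.size) / 0.1**2

norms = [np.linalg.norm(fit_ridge(A, y, C_inv, lam))
         for lam in (0.0, 1.0, 100.0)]              # shrinks with lambda
```

The parameter norm decreases monotonically with $\lambda$ — the penalty pulls every coefficient toward zero, which is exactly the prior interpretation developed next.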

7.2 The Bayesian Interpretation

This is exactly the MAP estimate under a Gaussian prior on the parameters:

$$p(\boldsymbol{\theta}) = \mathcal{N}(\mathbf{0}, \lambda^{-1}\mathbf{I})$$

The log-posterior is:

$$\log p(\boldsymbol{\theta} \mid \mathbf{y}) = -\frac{1}{2}\left[(\mathbf{y} - \mathbf{A}\boldsymbol{\theta})^\top \mathbf{C}^{-1} (\mathbf{y} - \mathbf{A}\boldsymbol{\theta}) + \lambda\boldsymbol{\theta}^\top\boldsymbol{\theta}\right] + \text{const}$$

which is exactly $-\frac{1}{2}\chi^2_\text{reg} + \text{const}$.

The regularisation strength $\lambda$ controls how strongly the prior pulls the parameters toward zero. Large $\lambda$ means a tight prior (strong regularisation, smoother fits); small $\lambda$ means a weak prior (less regularisation, more flexibility).

ℹ️
Every regularisation scheme is a prior. Ridge regression ($L_2$ penalty) corresponds to a Gaussian prior. LASSO ($L_1$ penalty) corresponds to a Laplace prior, which encourages sparsity (many parameters exactly zero). Thinking of regularisation as a prior makes it clear that the choice of regularisation is a modelling decision, not just a numerical trick.

7.3 Practical Example

Consider fitting a Fourier series with $K = 30$ harmonics ($P = 61$ parameters) to $N = 25$ noisy data points. Without regularisation, the fit oscillates wildly. With regularisation, the high-frequency terms are suppressed and the fit smoothly captures the underlying trend.

A plain isotropic ridge penalty ($\lambda \mathbf{I}$) shrinks all coefficients equally, which dampens the signal along with the noise. A better choice for a Fourier basis is a frequency-dependent penalty that increases with harmonic number $k$: the prior says “low-frequency variations are more plausible than high-frequency ones”. This replaces $\lambda \mathbf{I}$ with a diagonal matrix $\boldsymbol{\Lambda}$ where $\Lambda_{jj} \propto k^2$ for the $k$-th harmonic:

# Fit a high-order Fourier series with and without regularisation
x = np.sort(np.random.uniform(0, 1, 25))
y_true = np.sin(2 * np.pi * x) + 0.5 * np.cos(4 * np.pi * x)
y = y_true + np.random.normal(0, 0.3, len(x))
y_err = 0.3 * np.ones(len(x))

K = 30
A = design_fourier(x, n_harmonics=K, omega=2*np.pi)
P = 2 * K + 1
C_inv = np.diag(1.0 / y_err**2)
M = A.T @ C_inv @ A
rhs = A.T @ C_inv @ y

# Unregularised: M is singular here (P > N), so lstsq returns the
# minimum-norm solution
theta_unreg = np.linalg.lstsq(M, rhs, rcond=None)[0]

# Regularised with frequency-dependent penalty
lam_base = 50.0
penalty = np.zeros(P)
for k in range(1, K + 1):
    penalty[2*k - 1] = lam_base * k**2   # sin(k*omega*x)
    penalty[2*k]     = lam_base * k**2   # cos(k*omega*x)
Lambda = np.diag(penalty)
theta_reg = np.linalg.solve(M + Lambda, rhs)

The Bayesian interpretation: this penalty corresponds to a Gaussian prior $p(\theta_j) = \mathcal{N}(0, 1/(\lambda_0 k^2))$ — higher harmonics have a tighter prior, encoding the belief that smooth functions have small high-frequency coefficients.

Figure 4

Figure 4: Fitting a Fourier series with $P = 61$ parameters to $N = 25$ data points. Left: without regularisation, the model oscillates wildly between data points. Right: with a frequency-dependent regularisation penalty ($\lambda_k \propto k^2$), the high-frequency components are suppressed while the low-frequency signal (the first two harmonics) is preserved. The regularised fit tracks the true function (dashed line) closely.


8. Convexity

The least-squares objective function $\chi^2(\boldsymbol{\theta})$ is a convex function – specifically, it is a positive semi-definite quadratic form in $\boldsymbol{\theta}$ (strictly positive-definite when $\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}$ is invertible). This means:

  1. Any local minimum is the global minimum. There is no risk of getting stuck in a false minimum.
  2. Gradient descent will always converge (with a suitable step size).
  3. The solution is unique (assuming $\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}$ is invertible).

The regularised objective $\chi^2_\text{reg}$ is also convex, since adding a positive-definite quadratic $\lambda\boldsymbol{\theta}^\top\boldsymbol{\theta}$ preserves convexity. In fact, adding the ridge penalty makes the problem strictly convex even when the original problem is not (i.e., when $\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}$ is singular).
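
A sketch of consequence 2 (illustrative data): plain gradient descent with a safe step size reaches the same answer as the closed-form solution.

```python
import numpy as np

# Sketch: on a convex quadratic objective, plain gradient descent with
# a safe step size converges to the closed-form solution. Illustrative.
rng = np.random.default_rng(5)
A = rng.normal(size=(50, 3))
y = A @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)

M, rhs = A.T @ A, A.T @ y
theta_exact = np.linalg.solve(M, rhs)

theta = np.zeros(3)
step = 1.0 / np.linalg.eigvalsh(M).max()   # < 2/lambda_max guarantees convergence
for _ in range(2000):
    theta -= step * (M @ theta - rhs)      # gradient of chi^2 / 2
```

Of course, for a linear model you should just use the exact solution — the point is that convexity makes even the naive iterative route safe.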

Figure 5

Figure 5: Contour plot of the least-squares objective function for a 2-parameter linear model. The objective is a convex quadratic – every contour is an ellipse, and there is a single global minimum. This is the fundamental reason linear models are computationally attractive: you never get stuck in local minima.

⚠️
Convexity is the exception, not the rule. Most real-world models are nonlinear in their parameters, making the objective function non-convex with multiple local minima. When you have a linear model, exploit it – use the exact solution rather than an iterative optimiser. When your model is nonlinear, the techniques of the next two lectures (optimisation and MCMC) become essential.

9. Bias-Variance Tradeoff

9.1 Underfitting and Overfitting

Consider fitting polynomials of increasing degree to data generated from a smooth function:

  • Degree 1 (linear): too simple to capture curvature. High bias, low variance – the model is systematically wrong but gives consistent answers across different noise realisations.
  • Degree 3 (cubic): captures the true structure well. A good balance.
  • Degree 10: starts to fit noise features. Low bias, high variance – the model can represent the truth, but is so flexible that it overfits noise.
  • Degree 19 (with $N = 20$ data points): passes through every data point and oscillates wildly between them.

Figure 6

Figure 6: Polynomial fits of degree 1, 3, 10, and 19 to 20 data points generated from a smooth function plus noise. Low-degree polynomials underfit (high bias); very high-degree polynomials overfit (high variance). The degree-3 polynomial captures the true function well.

9.2 Training Error vs. Test Error

The training error (the $\chi^2$ on the data used for fitting) always decreases as model complexity increases. But this is misleading – a model that memorises the training data will perform terribly on new data.

The test error (the $\chi^2$ on held-out data not used for fitting) decreases initially as the model becomes flexible enough to capture real structure, then increases as the model begins to fit noise. The optimal model complexity is where the test error is minimised.
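
A sketch of this experiment (illustrative data; the degree range is arbitrary): fit polynomials of increasing degree, recording the mean squared error on the training points and on a held-out test set.

```python
import numpy as np

# Sketch: training error always falls with degree, while test error
# eventually turns up. Data and degrees are illustrative.
rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(0, 0.2, 40)
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, 200)

train_err, test_err = [], []
for d in range(1, 16):
    A = np.column_stack([x**k for k in range(d + 1)])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    A_test = np.column_stack([x_test**k for k in range(d + 1)])
    train_err.append(np.mean((y - A @ theta)**2))
    test_err.append(np.mean((y_test - A_test @ theta)**2))
```

Because the models are nested (each degree contains the previous one), the training error can only decrease; the test error is what reveals overfitting.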

9.3 Controlling Complexity

There are two ways to control model complexity:

  1. Model selection: choose the number of basis functions (polynomial degree, number of Fourier harmonics, etc.).
  2. Regularisation: keep many basis functions but penalise large coefficients.

Cross-validation provides a principled way to choose the regularisation strength $\lambda$ (or model order). The idea is to repeatedly hold out a fraction of the data, fit on the rest, and evaluate the prediction error on the held-out set. The value of $\lambda$ that minimises the average held-out error is the best choice.
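
A minimal k-fold cross-validation sketch for choosing $\lambda$ in a ridge fit (illustrative data; `cv_error` is a hypothetical helper, not a library function):

```python
import numpy as np

# Sketch: k-fold cross-validation to choose the ridge strength lambda.
# cv_error is a hypothetical helper; data and grids are illustrative.
rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
A = np.column_stack([x**k for k in range(12)])   # deliberately flexible

folds = np.array_split(rng.permutation(len(y)), 5)   # fixed 5-fold split

def cv_error(A, y, lam, folds):
    err = 0.0
    for hold in folds:
        train = np.setdiff1d(np.arange(len(y)), hold)
        M = A[train].T @ A[train] + lam * np.eye(A.shape[1])
        theta = np.linalg.solve(M, A[train].T @ y[train])
        err += np.sum((y[hold] - A[hold] @ theta)**2)
    return err / len(y)

lambdas = np.logspace(-8, 2, 11)
scores = [cv_error(A, y, lam, folds) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]
```

Keeping the fold assignment fixed across the $\lambda$ grid ensures every candidate is scored on exactly the same held-out data.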


Figure 7: Training error and test error as a function of model complexity (number of polynomial terms). Training error decreases monotonically. Test error reaches a minimum at intermediate complexity – this is the sweet spot. To the left is underfitting; to the right is overfitting.


10. Bonus: Linear Operators on Linear Models

One powerful consequence of working with linear models is that differentiation, integration, and convolution are linear operations – and when applied to a sum of basis functions, the result is also a sum of (transformed) basis functions with the same coefficients.

If your model is:

$$f(x) = \sum_{j=1}^{P} \theta_j \phi_j(x) = \mathbf{a}(x)^\top \boldsymbol{\theta}$$

then the derivative is:

$$f'(x) = \sum_{j=1}^{P} \theta_j \phi_j'(x) = \mathbf{a}'(x)^\top \boldsymbol{\theta}$$

The same parameters $\boldsymbol{\theta}$ appear in both expressions. This means that once you have fitted $\boldsymbol{\theta}$, you get the derivative (and its uncertainty) for free – just by differentiating the basis functions.

For example, with a polynomial model $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$:

$$f'(x) = \theta_1 + 2\theta_2 x + 3\theta_3 x^2$$

The derivative design matrix $\mathbf{A}'$ has columns $[0, 1, 2x, 3x^2]$ instead of $[1, x, x^2, x^3]$.

# Derivative of a polynomial model from fitted coefficients
import numpy as np

def derivative_polynomial(x, theta):
    """Evaluate f'(x) for f(x) = sum_k theta[k] * x**k."""
    degree = len(theta) - 1
    # Derivative basis: d/dx x^k = k x^(k-1), for k = 1, ..., degree
    A_prime = np.column_stack([k * x**(k - 1) for k in range(1, degree + 1)])
    return A_prime @ theta[1:]

# Compare: analytic derivative vs finite differences of noisy data
x_fine = np.linspace(0, 5, 200)
f_prime_analytic = derivative_polynomial(x_fine, theta_hat)  # smooth
f_prime_fd = np.gradient(y, x)  # finite differences of noisy data — noisy!

This extends to any linear operator $\mathcal{L}$: if you can compute $\mathcal{L}[\phi_j]$ for each basis function, then $\mathcal{L}[f] = \sum_j \theta_j \mathcal{L}[\phi_j]$, and you can build a new design matrix $\mathbf{A}_{\mathcal{L}}$ that gives the operator’s output at any point.
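Integration works the same way. As a concrete check (with a hypothetical $f(x) = 1 + 3x^2$, chosen so the antiderivative is easy to verify by hand), the integrated design matrix has columns $x^{j+1}/(j+1)$ and is applied to the *same* coefficient vector:

```python
import numpy as np

# If f(x) = sum_j theta_j x^j, then ∫_0^x f(t) dt = sum_j theta_j x^(j+1)/(j+1):
# same coefficients, integrated basis functions.

def design_integral(x, degree):
    """Design matrix for the integration operator applied to a monomial basis."""
    return np.column_stack([x**(j + 1) / (j + 1) for j in range(degree + 1)])

theta = np.array([1.0, 0.0, 3.0])        # f(x) = 1 + 3x^2, so ∫_0^x f = x + x^3
x = np.linspace(0, 2, 5)
F = design_integral(x, 2) @ theta
print(np.allclose(F, x + x**3))          # → True: matches the analytic antiderivative
```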


Figure 8: Top: a function fitted with a polynomial basis. Bottom: the derivative computed analytically from the basis coefficients (smooth curve) compared with the derivative estimated by finite differences (noisy). The analytic derivative from the linear model is smooth and exact – it does not amplify noise the way numerical differentiation does.

ℹ️
Physical constraints via linear operators. If your model must satisfy a differential equation or a conservation law, you can enforce this by applying the corresponding linear operator to the basis functions and adding the constraint as additional rows of the design matrix (or as a prior). This is used in spectral methods for solving differential equations and in constrained regression models in astrophysics.
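A sketch of the pseudo-observation version of this trick (the constraint weight $w$, the cubic basis, and the even test function are illustrative assumptions): to demand $f'(0) = 0$, append the derivative basis evaluated at $x = 0$ as an extra row of the design matrix, with target value 0 and a very large weight.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 30)
y = np.cos(2 * x) + rng.normal(0, 0.05, 30)   # even function: truly f'(0) = 0

A = np.vander(x, 4, increasing=True)          # columns [1, x, x^2, x^3]
a_prime0 = np.array([0.0, 1.0, 0.0, 0.0])     # d/dx of each basis function at x = 0

w = 1e6                                        # large weight ≈ hard constraint
A_aug = np.vstack([A, np.sqrt(w) * a_prime0])  # constraint as an extra row
y_aug = np.append(y, 0.0)                      # ...with target value 0

theta, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
print(theta[1])   # ≈ 0: the fitted slope at the origin is pinned to zero
```

Because the constraint row enters the same least-squares problem, the solution remains closed-form and the parameter covariance still follows from the (augmented) normal equations.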

11. Summary

A linear model is any model of the form $\mathbf{y} = \mathbf{A}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$, where the design matrix $\mathbf{A}$ depends on the data but not on the parameters. Polynomials, Fourier series, RBFs, and PCA templates are all linear models.

The normal equations give the exact solution: $$\hat{\boldsymbol{\theta}} = \left(\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A}\right)^{-1}\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{y}$$ with parameter covariance $\boldsymbol{\Sigma}_\theta = (\mathbf{A}^\top\mathbf{C}^{-1}\mathbf{A})^{-1}$. This is simultaneously the MLE and the posterior mean under a flat prior.

Orthogonal bases decouple the normal equations, making the fit faster, more stable, and more interpretable. Fourier bases on regular grids are naturally orthogonal; polynomial bases are not. Spherical harmonics are the natural orthogonal basis for data on a sphere.

Numerical conditioning is critical for polynomial bases. Raw monomials produce catastrophically ill-conditioned Vandermonde matrices. Always normalise inputs to $[-1, 1]$ and prefer orthogonal polynomials (Chebyshev, Legendre) over monomials.

Underdetermined systems ($P > N$) have infinitely many solutions. Regularisation is essential.

Ridge regularisation is equivalent to a Gaussian prior on the parameters. It prevents overfitting, makes ill-conditioned problems well-conditioned, and has a clear Bayesian interpretation.

Convexity of the least-squares objective guarantees a unique global minimum with no risk of getting stuck in local minima. This is the fundamental computational advantage of linear models.

The bias-variance tradeoff governs model selection: too few parameters underfit, too many overfit. Cross-validation or regularisation provides a principled way to find the sweet spot.

Linear operators (derivatives, integrals, convolutions) applied to a linear model produce another linear model with the same coefficients – giving smooth, noise-free derivatives and enabling physical constraints.


Further Reading

  • Hogg, D. W. & Villar, S. (2021). Fitting very flexible models: Linear regression with large numbers of parameters. PASP 133, 093001. arXiv:2101.07256. [Key reference on using far more parameters than data with regularisation, numerical conditioning, and basis function choice]
  • Hogg, D. W., Bovy, J., & Lang, D. (2010). Data analysis recipes: Fitting a model to data. arXiv:1008.4686. [Sections 1–8 for the general linear model framework]
  • Press, W. H., et al. (2007). Numerical Recipes, Chapter 15. [Comprehensive treatment of least-squares fitting, SVD, and regularisation]
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 3. [The Bayesian perspective on linear regression, including regularisation and model selection]
  • VanderPlas, J. (2014). Frequentism and Bayesianism: A Python-driven Primer. arXiv:1411.5018. [Accessible comparison of frequentist and Bayesian approaches to linear models]