Mixture Models

Many astronomical datasets are not generated by a single process. A colour-magnitude diagram contains main-sequence stars, giants, and white dwarfs. A radial velocity survey picks up cluster members and field interlopers. A photometric catalogue records real sources and artefacts. In each case, the data are a mixture of distinct populations, and the scientific question is often about separating them — or at least about getting the right answer for one population without being corrupted by the others.

Mixture models provide the formal machinery for this decomposition. They introduce latent discrete variables — unobserved labels that indicate which component generated each data point — and then marginalise over those labels to produce a likelihood for the observed data. The Expectation-Maximisation (E-M) algorithm provides an elegant iterative scheme for fitting these models, and understanding how it works is the centrepiece of this lecture.

This lecture connects to several ideas you have already seen: generative models and DAGs (lecture 4), likelihood functions and optimisation (lectures 5 and 8), and hierarchical structure with partial pooling (lecture 12). Mixture models add a new ingredient — discrete latent variables — that opens up a qualitatively different class of problems.


1. Why Mixture Models?

Consider three scenarios from observational astronomy:

Stellar populations. You measure radial velocities for stars in the direction of a dwarf galaxy. Some stars are genuine members of the galaxy (clustered around $v \approx -120\,\text{km/s}$); others are foreground Milky Way stars with a broad velocity distribution. Fitting a single Gaussian to the combined sample gives you a mean and dispersion that describe neither population correctly.

Outlier modelling. You fit a straight line to photometric data, but a few points are catastrophically wrong — misidentified sources, reduction artefacts, cosmic ray hits. In lecture 6 we handled outliers by hand or with robust statistics. A mixture model provides a principled alternative: the data are generated by a “good” model (your line) plus a “bad” model (a broad background distribution), and the fit itself determines which points are which.

Source classification. A survey produces a catalogue of objects. Some are stars, some are galaxies, and the classifier is uncertain for faint sources. Rather than making a hard cut, a mixture model assigns each source a probability of belonging to each class, and downstream analyses can propagate that uncertainty.

In all these cases, the structure is the same: the observed data come from $K$ distinct generating processes, but you do not know which process produced which data point.


2. The Generative Story

A finite mixture model has a clean generative interpretation. To generate a single data point $y_i$:

  1. Draw a component label $z_i$ from a categorical distribution with probabilities $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_K)$, where $\pi_k \geq 0$ and $\sum_{k=1}^{K} \pi_k = 1$. These are the mixing weights (or mixing proportions).

  2. Draw the data point from the component distribution: $y_i \sim p(y \mid \theta_{z_i})$, where $\theta_k$ are the parameters of component $k$.

The label $z_i$ is a latent variable — it is part of the generative process but is never observed. We do not know which component produced which data point; if we did, the problem would reduce to $K$ independent fitting problems.

The DAG for this model is:

$$\boldsymbol{\pi} \to z_i \to y_i \leftarrow \theta_k$$

where the plate over $i$ runs from $1$ to $N$ (one latent label per data point) and the plate over $k$ runs from $1$ to $K$ (one set of parameters per component).

2.1 The Mixture Density

Marginalising over the latent label $z_i$, the probability of observing $y_i$ is:

$$p(y_i \mid \boldsymbol{\theta}, \boldsymbol{\pi}) = \sum_{k=1}^{K} \pi_k p(y_i \mid \theta_k)$$

This is the mixture density: a weighted sum of component densities. Each component $p(y_i \mid \theta_k)$ could be any parametric family — Gaussian, Poisson, Student-$t$, etc. The mixture itself generally belongs to none of these families: the combined density can take shapes (multimodality, skewness, heavy tails) that no single component could produce.
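As a quick numerical sanity check (with arbitrary illustrative parameters), we can evaluate a two-component mixture density on a grid and confirm it is a proper density:

```python
import numpy as np
from scipy.stats import norm

# Arbitrary two-component mixture for illustration
pi = np.array([0.4, 0.6])          # weights sum to 1
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.5])

x = np.linspace(-12, 14, 4001)
density = sum(pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2))

# Non-negative everywhere, and integrates to ~1 (simple Riemann sum)
area = np.sum(density) * (x[1] - x[0])
print(f"integral = {area:.4f}")
```

Because the weights sum to one and each component integrates to one, the mixture automatically integrates to one.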

Figure 1

Figure 1: A two-component Gaussian mixture (purple) decomposed into its individual components (dashed). The mixture density is simply $\pi_1 \mathcal{N}(\mu_1, \sigma_1^2) + \pi_2 \mathcal{N}(\mu_2, \sigma_2^2)$. Neither component alone describes the data; the mixture does.

ℹ️
Soft vs. hard assignment. A clustering algorithm like $k$-means assigns each point to exactly one cluster — a hard assignment. A mixture model assigns each point a probability of belonging to each component. This soft assignment is more honest: for a data point sitting between two clusters, it correctly reflects the ambiguity rather than forcing a binary choice.

3. Gaussian Mixture Models

The most common mixture model uses Gaussian components. A Gaussian mixture model (GMM) with $K$ components in $D$ dimensions has the density:

$$p(y \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\mathcal{N}(y \mid \mu_k, \Sigma_k)$$

where:

$$\mathcal{N}(y \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(y - \mu_k)^\top \Sigma_k^{-1} (y - \mu_k)\right)$$

The full parameter set is $\boldsymbol{\theta} = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$. For a $K$-component model in $D$ dimensions, this is:

  • $K - 1$ free mixing weights (since they sum to 1)
  • $K \times D$ mean parameters
  • $K \times D(D+1)/2$ covariance parameters (symmetric matrices)

For $K = 3$ and $D = 2$, that is already 17 parameters. The parameter count grows quickly.
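That counting rule is easy to encode in a small helper (illustrative, not from any library):

```python
def gmm_param_count(K, D):
    """Free parameters of a K-component GMM in D dimensions."""
    weights = K - 1                  # mixing weights minus the sum-to-one constraint
    means = K * D                    # one D-dimensional mean per component
    covs = K * D * (D + 1) // 2      # symmetric covariance matrices
    return weights + means + covs

print(gmm_param_count(3, 2))   # 17, as in the text
print(gmm_param_count(5, 10))  # grows quickly with dimension
```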

3.1 A Concrete 1D Example

Let us work with a specific dataset throughout this lecture. We generate $N = 300$ data points from a three-component Gaussian mixture:

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# True parameters
true_pi = np.array([0.3, 0.5, 0.2])
true_mu = np.array([-2.0, 1.5, 5.0])
true_sigma = np.array([0.8, 1.0, 0.6])
N = 300

# Generate data from the mixture
z = np.random.choice(3, size=N, p=true_pi)
y = np.array([np.random.normal(true_mu[z_i], true_sigma[z_i]) for z_i in z])

This gives us a dataset where the three components are partially overlapping — exactly the situation where hard classification fails and soft assignment is needed.

Figure 2

Figure 2: Histogram of 300 data points drawn from a three-component Gaussian mixture. The true component densities (dashed) and the overall mixture density (solid) are overlaid. The middle component dominates, but all three contribute to the observed distribution.


4. The Likelihood and Why Direct Optimisation Is Hard

For $N$ independent observations, the log-likelihood of a Gaussian mixture model is:

$$\log \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log \left[ \sum_{k=1}^{K} \pi_k\mathcal{N}(y_i \mid \mu_k, \Sigma_k) \right]$$

This is a sum of logs of sums — and that nested structure is the source of all the difficulty. Compare this to the log-likelihood of a single Gaussian, which is a sum of simple quadratic terms. Here, the logarithm sits outside a sum over components, and there is no algebraic simplification that separates the parameters of different components.
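A practical consequence of the nested structure: the inner sum must be evaluated in log space, or it underflows to zero for points far from every component. A sketch using scipy.special.logsumexp, with arbitrary parameter values:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

pi = np.array([0.3, 0.7])
mu = np.array([0.0, 50.0])
sigma = np.array([1.0, 1.0])
y = np.array([0.0, 50.0, 200.0])  # the last point is ~150 sigma from both components

# Naive: both component densities underflow to 0.0 at y = 200, so log gives -inf
inner = sum(pi[k] * norm.pdf(y, mu[k], sigma[k]) for k in range(2))

# Stable: stay in log space throughout and combine terms with logsumexp
log_terms = np.stack([np.log(pi[k]) + norm.logpdf(y, mu[k], sigma[k])
                      for k in range(2)])    # shape (K, N)
log_like = logsumexp(log_terms, axis=0)      # log of the inner sum, per point
print(log_like)                              # finite even at y = 200
```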

4.1 Non-Convexity

The mixture log-likelihood is non-convex. It has multiple local maxima, because any permutation of the component labels gives the same likelihood. A mixture with $(\pi_1, \mu_1, \sigma_1)$ and $(\pi_2, \mu_2, \sigma_2)$ has the same likelihood as one with $(\pi_2, \mu_2, \sigma_2)$ and $(\pi_1, \mu_1, \sigma_1)$. For $K$ components, there are $K!$ equivalent modes — this is the label-switching problem.

Beyond label switching, there can be genuinely distinct local optima corresponding to qualitatively different decompositions of the data.
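The label-switching symmetry is easy to verify numerically: permuting the component labels leaves the likelihood unchanged. A standalone check with arbitrary parameters and simulated data:

```python
import numpy as np
from scipy.stats import norm

def gmm_loglike(y, pi, mu, sigma):
    """Log-likelihood of a 1D Gaussian mixture (small standalone helper)."""
    dens = sum(p * norm.pdf(y, m, s) for p, m, s in zip(pi, mu, sigma))
    return np.sum(np.log(dens))

rng = np.random.default_rng(0)
y = rng.normal(0.0, 3.0, size=100)

pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 1.5, 5.0])
sigma = np.array([0.8, 1.0, 0.6])

perm = [2, 0, 1]  # one of the 3! = 6 relabellings
ll_a = gmm_loglike(y, pi, mu, sigma)
ll_b = gmm_loglike(y, pi[perm], mu[perm], sigma[perm])
print(ll_a, ll_b)  # identical values
```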

4.2 Singularities

There is a more insidious problem. If a Gaussian component collapses onto a single data point — $\mu_k = y_i$ and $\sigma_k \to 0$ — then $\mathcal{N}(y_i \mid \mu_k, \sigma_k) \to \infty$ and the log-likelihood diverges to $+\infty$. The global maximum of the likelihood is not a useful solution; it corresponds to a degenerate fit where one component has zero variance and explains exactly one data point.

⚠️
The likelihood is unbounded for Gaussian mixtures. This is a fundamental pathology, not a numerical accident. It means that maximum likelihood estimation for GMMs is technically ill-posed without regularisation or a prior. In practice, we either add a small floor to the variance, use Bayesian priors, or rely on E-M (which avoids the singularity in practice because it updates all points simultaneously).
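The pathology itself is easy to exhibit. In the sketch below (illustrative data), one component is pinned to a single data point and its width shrunk; the log-likelihood grows without bound:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=50)

def collapsed_loglike(y, s):
    """Two-component mixture: component 1 centred exactly on y[0] with width s."""
    dens = 0.5 * norm.pdf(y, y[0], s) + 0.5 * norm.pdf(y, 0.0, 1.0)
    return np.sum(np.log(dens))

for s in [1.0, 1e-5, 1e-20, 1e-60]:
    print(f"sigma_1 = {s:g}: log-likelihood = {collapsed_loglike(y, s):.1f}")
```

The term at $y_0$ contributes $-\log \sigma_1$ plus constants, which diverges as $\sigma_1 \to 0$, while the remaining points lose at most a bounded amount.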

4.3 Constrained Parameters

The mixing weights must satisfy $\pi_k \geq 0$ and $\sum_k \pi_k = 1$. The covariance matrices $\Sigma_k$ must be positive definite. These constraints make generic gradient-based optimisation awkward — you need either constrained optimisation or reparameterisations. The E-M algorithm handles all of these constraints naturally, which is one reason it became the standard approach.


5. The Expectation-Maximisation Algorithm

The Expectation-Maximisation (E-M) algorithm is an iterative method for finding (local) maximum likelihood estimates in models with latent variables. It was formalised by Dempster, Laird, and Rubin (1977), though the underlying idea had been used informally for decades.

5.1 The Core Intuition

The mixture model presents a chicken-and-egg problem:

  • If you knew the component assignments $z_i$, fitting the parameters would be trivial: just compute the mean and variance of the points assigned to each component.
  • If you knew the parameters, computing the assignments would be easy: just evaluate which component most likely generated each point using Bayes’ theorem.

E-M resolves this by alternating between the two steps. Start with a guess for the parameters, use it to compute soft assignments, then use those assignments to update the parameters. Repeat until convergence.

The beautiful property is that each iteration is guaranteed to increase the log-likelihood (or leave it unchanged). This means E-M always converges to a local maximum — though not necessarily the global one.

5.2 The E-Step: Computing Responsibilities

Given current parameter estimates $\boldsymbol{\theta}^{(t)} = \{\pi_k^{(t)}, \mu_k^{(t)}, \sigma_k^{(t)}\}$, compute the responsibility of component $k$ for data point $i$:

$$r_{ik} = \frac{\pi_k^{(t)}\mathcal{N}(y_i \mid \mu_k^{(t)}, \sigma_k^{2(t)})}{\sum_{j=1}^{K} \pi_j^{(t)}\mathcal{N}(y_i \mid \mu_j^{(t)}, \sigma_j^{2(t)})}$$

This is simply Bayes’ theorem applied to each data point. The numerator is the joint probability of component $k$ generating data point $y_i$ (prior times likelihood), and the denominator is the marginal probability of $y_i$ under the current model. The result is the posterior probability that data point $i$ was generated by component $k$.

The responsibilities satisfy $r_{ik} \geq 0$ and $\sum_k r_{ik} = 1$ for each data point — they form a proper probability distribution over components for each observation.

5.3 The M-Step: Updating Parameters

Given the responsibilities $r_{ik}$, update the parameters by treating the responsibilities as fractional counts. Define the effective number of points assigned to component $k$:

$$N_k = \sum_{i=1}^{N} r_{ik}$$

Then the updated parameters are:

$$\pi_k^{(t+1)} = \frac{N_k}{N}$$

$$\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik}y_i$$

$$\sigma_k^{2(t+1)} = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik}(y_i - \mu_k^{(t+1)})^2$$

These are just the weighted mean, weighted variance, and fraction formulae — the same as you would use if each data point had a fractional membership in each component rather than a hard assignment.
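A useful sanity check: when the responsibilities are hard (each row one-hot), these formulas reduce exactly to per-component sample statistics. A standalone verification with a made-up hard labelling:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(-2.0, 0.5, 40), rng.normal(3.0, 1.0, 60)])
z = np.array([0] * 40 + [1] * 60)   # pretend the labels were observed

# One-hot responsibilities
r = np.zeros((100, 2))
r[np.arange(100), z] = 1.0

# The M-step formulas
N_k = r.sum(axis=0)
pi_new = N_k / len(y)
mu_new = (r.T @ y) / N_k
var_new = np.array([(r[:, k] @ (y - mu_new[k])**2) / N_k[k] for k in range(2)])

# Identical to grouping the data by label and taking ordinary statistics
print(mu_new, [y[z == k].mean() for k in range(2)])
```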

ℹ️
Why does E-M respect the constraints automatically? The updated mixing weights $\pi_k^{(t+1)} = N_k / N$ are always non-negative and sum to 1, because the responsibilities $r_{ik}$ are non-negative and $\sum_k N_k = \sum_k \sum_i r_{ik} = \sum_i \sum_k r_{ik} = \sum_i 1 = N$. The updated variances are always non-negative because they are weighted sums of squares. No projection or reparameterisation needed.

5.4 The Complete Algorithm

Putting it together:

  1. Initialise the parameters $\{\pi_k^{(0)}, \mu_k^{(0)}, \sigma_k^{(0)}\}_{k=1}^K$
  2. E-step: Compute responsibilities $r_{ik}$ using the current parameters
  3. M-step: Update parameters using the responsibilities
  4. Check convergence: If the change in log-likelihood $|\Delta \log \mathcal{L}| < \epsilon$, stop. Otherwise, go to step 2.

In code, for the 1D case:

def log_likelihood(y, pi, mu, sigma):
    """Compute log-likelihood of a 1D Gaussian mixture."""
    N = len(y)
    K = len(pi)
    # L[i] = sum_k pi_k * N(y_i | mu_k, sigma_k)
    L = np.zeros(N)
    for k in range(K):
        L += pi[k] * norm.pdf(y, mu[k], sigma[k])
    return np.sum(np.log(L))


def e_step(y, pi, mu, sigma):
    """Compute responsibilities r[i, k]."""
    N = len(y)
    K = len(pi)
    r = np.zeros((N, K))
    for k in range(K):
        r[:, k] = pi[k] * norm.pdf(y, mu[k], sigma[k])
    # Normalise each row to sum to 1
    r /= r.sum(axis=1, keepdims=True)
    return r


def m_step(y, r):
    """Update parameters given responsibilities."""
    N, K = r.shape
    N_k = r.sum(axis=0)                            # effective counts
    pi_new = N_k / N                                # mixing weights
    mu_new = (r.T @ y) / N_k                        # weighted means
    sigma_new = np.zeros(K)
    for k in range(K):
        diff = y - mu_new[k]
        sigma_new[k] = np.sqrt((r[:, k] @ diff**2) / N_k[k])  # weighted std
    return pi_new, mu_new, sigma_new


def fit_gmm(y, K, max_iter=200, tol=1e-6):
    """Fit a K-component 1D GMM using E-M."""
    # Initialise with random means from data, equal weights, unit variance
    rng = np.random.default_rng(0)
    mu = rng.choice(y, size=K, replace=False)
    sigma = np.ones(K) * np.std(y)
    pi = np.ones(K) / K

    ll_history = []
    for iteration in range(max_iter):
        # E-step
        r = e_step(y, pi, mu, sigma)
        # M-step
        pi, mu, sigma = m_step(y, r)
        # Track log-likelihood
        ll = log_likelihood(y, pi, mu, sigma)
        ll_history.append(ll)

        if iteration > 0 and abs(ll_history[-1] - ll_history[-2]) < tol:
            print(f"Converged after {iteration + 1} iterations.")
            break

    # Recompute the responsibilities so they match the final parameters
    r = e_step(y, pi, mu, sigma)
    return pi, mu, sigma, r, ll_history

5.5 Worked Example: Step by Step

Let us trace through the first few iterations of E-M on our three-component dataset from Section 3.1. This is where understanding solidifies, so follow each step carefully.

Initialisation. Suppose we initialise with:

K = 3
mu = np.array([-1.0, 2.0, 4.0])     # initial guesses for means
sigma = np.array([1.5, 1.5, 1.5])    # initial guesses for std devs
pi = np.array([1/3, 1/3, 1/3])       # equal mixing weights

These are deliberately imprecise — the means are offset from the true values, the variances are too large, and the mixing weights are uniform.

Iteration 1, E-step. For each data point $y_i$, we compute the responsibility of each component. Take a specific point, say $y_i = -2.1$ (which was actually generated by component 1 with true mean $-2.0$):

$$r_{i,1} = \frac{\pi_1 \mathcal{N}(-2.1 \mid -1.0, 1.5^2)}{\sum_{k=1}^3 \pi_k \mathcal{N}(-2.1 \mid \mu_k, 1.5^2)}$$

Computing each numerator:

y_i = -2.1
# Numerator for each component
num_1 = (1/3) * norm.pdf(-2.1, -1.0, 1.5)  # = 0.333 * 0.2033 = 0.0678
num_2 = (1/3) * norm.pdf(-2.1, 2.0, 1.5)   # = 0.333 * 0.00635 = 0.00212
num_3 = (1/3) * norm.pdf(-2.1, 4.0, 1.5)   # = 0.333 * 0.0000682 = 0.0000227
denom = num_1 + num_2 + num_3               # = 0.0699

r_1 = num_1 / denom  # = 0.969
r_2 = num_2 / denom  # = 0.030
r_3 = num_3 / denom  # = 0.0003

Even with imprecise initial parameters, the E-step correctly identifies that $y_i = -2.1$ most likely belongs to component 1. A point at $y_i = 5.0$, by contrast, would have high responsibility for component 3.

Let us verify this numerically for a few representative points:

test_points = np.array([-2.1, 1.5, 5.2, 0.0])
r_test = e_step(test_points, pi, mu, sigma)

print("        y_i     r_1      r_2      r_3")
for i, yi in enumerate(test_points):
    print(f"  y = {yi:5.1f}:  {r_test[i,0]:.4f}   {r_test[i,1]:.4f}   {r_test[i,2]:.4f}")

The output is:

        y_i     r_1      r_2      r_3
  y = -2.1:  0.9694   0.0303   0.0003
  y =  1.5:  0.1726   0.6548   0.1726
  y =  5.2:  0.0002   0.1239   0.8758
  y =  0.0:  0.6455   0.3314   0.0230

Notice how $y = 0.0$ sits between components 1 and 2 and receives substantial responsibility from both — this is the soft assignment at work. A hard clustering algorithm would force this point into one group; E-M honestly distributes the credit.

Iteration 1, M-step. After computing responsibilities for all 300 data points, we update the parameters:

r_full = e_step(y, pi, mu, sigma)
pi_new, mu_new, sigma_new = m_step(y, r_full)

print(f"Updated weights: {pi_new}")
print(f"Updated means:   {mu_new}")
print(f"Updated sigmas:  {sigma_new}")

Even after a single iteration, the means shift toward the true data concentrations, the weights adjust away from uniform, and the variances begin to reflect the actual spread of each component.

Subsequent iterations. Each iteration sharpens the responsibilities and refines the parameters. The log-likelihood increases monotonically:

# Run full E-M
pi_fit, mu_fit, sigma_fit, r_final, ll_history = fit_gmm(y, K=3, max_iter=100)

print(f"\nFinal parameters:")
for k in range(3):
    print(f"  Component {k+1}: π={pi_fit[k]:.3f}, μ={mu_fit[k]:.3f}, σ={sigma_fit[k]:.3f}")
print(f"\nTrue parameters:")
for k in range(3):
    print(f"  Component {k+1}: π={true_pi[k]:.3f}, μ={true_mu[k]:.3f}, σ={true_sigma[k]:.3f}")

A typical run converges in 20-40 iterations. The recovered parameters will be close to the true values (up to label permutation), with deviations due to finite sample size.

Figure 3

Figure 3: E-M convergence on the three-component dataset. Left: the log-likelihood increases monotonically with each iteration, converging after roughly 30 steps. Right: the fitted mixture density (solid) compared to the true density (dashed). The recovered parameters are close to the true values.

5.6 Watching E-M Iterate

To build deeper intuition, it helps to visualise the state of the model at each iteration. The following code runs E-M and records snapshots:

import matplotlib.pyplot as plt

def run_em_with_snapshots(y, K, mu_init, sigma_init, pi_init, n_iter=10):
    """Run E-M and save parameter snapshots at each iteration."""
    mu, sigma, pi = mu_init.copy(), sigma_init.copy(), pi_init.copy()
    snapshots = [(pi.copy(), mu.copy(), sigma.copy())]

    for t in range(n_iter):
        r = e_step(y, pi, mu, sigma)
        pi, mu, sigma = m_step(y, r)
        snapshots.append((pi.copy(), mu.copy(), sigma.copy()))
    return snapshots

mu_init = np.array([-1.0, 2.0, 4.0])
sigma_init = np.array([1.5, 1.5, 1.5])
pi_init = np.array([1/3, 1/3, 1/3])

snapshots = run_em_with_snapshots(y, K=3, mu_init=mu_init,
                                  sigma_init=sigma_init,
                                  pi_init=pi_init, n_iter=20)

# Plot iterations 0, 1, 3, 10, 20
x_grid = np.linspace(y.min() - 2, y.max() + 2, 500)
for t in [0, 1, 3, 10, 20]:
    pi_t, mu_t, sigma_t = snapshots[t]
    density = sum(pi_t[k] * norm.pdf(x_grid, mu_t[k], sigma_t[k]) for k in range(3))
    plt.plot(x_grid, density, label=f'Iteration {t}')
plt.hist(y, bins=40, density=True, alpha=0.3, color='gray')
plt.legend()
plt.xlabel('y')
plt.ylabel('Density')
plt.title('E-M convergence')
plt.show()

Figure 4

Figure 4: Snapshots of the fitted mixture density at iterations 0, 1, 3, 10, and 20. The initial guess (iteration 0) is a poor fit; by iteration 10 the model has essentially converged to the correct density.


6. Convergence and Practical Issues

6.1 Monitoring Convergence

E-M is guaranteed to increase the log-likelihood at each step, but it does not tell you how close you are to the maximum. In practice, we monitor the change in log-likelihood and stop when it falls below a threshold:

$$|\log \mathcal{L}^{(t+1)} - \log \mathcal{L}^{(t)}| < \epsilon$$

A typical choice is $\epsilon = 10^{-6}$, but this depends on the problem. It is good practice to also impose a maximum number of iterations as a safeguard.

# Convergence plot
plt.plot(ll_history)
plt.xlabel('Iteration')
plt.ylabel('Log-likelihood')
plt.title('E-M convergence: log-likelihood vs. iteration')
plt.show()

The log-likelihood curve should be monotonically increasing and plateau. If it oscillates, there is a bug in your implementation.

6.2 Local Optima and Initialisation

E-M converges to a local maximum of the likelihood, not necessarily the global one. Different initialisations can lead to different solutions with different log-likelihoods. This is not a bug — it is a fundamental consequence of the non-convexity of the mixture likelihood.

The standard practical approach is:

  1. Run E-M multiple times with different random initialisations
  2. Keep the solution with the highest log-likelihood

Note that fit_gmm as written above uses a fixed internal seed, so every call starts from the same initialisation. To get genuinely different restarts, run the E-M loop directly with per-trial starting points:

best_ll = -np.inf
best_params = None

for trial in range(20):
    rng = np.random.default_rng(trial)
    mu = rng.choice(y, size=3, replace=False)
    sigma = np.ones(3) * np.std(y) * 0.5
    pi = np.ones(3) / 3

    for _ in range(200):
        r = e_step(y, pi, mu, sigma)
        pi, mu, sigma = m_step(y, r)

    ll = log_likelihood(y, pi, mu, sigma)
    if ll > best_ll:
        best_ll = ll
        best_params = (pi, mu, sigma)

print(f"Best log-likelihood: {best_ll:.2f}")

Other initialisation strategies include:

  • K-means initialisation: run $k$-means clustering first, then use the cluster means and variances to initialise E-M. This is the default in scikit-learn’s GaussianMixture.
  • K-means++: a smarter seeding that spreads initial centres apart
  • Random partition: randomly assign each data point to a component, then compute the M-step to get initial parameters

⚠️
Initialisation matters more than you think. For well-separated components, almost any reasonable initialisation works. For overlapping components — which is the scientifically interesting case — poor initialisation can lead to solutions where one component “eats” another, producing a two-component fit when three components are present. Always run multiple restarts.

6.3 Degenerate Solutions

If a component’s effective count $N_k$ drops to zero (or near zero), its variance can collapse and the log-likelihood can diverge. Practical defences:

  • Variance floor: enforce $\sigma_k \geq \sigma_\text{min}$ for some small positive value
  • Restart: if $N_k < 1$, reinitialise that component from a random data point
  • Bayesian regularisation: place priors on the parameters (discussed in Section 8)
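The first two defences fit in a few lines. The helper below is an illustrative sketch with arbitrary thresholds (not from the lecture's code or any library), meant to be called after each M-step:

```python
import numpy as np

def guard_components(pi, mu, sigma, y, N_k, sigma_min=1e-3, rng=None):
    """Variance floor plus restart of starved components (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    sigma = np.maximum(sigma, sigma_min)      # defence 1: variance floor
    for k in range(len(pi)):
        if N_k[k] < 1.0:                      # defence 2: component has ~no points
            mu[k] = rng.choice(y)             # reseed from a random data point
            sigma[k] = np.std(y)
            pi[k] = 1.0 / len(pi)
    pi = pi / pi.sum()                        # keep the weights on the simplex
    return pi, mu, sigma

# Component 0 has collapsed and captured almost nothing
pi = np.array([0.002, 0.598, 0.4])
mu = np.array([-2.0, 1.5, 5.0])
sigma = np.array([1e-9, 1.0, 0.6])
y = np.array([-2.1, 1.4, 5.2, 0.0, 1.6, 4.8])
N_k = np.array([0.01, 3.99, 2.0])

pi, mu, sigma = guard_components(pi, mu, sigma, y, N_k,
                                 rng=np.random.default_rng(0))
print(sigma, pi)  # component 0 reseeded; weights renormalised
```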

7. Multivariate Gaussian Mixtures

The 1D case extends naturally to $D$ dimensions. The E-step is identical in structure — just replace the 1D Gaussian PDF with the multivariate Gaussian PDF:

$$\mathcal{N}(y \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(y - \mu_k)^\top \Sigma_k^{-1}(y - \mu_k)\right)$$

The M-step updates become:

$$\mu_k^{(t+1)} = \frac{1}{N_k}\sum_{i=1}^{N} r_{ik}y_i$$

$$\Sigma_k^{(t+1)} = \frac{1}{N_k}\sum_{i=1}^{N} r_{ik}(y_i - \mu_k^{(t+1)})(y_i - \mu_k^{(t+1)})^\top$$

The key difference from the 1D case is that the covariance matrix $\Sigma_k$ captures correlations between dimensions within each component. This allows each component to have a different orientation, shape, and size — an elliptical cluster rather than a circular one.

from scipy.stats import multivariate_normal

def e_step_2d(Y, pi, mu, Sigma):
    """E-step for multivariate GMM. Y is (N, D)."""
    N = Y.shape[0]
    K = len(pi)
    r = np.zeros((N, K))
    for k in range(K):
        r[:, k] = pi[k] * multivariate_normal.pdf(Y, mu[k], Sigma[k])
    r /= r.sum(axis=1, keepdims=True)
    return r

def m_step_2d(Y, r):
    """M-step for multivariate GMM."""
    N, D = Y.shape
    K = r.shape[1]
    N_k = r.sum(axis=0)
    pi_new = N_k / N
    mu_new = np.zeros((K, D))
    Sigma_new = np.zeros((K, D, D))
    for k in range(K):
        mu_new[k] = (r[:, k] @ Y) / N_k[k]
        diff = Y - mu_new[k]  # (N, D)
        Sigma_new[k] = (diff.T * r[:, k]) @ diff / N_k[k]
    return pi_new, mu_new, Sigma_new

7.1 Astronomical Example: Colour-Magnitude Diagram

A classic application in stellar astronomy is decomposing a colour-magnitude diagram into distinct populations. In a globular cluster, you might see the main sequence, the red giant branch, and horizontal branch stars — each forming a distinct locus in colour-magnitude space. A 2D GMM can separate these populations probabilistically.
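A minimal sketch of such a fit on simulated data, using scikit-learn's GaussianMixture rather than our own implementation (the population loci and covariances below are invented for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)

# Invented loci for two populations in (colour, magnitude) space --
# purely illustrative numbers, not a real cluster
main_sequence = rng.multivariate_normal([0.6, 18.0],
                                        [[0.01, 0.03], [0.03, 1.0]], size=400)
giant_branch = rng.multivariate_normal([1.2, 14.0],
                                       [[0.02, 0.0], [0.0, 0.5]], size=100)
cmd = np.vstack([main_sequence, giant_branch])

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(cmd)
resp = gmm.predict_proba(cmd)       # soft membership, shape (500, 2)
print(np.sort(gmm.weights_))        # roughly [0.2, 0.8]
print(gmm.means_)                   # near the two population centres
```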

Figure 5

Figure 5: A 2D Gaussian mixture model fitted to simulated colour-magnitude data. Each ellipse shows the $2\sigma$ contour of one component. The shading indicates the responsibility-weighted assignment of each data point.


8. Connection to the Bayesian Approach

8.1 E-M as Coordinate Ascent

E-M can be understood as coordinate ascent on a lower bound of the log-likelihood. Define the evidence lower bound (ELBO):

$$\text{ELBO}(q, \theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} q(z_i = k) \left[\log p(y_i, z_i = k \mid \theta) - \log q(z_i = k)\right]$$

where $q(z_i = k)$ is an approximate posterior over the latent labels. The E-step optimises the ELBO with respect to $q$ (setting $q(z_i = k) = r_{ik}$, the true posterior), and the M-step optimises it with respect to $\theta$. This alternation is guaranteed to increase the log-likelihood because the ELBO is a lower bound on $\log \mathcal{L}$, and each step tightens that bound.
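The guarantee rests on a one-line identity: the gap between the log-likelihood and the ELBO is a Kullback-Leibler divergence, which is never negative:

```latex
\log \mathcal{L}(\theta) = \text{ELBO}(q, \theta)
  + \sum_{i=1}^{N} \mathrm{KL}\left[\, q(z_i) \,\|\, p(z_i \mid y_i, \theta) \,\right]
```

The E-step drives the KL term to zero, so the bound touches the log-likelihood at the current $\theta$; the M-step then raises the ELBO, and the log-likelihood must rise by at least as much.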

8.2 Full Bayesian Treatment with MCMC

E-M gives point estimates of the parameters. A fully Bayesian approach places priors on all parameters and samples the joint posterior:

$$p(\boldsymbol{\theta}, \mathbf{z} \mid \mathbf{y}) \propto \prod_{i=1}^{N} p(y_i \mid z_i, \boldsymbol{\theta}) \, p(z_i \mid \boldsymbol{\pi}) \times p(\boldsymbol{\pi}) \prod_{k=1}^{K} p(\mu_k) \, p(\sigma_k)$$

This is conceptually straightforward but computationally expensive: you need to sample both the continuous parameters $(\pi_k, \mu_k, \sigma_k)$ and the discrete latent labels $z_i$. The discrete variables make standard HMC inapplicable; you typically need Gibbs sampling (alternating between sampling $z_i$ given $\theta$ and sampling $\theta$ given $z$) or you marginalise out the $z_i$ and sample only the continuous parameters.
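To make the Gibbs scheme concrete, here is a deliberately stripped-down sampler that treats the component widths $\sigma_k$ as known (with flat priors on the means and a uniform Dirichlet prior on the weights, every conditional is then a standard distribution). This is an illustrative sketch, not a production sampler:

```python
import numpy as np
from scipy.stats import norm

def gibbs_gmm(y, sigma, n_steps=300, seed=0):
    """Gibbs sampler for a 1D mixture with KNOWN component widths (sketch).
    Alternates: z | pi, mu (categorical), pi | z (Dirichlet), mu | z (Normal)."""
    rng = np.random.default_rng(seed)
    K = len(sigma)
    mu = np.quantile(y, np.linspace(0.1, 0.9, K))  # spread-out initial means
    pi = np.ones(K) / K
    samples = []
    for _ in range(n_steps):
        # 1. Labels: posterior over components for each point (responsibilities)
        logp = np.log(pi) + norm.logpdf(y[:, None], mu[None, :], sigma[None, :])
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=row) for row in p])
        # 2. Weights: Dirichlet(1 + counts) under a uniform Dirichlet prior
        counts = np.bincount(z, minlength=K)
        pi = rng.dirichlet(1 + counts)
        # 3. Means: with a flat prior, mu_k | z ~ N(mean of members, sigma_k^2/n_k)
        for k in range(K):
            members = y[z == k]
            if len(members) > 0:
                mu[k] = rng.normal(members.mean(),
                                   sigma[k] / np.sqrt(len(members)))
        samples.append((pi.copy(), mu.copy()))
    return samples

# Two well-separated groups whose widths we treat as known
rng = np.random.default_rng(42)
y_demo = np.concatenate([rng.normal(-3.0, 1.0, 120), rng.normal(4.0, 1.0, 80)])
samples = gibbs_gmm(y_demo, sigma=np.array([1.0, 1.0]))
mu_post = np.mean([m for _, m in samples[100:]], axis=0)  # discard burn-in
print(np.sort(mu_post))  # posterior means close to the true values (-3, 4)
```

Unlike E-M, which converges to a single point estimate, the retained samples describe the joint posterior of the weights and means.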

The Bayesian approach has three advantages:

  1. Uncertainty quantification: you get posterior distributions over all parameters, not just point estimates
  2. Regularisation: priors prevent the variance singularity that plagues maximum likelihood
  3. Model comparison: the marginal likelihood $p(\mathbf{y} \mid K)$ lets you compare models with different numbers of components (lecture 15)

And one practical disadvantage: it is much slower, especially for large datasets.

ℹ️
When to use E-M vs. MCMC. Use E-M when you have a large dataset ($N > 10{,}000$), when you need a fast answer, or when you are exploring different values of $K$ and need many fits quickly. Use MCMC when you need full uncertainty quantification, when $N$ is moderate, or when you want to compare models using marginal likelihoods. In practice, it is common to use E-M for initialisation and exploration, then switch to MCMC for the final analysis.

8.3 A PyMC Example

For moderate-sized problems, a Bayesian mixture model can be implemented in PyMC:

import pymc as pm

with pm.Model() as mixture_model:
    # Priors on mixing weights (Dirichlet is the conjugate prior)
    w = pm.Dirichlet("w", a=np.ones(3))

    # Priors on component means and standard deviations
    mu = pm.Normal("mu", mu=0, sigma=10, shape=3)
    sigma = pm.HalfNormal("sigma", sigma=5, shape=3)

    # Mixture likelihood
    likelihood = pm.NormalMixture("obs", w=w, mu=mu, sigma=sigma, observed=y)

    # Sample
    trace = pm.sample(2000, tune=1000, cores=2, random_seed=42)

The pm.NormalMixture distribution marginalises over the latent labels internally, so you only sample the continuous parameters. The Dirichlet prior on the weights is the natural conjugate prior for categorical probabilities.


9. Choosing the Number of Components

So far we have assumed that $K$ is known. In practice, it rarely is. How many stellar populations are present? How many foreground contaminants? The choice of $K$ is a model comparison problem, and we will treat it properly in lecture 15 using the Bayesian evidence and information criteria. Here, we preview the main ideas.

9.1 The Problem with More Components

Adding more components will always increase the log-likelihood (or at least not decrease it), because a $(K+1)$-component model can always reproduce a $K$-component model by setting one weight to zero. You cannot use the log-likelihood alone to choose $K$.

9.2 Information Criteria

The Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) penalise model complexity:

$$\text{BIC} = -2 \log \mathcal{L}_\text{max} + p \log N$$

$$\text{AIC} = -2 \log \mathcal{L}_\text{max} + 2p$$

where $p$ is the number of free parameters and $N$ is the number of data points. Lower values are preferred. For a 1D GMM with $K$ components, $p = 3K - 1$ (means, variances, and weights minus the sum-to-one constraint).

def compute_bic(y, K, ll_max):
    """BIC for a 1D GMM with K components."""
    N = len(y)
    p = 3 * K - 1  # free parameters
    return -2 * ll_max + p * np.log(N)

# Fit models with K = 1, 2, ..., 6 and compute BIC
bic_values = []
for K in range(1, 7):
    _, _, _, _, ll = fit_gmm(y, K=K, max_iter=200)
    bic_values.append(compute_bic(y, K, ll[-1]))

print("K | BIC")
for K, bic in enumerate(bic_values, 1):
    print(f"{K} | {bic:.1f}")

Figure 6

Figure 6: BIC as a function of the number of components $K$. The minimum at $K = 3$ correctly identifies the true number of components in our simulated dataset. Underfitting ($K < 3$) produces a large BIC; overfitting ($K > 3$) incurs a complexity penalty that outweighs the marginal improvement in fit.

ℹ️
BIC vs. AIC. The BIC penalises complexity more strongly than the AIC (because $\log N > 2$ for $N \geq 8$). For mixture models, the BIC is generally preferred because it is consistent — as $N \to \infty$, the BIC selects the correct model with probability 1. The AIC tends to overfit, selecting too many components. We will discuss the theoretical foundations in lecture 15.

10. Practical Example: Separating Stellar Populations

Let us bring everything together with an astronomical application. Suppose you have radial velocity measurements for 500 stars in the direction of a dwarf spheroidal galaxy. The sample contains galaxy members (clustered around $v_\text{gal} = -120\,\text{km/s}$ with dispersion $\sigma_\text{gal} = 8\,\text{km/s}$) and Milky Way foreground contamination (broadly distributed around $v_\text{MW} = 0\,\text{km/s}$ with dispersion $\sigma_\text{MW} = 50\,\text{km/s}$).

# Simulate the dataset
np.random.seed(123)
N_gal = 150    # galaxy members
N_mw = 350     # Milky Way foreground
v_gal = np.random.normal(-120, 8, N_gal)
v_mw = np.random.normal(0, 50, N_mw)
v_all = np.concatenate([v_gal, v_mw])
np.random.shuffle(v_all)

# Fit a 2-component GMM
pi_fit, mu_fit, sigma_fit, r_fit, ll_fit = fit_gmm(v_all, K=2, max_iter=200)

# Sort components by mean (for consistent labelling)
order = np.argsort(mu_fit)
pi_fit, mu_fit, sigma_fit = pi_fit[order], mu_fit[order], sigma_fit[order]
r_fit = r_fit[:, order]

print(f"Galaxy component:   π={pi_fit[0]:.3f}, μ={mu_fit[0]:.1f}, σ={sigma_fit[0]:.1f}")
print(f"Foreground component: π={pi_fit[1]:.3f}, μ={mu_fit[1]:.1f}, σ={sigma_fit[1]:.1f}")

The recovered parameters should be close to the true values: $\pi_\text{gal} \approx 0.30$, $\mu_\text{gal} \approx -120$, $\sigma_\text{gal} \approx 8$. The velocity dispersion of the galaxy ($\sigma_\text{gal}$) is a key scientific output — it constrains the dark matter content of the dwarf galaxy.

10.1 Membership Probabilities

The responsibility $r_{i,\text{gal}}$ gives the probability that star $i$ is a galaxy member. Stars with $r_{i,\text{gal}} > 0.9$ are confident members; those with intermediate probabilities ($0.1 < r_{i,\text{gal}} < 0.9$) are ambiguous. Rather than making a hard cut, you can weight downstream analyses (e.g., a dynamical mass estimate) by these membership probabilities.

# Identify confident members
member_prob = r_fit[:, 0]  # probability of being in the galaxy component
confident_members = member_prob > 0.9
ambiguous = (member_prob > 0.1) & (member_prob < 0.9)

print(f"Confident galaxy members: {confident_members.sum()}")
print(f"Ambiguous stars: {ambiguous.sum()}")
print(f"Confident foreground: {(member_prob < 0.1).sum()}")


Figure 7: Top: histogram of radial velocities with the two fitted Gaussian components overlaid. Bottom: membership probability for the galaxy component as a function of velocity. Stars near $v = -120 \text{km/s}$ have membership probability near 1; those at intermediate velocities have uncertain membership.

10.2 Propagating Membership Uncertainty

When computing the velocity dispersion of the galaxy, you should use the membership probabilities as weights rather than making a hard cut:

# Membership-weighted velocity dispersion
w = r_fit[:, 0]  # galaxy membership probabilities
mu_gal_weighted = np.average(v_all, weights=w)
sigma_gal_weighted = np.sqrt(np.average((v_all - mu_gal_weighted)**2, weights=w))

print(f"Weighted mean velocity: {mu_gal_weighted:.1f} km/s")
print(f"Weighted velocity dispersion: {sigma_gal_weighted:.1f} km/s")

This is more robust than a hard membership cut because it does not introduce an arbitrary threshold, and it naturally down-weights contaminating foreground stars.


11. Using scikit-learn

For production work, you should use a well-tested implementation rather than writing your own. The scikit-learn library provides GaussianMixture:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gmm.fit(y.reshape(-1, 1))

print(f"Means: {gmm.means_.flatten()}")
print(f"Variances: {gmm.covariances_.flatten()}")
print(f"Weights: {gmm.weights_}")
print(f"BIC: {gmm.bic(y.reshape(-1, 1)):.1f}")
print(f"Converged: {gmm.converged_}")
print(f"Iterations: {gmm.n_iter_}")

# Responsibilities
r_sklearn = gmm.predict_proba(y.reshape(-1, 1))

The n_init=10 argument runs E-M 10 times with different initialisations and keeps the best solution — implementing the multi-restart strategy from Section 6.2 automatically.

⚠️
The covariance_type parameter matters. By default, GaussianMixture fits full covariance matrices. For high-dimensional data with few points per component, this can lead to poorly estimated covariances. Options include 'full' (default), 'tied' (all components share one covariance matrix), 'diag' (diagonal covariances — no correlations), and 'spherical' (isotropic — one variance per component). Choosing the right type is a modelling decision: you are specifying the shape of each component.
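The effect is easy to demonstrate. The toy sketch below (not data from the lecture) fits two clusters that share one elongated, correlated covariance with each structure and compares BIC: 'tied' matches 'full' in fit quality with fewer free parameters, while 'diag' and 'spherical' cannot capture the correlation at all:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# 2-D toy data: two clusters sharing one correlated covariance
rng = np.random.default_rng(1)
cov = [[2.0, 1.5], [1.5, 2.0]]
X = np.vstack([
    rng.multivariate_normal([0, 0], cov, 300),
    rng.multivariate_normal([8, 8], cov, 300),
])

# Compare covariance structures via BIC (which counts free parameters)
bic = {}
for ctype in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=2, covariance_type=ctype,
                          n_init=5, random_state=0).fit(X)
    bic[ctype] = gmm.bic(X)
    print(f"{ctype:9s}  BIC = {bic[ctype]:.1f}")
```

Here 'tied' wins because it is the true generative structure; on real data the ranking is telling you which shape assumption the data support.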

12. Mixture Models as Outlier Models

One of the most useful applications of mixture models is outlier modelling — and you have already seen the motivation in lecture 5. Instead of manually identifying and removing bad data points, you can model the data as a mixture of:

  • A signal component: the model you actually care about (e.g., a straight line)
  • A background component: a broad distribution that captures outliers

For a line-fitting problem with Gaussian outliers:

$$p(y_i \mid \theta) = (1 - f_\text{out})\mathcal{N}(y_i \mid mx_i + b, \sigma_i^2) + f_\text{out}\mathcal{N}(y_i \mid \mu_\text{out}, \sigma_\text{out}^2)$$

where $f_\text{out}$ is the outlier fraction, and $(\mu_\text{out}, \sigma_\text{out})$ describe the outlier distribution. This is exactly the Hogg, Bovy & Lang (2010) approach from lecture 6 — now you can see it as a special case of a two-component mixture model.

The E-M algorithm applied here alternates between:

  • E-step: compute the probability that each point is an outlier vs. a good measurement
  • M-step: refit the line with each point weighted by its inlier probability (so confident outliers contribute essentially nothing), and update the outlier fraction

This is far more principled than sigma-clipping or manual rejection because it lets the data determine which points are outliers, and it propagates the uncertainty in that determination into the parameter estimates.
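A minimal E-M sketch of this outlier mixture, under simplifying assumptions: known per-point uncertainties $\sigma_i$, and a fixed broad background component, so only the slope, intercept, and outlier fraction are updated. All names and simulated values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated line data (y = 2x + 1) with three catastrophic outliers
x = np.sort(rng.uniform(0, 10, 30))
sigma = np.full_like(x, 1.0)                  # known per-point uncertainties
y = 2.0 * x + 1.0 + rng.normal(0, sigma)
y[[3, 12, 25]] += rng.choice([-1.0, 1.0], 3) * rng.uniform(10, 20, 3)

def gauss(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initialise from ordinary least squares; fix a broad background component
m, b = np.polyfit(x, y, 1)
f_out = 0.1
mu_out, var_out = y.mean(), (10.0 * y.std()) ** 2

for _ in range(100):
    # E-step: posterior probability that each point is an inlier
    p_in = (1 - f_out) * gauss(y, m * x + b, sigma ** 2)
    p_out = f_out * gauss(y, mu_out, var_out)
    q = p_in / (p_in + p_out)

    # M-step: weighted least squares for the line; update outlier fraction
    w = q / sigma ** 2
    W, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    m = (W * Sxy - Sx * Sy) / (W * Sxx - Sx ** 2)
    b = (Sy - m * Sx) / W
    f_out = np.mean(1 - q)

print(f"slope = {m:.2f}, intercept = {b:.2f}, f_out = {f_out:.2f}")
```

The three outliers end up with $q_i \approx 0$, so they contribute essentially nothing to the weighted fit, and $f_\text{out}$ converges to roughly their fraction of the sample.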


Figure 8: A line-fitting problem with three outliers. Left: the best-fit line ignoring outliers (dashed) vs. the mixture model fit (solid). Right: the inlier probability $1 - r_{i,\text{out}}$ for each data point. The three outliers are correctly identified with inlier probability near zero, and the fitted line is robust to their presence.


Summary

Mixture models decompose data into $K$ overlapping components, each with its own parameters. The generative story introduces latent discrete labels $z_i$ that indicate which component produced each data point. Marginalising over these labels gives the mixture density $p(y) = \sum_k \pi_k p(y \mid \theta_k)$.

The E-M algorithm fits mixture models by alternating between computing soft assignments (E-step: responsibilities via Bayes’ theorem) and updating parameters (M-step: weighted statistics). Each iteration is guaranteed not to decrease the log-likelihood. The algorithm is simple to implement, handles parameter constraints naturally, and requires no gradient computation or step-size tuning.

E-M finds local optima, not global ones. The mixture likelihood is non-convex with $K!$ equivalent modes from label permutations plus potentially distinct local maxima. Multi-restart initialisation is essential.

The Bayesian approach places priors on all parameters and samples the joint posterior with MCMC. This gives full uncertainty quantification and enables model comparison via the marginal likelihood, but is computationally more expensive.

Choosing $K$ requires a model comparison criterion. The BIC penalises complexity and is consistent for mixture models. We will develop this topic fully in lecture 15.

Practical applications include separating stellar populations, classifying sources, handling outliers, and modelling heterogeneous datasets. The membership probabilities from the E-step are directly useful scientific outputs — they propagate classification uncertainty into downstream analyses.


Further Reading

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38. [The foundational paper]
  • McLachlan, G. J. & Peel, D. (2000). Finite Mixture Models. Wiley. [The comprehensive reference on mixture models]
  • Hogg, D. W., Bovy, J., & Lang, D. (2010). Data analysis recipes: Fitting a model to data. arXiv:1008.4686. [The outlier mixture model in Section 12]
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 9. [Excellent treatment of E-M for Gaussian mixtures]
  • Walker, M. G., et al. (2009). A universal mass profile for dwarf spheroidal galaxies? ApJ, 704, 1274. [Mixture models for separating stellar populations in dwarf galaxies]
  • Ivezic, Z., et al. (2014). Statistics, Data Mining, and Machine Learning in Astronomy, Chapter 6. [Mixture models in an astronomical context]
  • Foreman-Mackey, D., Hogg, D. W., & Morton, T. D. (2014). Exoplanet population inference and the abundance of Earth analogs from noisy, incomplete catalogs. ApJ, 795, 64. [Mixture models for population inference with selection effects]