Latent Variable Models
Lecture 13 introduced latent discrete variables — unobserved labels that assign each data point to a mixture component. This lecture introduces latent continuous variables: unobserved quantities that drive the correlations you see in high-dimensional data. The central idea is dimensionality reduction — the possibility that a dataset living in $D$ dimensions is actually controlled by a much smaller number, $K \ll D$, of hidden factors.
This idea is everywhere in astronomy. Stellar spectra have thousands of wavelength pixels, but the variation between stars is dominated by a handful of physical parameters: effective temperature, surface gravity, metallicity. Galaxy spectral energy distributions span dozens of photometric bands, but a few latent factors — stellar mass, star formation history, dust — explain most of the variance. If you can discover those latent factors from the data alone, without knowing the physics in advance, you have a powerful tool for exploration, compression, and discovery.
We will build up from matrix factorisation (the algebraic backbone), through Principal Component Analysis (the workhorse), to nonlinear embedding methods (t-SNE and UMAP) that reveal structure PCA cannot see. Along the way, the recurring theme is the tension between linear methods that are interpretable and globally faithful, and nonlinear methods that capture local structure but can mislead.
1. The Latent Variable Idea
1.1 From Mixture Models to Continuous Latent Variables
In a mixture model, each data point $y_i$ is associated with a discrete latent variable $z_i \in \{1, \ldots, K\}$. The generative story is: draw a label, then draw the observation conditioned on that label. Now replace the discrete label with a continuous latent vector $\mathbf{z}_i \in \mathbb{R}^K$:
- Draw a latent vector: $\mathbf{z}_i \sim p(\mathbf{z})$
- Generate the observation: $\mathbf{y}_i \sim p(\mathbf{y} \mid \mathbf{z}_i)$
If $K < D$, the model asserts that the high-dimensional observations are generated by a low-dimensional process. The latent variables $\mathbf{z}_i$ are the "true" coordinates; the observations $\mathbf{y}_i$ are a noisy, high-dimensional reflection of them.
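This generative story is easy to simulate. The sketch below draws latent vectors, pushes them through a fixed linear map, and adds noise; the sizes, the map, and the noise level are illustrative assumptions, not a specific model from this lecture. The singular values of the resulting data matrix reveal the low intrinsic dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 20, 3          # illustrative sizes (hypothetical)

# Draw latent vectors z_i ~ N(0, I_K)
Z = rng.standard_normal((N, K))

# A fixed linear map from latent space to observation space
W = rng.standard_normal((D, K))

# Generate observations y_i = W z_i + noise
Y = Z @ W.T + 0.1 * rng.standard_normal((N, D))

# Although Y is D-dimensional, only K singular values are large;
# the rest sit at the noise level.
s = np.linalg.svd(Y, compute_uv=False)
```

The sharp drop after the third singular value is the data telling you its intrinsic dimensionality.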
1.2 Why This Matters
Consider a spectroscopic survey that measures flux at $D = 4000$ wavelength pixels for each star. If stellar spectra are controlled by three physical parameters (temperature, gravity, metallicity), then the $N \times 4000$ data matrix has an intrinsic dimensionality of roughly 3 — the 4000-dimensional data points lie near a 3-dimensional surface in pixel space. Discovering that surface, and the coordinates on it, is the goal of latent variable modelling.
The practical payoffs are:
- Compression: represent each spectrum by 3 numbers instead of 4000
- Visualisation: plot 3D structure that is invisible in any single pixel
- Denoising: project noisy observations onto the low-dimensional surface
- Anomaly detection: points far from the surface are unusual and worth investigating
2. Matrix Factorisation
2.1 The Basic Idea
The algebraic foundation of most linear dimensionality reduction methods is matrix factorisation: decomposing a data matrix into a product of smaller matrices.
Let $\mathbf{Y}$ be an $N \times D$ data matrix (rows are observations, columns are features). A rank-$K$ factorisation writes:
$$\mathbf{Y} \approx \mathbf{Z}\mathbf{W}^\top$$
where $\mathbf{Z}$ is $N \times K$ (the latent representations, or scores) and $\mathbf{W}$ is $D \times K$ (the loading matrix, or basis vectors). Each row of $\mathbf{Z}$ gives the $K$-dimensional coordinates of one data point; each column of $\mathbf{W}$ is a basis vector in the original $D$-dimensional space.
This is a latent variable model in disguise: the observation $\mathbf{y}_i$ is generated by taking a linear combination of $K$ basis vectors, with coefficients given by $\mathbf{z}_i$.
2.2 A Concrete Example: Movie Recommendations
To build intuition, consider the Netflix problem. You have an $N \times D$ matrix of user–movie ratings, where most entries are missing. A matrix factorisation model assumes each user $i$ has a latent preference vector $\mathbf{z}_i \in \mathbb{R}^K$ and each movie $j$ has a latent attribute vector $\mathbf{w}_j \in \mathbb{R}^K$, and the rating is:
$$r_{ij} \approx \mathbf{z}_i^\top \mathbf{w}_j$$
The latent dimensions might correspond to interpretable axes — "action vs. drama", "mainstream vs. indie" — but they are learned from the data, not defined in advance. Fitting the model means minimising a loss function like:
$$\min_{\mathbf{Z}, \mathbf{W}} \sum_{(i,j) \in \text{observed}} (r_{ij} - \mathbf{z}_i^\top \mathbf{w}_j)^2 + \lambda(\|\mathbf{Z}\|^2 + \|\mathbf{W}\|^2)$$
where $\lambda$ is a regularisation parameter that prevents overfitting to the sparse observations.
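One common way to minimise this loss is alternating least squares: hold $\mathbf{W}$ fixed and solve for each user's $\mathbf{z}_i$ (a small ridge regression over the movies they rated), then hold $\mathbf{Z}$ fixed and solve for each movie's $\mathbf{w}_j$. A minimal sketch on a synthetic ratings matrix; the sizes, the observation mask, and the $\lambda$ value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K, lam = 30, 20, 2, 0.1     # toy sizes and regularisation (hypothetical)

# Toy ratings from a planted rank-K model, with roughly half the entries observed
Z_true = rng.standard_normal((N, K))
W_true = rng.standard_normal((D, K))
R = Z_true @ W_true.T
mask = rng.random((N, D)) < 0.5

Z = rng.standard_normal((N, K))
W = rng.standard_normal((D, K))

# Alternating least squares: each subproblem is a ridge regression
for _ in range(20):
    for i in range(N):
        Wi = W[mask[i]]                    # movies rated by user i
        Z[i] = np.linalg.solve(Wi.T @ Wi + lam * np.eye(K),
                               Wi.T @ R[i, mask[i]])
    for j in range(D):
        Zj = Z[mask[:, j]]                 # users who rated movie j
        W[j] = np.linalg.solve(Zj.T @ Zj + lam * np.eye(K),
                               Zj.T @ R[mask[:, j], j])

# Reconstruction error on the observed entries
rmse = np.sqrt(np.mean((R - Z @ W.T)[mask] ** 2))
```

Each inner solve is $K \times K$, so the cost scales with the number of observed entries rather than $N \times D$.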
2.3 The Singular Value Decomposition
The optimal rank-$K$ factorisation (in the least-squares sense) is given by the singular value decomposition (SVD). Any $N \times D$ matrix $\mathbf{Y}$ can be written as:
$$\mathbf{Y} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$$
where $\mathbf{U}$ is $N \times N$ (left singular vectors), $\boldsymbol{\Sigma}$ is $N \times D$ (diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$), and $\mathbf{V}$ is $D \times D$ (right singular vectors). The best rank-$K$ approximation is obtained by keeping only the $K$ largest singular values:
$$\mathbf{Y} \approx \mathbf{U}_K \boldsymbol{\Sigma}_K \mathbf{V}_K^\top$$
This is the Eckart–Young theorem: no other rank-$K$ matrix is closer to $\mathbf{Y}$ in the Frobenius norm.
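The truncated SVD and the Eckart–Young property can be checked directly in NumPy (the matrix sizes below are illustrative): the Frobenius error of the rank-$K$ approximation equals the root sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((100, 50))

U, s, Vt = np.linalg.svd(Y, full_matrices=False)

K = 10
Y_K = U[:, :K] * s[:K] @ Vt[:K]    # best rank-K approximation

# Eckart-Young: the Frobenius error is exactly the discarded spectrum
err = np.linalg.norm(Y - Y_K)
assert np.isclose(err, np.sqrt(np.sum(s[K:] ** 2)))
```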
3. Principal Component Analysis
PCA is the most widely used dimensionality reduction method in the sciences, and it is nothing more than the SVD applied to mean-centred data.
3.1 The Algorithm
Given an $N \times D$ data matrix $\mathbf{Y}$:
- Centre the data: subtract the column means to get $\mathbf{X} = \mathbf{Y} - \bar{\mathbf{Y}}$
- Compute the covariance matrix: $\mathbf{C} = \frac{1}{N-1}\mathbf{X}^\top\mathbf{X}$
- Eigendecompose: $\mathbf{C} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top$, where $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0$
- Project: the principal component scores are $\mathbf{Z} = \mathbf{X}\mathbf{V}_K$, where $\mathbf{V}_K$ is the $D \times K$ matrix of the first $K$ eigenvectors
The eigenvectors $\mathbf{v}_k$ are the principal directions — orthogonal axes in the original space, ordered by the amount of variance they explain. The eigenvalues $\lambda_k$ give the variance along each direction. The fraction of variance explained by the first $K$ components is:
$$f_K = \frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{D} \lambda_k}$$
3.2 Connection to the SVD
The eigenvectors of the covariance matrix are the right singular vectors of $\mathbf{X}$, and the eigenvalues are related to the singular values by $\lambda_k = \sigma_k^2 / (N-1)$. In practice, most software computes PCA via the SVD rather than by forming and eigendecomposing the covariance matrix, because the SVD is numerically more stable.
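Both routes can be verified numerically. This sketch (on illustrative data) computes PCA via the SVD of the centred data, then confirms that the covariance eigenvalues satisfy $\lambda_k = \sigma_k^2 / (N-1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.standard_normal((200, 8))

# Centre, then take the SVD of the centred data
X = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of the covariance matrix, sorted descending
C = X.T @ X / (len(X) - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

# The two routes agree: lambda_k = sigma_k^2 / (N - 1)
assert np.allclose(eigvals, s ** 2 / (len(X) - 1))

# Principal component scores, directly from the SVD factors
Z = X @ Vt[:3].T    # identical to U[:, :3] * s[:3]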
3.3 What PCA Does and Does Not Do
PCA finds the directions of maximum variance — the axes along which the data spread out most. This is useful when variance corresponds to signal, which is often the case.
But PCA has important limitations:
- Linearity: PCA can only find linear subspaces. If the data lie on a curved manifold, PCA will miss the structure or require many more components than the true dimensionality.
- Variance $\neq$ importance: PCA ranks directions by variance, but the most variable direction is not always the most scientifically interesting. Noise can dominate signal in variance.
- Orthogonality constraint: PCA forces its components to be orthogonal. This is mathematically convenient but has no physical justification — temperature and metallicity need not be orthogonal in spectral space.
- Global method: PCA uses the same linear projection everywhere. It cannot adapt to local structure in different parts of the data.
3.4 Choosing $K$
How many components should you keep? Common strategies:
- Scree plot: plot $\lambda_k$ vs. $k$ and look for an "elbow" where the eigenvalues drop sharply. Components before the elbow capture structure; components after capture noise.
- Cumulative variance threshold: keep enough components to explain some fraction of the variance (e.g., $f_K > 0.95$).
- Cross-validation: hold out some data entries, fit PCA on the rest, and measure reconstruction error on the held-out entries. The optimal $K$ minimises the held-out error.
None of these is definitive. The scree plot requires subjective judgement; the variance threshold is arbitrary; cross-validation is expensive but principled. In practice, try several values and check that your downstream analysis is not sensitive to the choice.
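The cumulative-variance rule is simple to implement. This sketch plants a known 5-dimensional signal in noisy 40-dimensional data (all sizes and the noise level are illustrative) and recovers $K$ at a 95% threshold.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy data: 5 strong latent directions plus isotropic noise
X = (rng.standard_normal((300, 5)) @ rng.standard_normal((5, 40))
     + 0.3 * rng.standard_normal((300, 40)))
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
eigvals = s ** 2 / (len(X) - 1)

# Cumulative fraction of variance explained, f_K
f = np.cumsum(eigvals) / eigvals.sum()

# Smallest K with f_K above the (arbitrary) 95% threshold
K = int(np.searchsorted(f, 0.95)) + 1
```

On this toy problem the threshold recovers the planted dimensionality; on real data, remember that the 0.95 is a convention, not a law.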
3.5 Interactive Visualisation
The following interactive visualisation (from a previous course) demonstrates PCA on a 3D point cloud. You can rotate the view, toggle the principal component axes, and project the data onto 2D to see how much structure is preserved — and what is lost.
4. Beyond Linearity: t-SNE
PCA is linear. It cannot unfold a Swiss roll, separate interleaved spirals, or reveal clusters that are obvious in local distance but invisible in global variance. For these tasks, you need nonlinear methods. The most influential is t-distributed Stochastic Neighbour Embedding (t-SNE).
4.1 The Core Idea
t-SNE defines a probability distribution over pairs of points in the high-dimensional space, then finds a low-dimensional embedding whose pairwise distribution matches it as closely as possible.
Step 1: High-dimensional similarities. For each pair of points $(i, j)$ in the original $D$-dimensional space, define a conditional probability:
$$p_{j|i} = \frac{\exp(-\|\mathbf{y}_i - \mathbf{y}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{y}_i - \mathbf{y}_k\|^2 / 2\sigma_i^2)}$$
This says: if you were standing at point $i$, how likely would you be to pick point $j$ as a neighbour, where "likelihood" falls off as a Gaussian with width $\sigma_i$? The symmetrised version is $p_{ij} = (p_{j|i} + p_{i|j}) / (2N)$.
The bandwidth $\sigma_i$ is set so that the effective number of neighbours — the perplexity — matches a user-specified value. Perplexity is defined as $2^{H(P_i)}$ where $H$ is the Shannon entropy of the conditional distribution. A perplexity of 30 means each point effectively has about 30 neighbours.
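The bandwidth search can be sketched as a bisection on $\sigma_i$: perplexity increases monotonically with $\sigma_i$, so a simple search converges quickly. The helper names and toy distances below are my own illustrations, not a reference implementation.

```python
import numpy as np

def conditional_probs(d2, sigma):
    """p_{j|i} for one point, given squared distances d2 to the others."""
    logits = -d2 / (2 * sigma ** 2)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(d2, target, tol=1e-4):
    """Bisect the bandwidth so that 2^H(P_i) matches `target`."""
    lo, hi = 1e-10, 1e10
    for _ in range(100):
        sigma = np.sqrt(lo * hi)           # geometric midpoint
        p = conditional_probs(d2, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        if abs(2 ** H - target) < tol:
            break
        if 2 ** H > target:                # too many effective neighbours
            hi = sigma
        else:
            lo = sigma
    return sigma

# Toy example: squared distances from one query point to 50 others
rng = np.random.default_rng(5)
d2 = np.sum((rng.standard_normal((50, 3)) - rng.standard_normal(3)) ** 2, axis=1)
sigma = sigma_for_perplexity(d2, target=10.0)
```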
Step 2: Low-dimensional similarities. Place the points at positions $\mathbf{x}_1, \ldots, \mathbf{x}_N$ in 2D (or 3D), and define:
$$q_{ij} = \frac{(1 + \|\mathbf{x}_i - \mathbf{x}_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|\mathbf{x}_k - \mathbf{x}_l\|^2)^{-1}}$$
This uses a Student-$t$ distribution (with one degree of freedom, i.e., a Cauchy distribution) instead of a Gaussian. The heavy tails of the $t$-distribution give distant points in high dimensions more room to spread out in the embedding, alleviating the "crowding problem" that plagues earlier methods.
Step 3: Optimise. Find the positions $\mathbf{x}_i$ that minimise the Kullback–Leibler divergence between $P$ and $Q$:
$$\text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
This is done by gradient descent. Since KL divergence is asymmetric, minimising $\text{KL}(P \| Q)$ penalises cases where $p_{ij}$ is large but $q_{ij}$ is small (i.e., nearby points in high-$D$ that are far apart in the embedding) more than the reverse. The effect is that t-SNE preserves local structure (nearby points stay nearby) but is not reliable for global structure (the distances between well-separated clusters are not meaningful).
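Steps 2 and 3 involve nothing more exotic than pairwise distances and a normalised ratio. A minimal NumPy sketch of the $q_{ij}$ matrix and the KL objective (the function names are my own, not from any library):

```python
import numpy as np

def student_t_similarities(X):
    """q_{ij} from embedding positions X, Student-t with one d.o.f."""
    d2 = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)               # q_{ii} is excluded
    return w / w.sum()                     # normalise over all pairs

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over pairs with nonzero p_{ij}."""
    m = P > 0
    return np.sum(P[m] * np.log(P[m] / (Q[m] + eps)))

rng = np.random.default_rng(6)
X = rng.standard_normal((20, 2))           # a candidate 2D embedding
Q = student_t_similarities(X)
```

A full t-SNE implementation would compute the gradient of `kl_divergence` with respect to `X` and descend it; libraries do this with momentum and an early-exaggeration phase.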
4.2 The Perplexity Hyperparameter
Perplexity controls the balance between local and global structure, and it has a dramatic effect on the output:
- Low perplexity ($\sim 5$): each point only cares about its immediate neighbours. The embedding shows tight, fragmented clusters. False structure can appear.
- High perplexity ($\sim 100$): each point considers many neighbours. The embedding shows broader, more diffuse structure. True local clusters may merge.
- Very high perplexity ($\to N$): the embedding approaches a global method and begins to resemble PCA.
There is no "correct" perplexity. Best practice is to run t-SNE at several perplexity values and check which features are stable across runs.
4.3 Practical Considerations
- Non-deterministic: t-SNE uses random initialisation, so different runs produce different embeddings. Always check reproducibility across random seeds.
- No out-of-sample mapping: t-SNE does not define a function $f: \mathbb{R}^D \to \mathbb{R}^2$. You cannot project new data points without re-running the entire algorithm. This makes t-SNE a visualisation tool, not a compression tool.
- Computational cost: naive t-SNE is $O(N^2)$; the Barnes–Hut approximation reduces this to $O(N \log N)$ for large datasets.
- Preprocessing: it is common to run PCA first (reducing to 50–100 dimensions) and then apply t-SNE to the PCA scores. This speeds up computation and removes noise in high-variance directions.
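The PCA-then-t-SNE workflow described above might look like the following, assuming scikit-learn is available; the toy data and the parameter choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)
# Toy "spectra": three clusters in a 200-dimensional space
centres = 5 * rng.standard_normal((3, 200))
Y = np.vstack([c + rng.standard_normal((100, 200)) for c in centres])

# Step 1: PCA to 50 dimensions removes noise and speeds up t-SNE
scores = PCA(n_components=50).fit_transform(Y)

# Step 2: t-SNE on the PCA scores; fix the seed for reproducibility
emb = TSNE(n_components=2, perplexity=30,
           init="pca", random_state=0).fit_transform(scores)
```

Note that `fit_transform` is the only option: consistent with the no-out-of-sample caveat above, t-SNE has no `transform` for new points.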
5. UMAP: Uniform Manifold Approximation and Projection
UMAP is a more recent nonlinear embedding method that often produces similar visualisations to t-SNE but with several practical advantages:
- Speed: UMAP is significantly faster than t-SNE, especially for large datasets ($N > 10^5$).
- Global structure: UMAP tends to preserve more of the large-scale structure. Relative positions of clusters in a UMAP embedding are more meaningful than in t-SNE (though still not quantitatively reliable).
- Out-of-sample mapping: UMAP learns an approximate function that can project new points without re-running the full algorithm.
- Theoretical foundation: UMAP is motivated by Riemannian geometry and algebraic topology. It constructs a fuzzy simplicial complex (a weighted graph) from the high-dimensional data, then optimises a low-dimensional graph to match it. The practical effect is similar to t-SNE — preserve local neighbourhoods — but the mathematical framework is different.
5.1 Key Hyperparameters
UMAP's main hyperparameter is n_neighbors, which plays a role analogous to perplexity in t-SNE:
- Small n_neighbors ($\sim 5$): emphasises very local structure. Fine-grained clusters appear.
- Large n_neighbors ($\sim 200$): captures more global topology. Clusters merge and the embedding smooths out.
The min_dist parameter controls how tightly points can pack in the embedding:
- Small min_dist ($\sim 0.0$): points are allowed to clump tightly, producing dense clusters with clear separation.
- Large min_dist ($\sim 1.0$): points are spread more evenly, preserving the local manifold structure but reducing visual cluster separation.
5.2 UMAP vs. t-SNE
| | t-SNE | UMAP |
|---|---|---|
| Speed | $O(N \log N)$ with Barnes–Hut | Faster for large $N$ |
| Global structure | Poor | Better (but still approximate) |
| Out-of-sample | No | Yes |
| Deterministic | No | Approximately (with fixed seed) |
| Cluster sizes | Not meaningful | More meaningful |
In practice, both methods are used for exploratory visualisation, and the same caveats apply to both: always verify apparent structure with a generative model or statistical test. Do not publish a t-SNE or UMAP plot as evidence that clusters "exist" without further validation.
6. Choosing the Right Method
The methods in this lecture form a spectrum from interpretable to flexible:
| Method | Linear? | Preserves | Defines a mapping? | Use when... |
|---|---|---|---|---|
| PCA | Yes | Global variance | Yes | You need compression, denoising, or a first look at structure |
| t-SNE | No | Local distances | No | You want to visualise clusters or local neighbourhoods |
| UMAP | No | Local + some global | Yes | You need fast, scalable embedding with better global layout |
A common workflow is:
- Start with PCA to understand the gross structure, choose a reasonable number of components, and remove noise
- Apply t-SNE or UMAP to the PCA scores for visualisation
- Fit a generative model (e.g., a Gaussian mixture model from lecture 13) to the PCA scores or original data to test whether the apparent clusters are real
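The validation step in this workflow can be sketched with a Gaussian mixture and BIC-based model comparison, assuming scikit-learn; the toy data and the range of component counts are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
# Toy data with three well-separated clusters in 50 dimensions
centres = 4 * rng.standard_normal((3, 50))
Y = np.vstack([c + rng.standard_normal((150, 50)) for c in centres])

scores = PCA(n_components=5).fit_transform(Y)

# Fit mixtures with different component counts and compare by BIC
bics = {k: GaussianMixture(n_components=k, random_state=0)
             .fit(scores).bic(scores)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
```

If the BIC-preferred number of components matches the clusters you see in the embedding, that is evidence the structure is real rather than an artefact of the projection.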
The key principle is: use linear methods for measurement, nonlinear methods for visualisation, and generative models for inference.
7. Summary
- Latent variable models posit that observed high-dimensional data are generated by a smaller number of unobserved continuous variables.
- Matrix factorisation is the algebraic framework: decompose $\mathbf{Y} \approx \mathbf{Z}\mathbf{W}^\top$ to find low-rank structure.
- PCA finds the linear subspace of maximum variance via eigendecomposition of the covariance matrix (or equivalently, the SVD). It is fast, interpretable, and globally optimal for linear reconstruction — but cannot capture nonlinear structure.
- t-SNE preserves local neighbourhood structure through a probabilistic embedding that minimises KL divergence. It excels at revealing clusters but its output is not metrically faithful and depends on the perplexity hyperparameter.
- UMAP achieves similar goals to t-SNE with better speed, global structure preservation, and out-of-sample support.
- No embedding method replaces a generative model. Use dimensionality reduction for exploration and visualisation; use mixture models (lecture 13), hierarchical models (lecture 12), or Bayesian model comparison (lecture 15) for inference.