Latent Variable Models
Lecture 13 introduced latent discrete variables — unobserved labels that assign each data point to a mixture component. This lecture introduces latent continuous variables: unobserved quantities that drive the correlations you see in high-dimensional data. The central idea is dimensionality reduction — the possibility that a dataset living in $D$ dimensions is actually controlled by a much smaller number, $K \ll D$, of hidden factors.
This idea is everywhere in astronomy. Stellar spectra have thousands of wavelength pixels, but the variation between stars is dominated by a handful of physical parameters: effective temperature, surface gravity, metallicity. Galaxy spectral energy distributions span dozens of photometric bands, but a few latent factors — stellar mass, star formation history, dust — explain most of the variance. If you can discover those latent factors from the data alone, without knowing the physics in advance, you have a powerful tool for exploration, compression, and discovery.
We will build up from matrix factorisation (the algebraic backbone), through Principal Component Analysis (the workhorse), to nonlinear embedding methods (t-SNE and UMAP) that reveal structure PCA cannot see. Along the way, the recurring theme is the tension between linear methods that are interpretable and globally faithful, and nonlinear methods that capture local structure but can mislead.
1. The Latent Variable Idea
1.1 From Mixture Models to Continuous Latent Variables
In a mixture model, each data point $y_i$ is associated with a discrete latent variable $z_i \in \{1, \ldots, K\}$. The generative story is: draw a label, then draw the observation conditioned on that label. Now replace the discrete label with a continuous latent vector $\mathbf{z}_i \in \mathbb{R}^K$:
- Draw a latent vector: $\mathbf{z}_i \sim p(\mathbf{z})$
- Generate the observation: $\mathbf{y}_i \sim p(\mathbf{y} \mid \mathbf{z}_i)$
If $K < D$, the model asserts that the high-dimensional observations are generated by a low-dimensional process. The latent variables $\mathbf{z}_i$ are the "true" coordinates; the observations $\mathbf{y}_i$ are a noisy, high-dimensional reflection of them.
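This generative story is easy to simulate. The sketch below draws latent vectors, pushes them through a fixed linear map, and adds noise; the sizes, the map, and the noise level are illustrative assumptions, not a specific model from this lecture. The singular values of the resulting data matrix reveal the low intrinsic dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 20, 3          # illustrative sizes (hypothetical)

# Draw latent vectors z_i ~ N(0, I_K)
Z = rng.standard_normal((N, K))

# A fixed linear map from latent space to observation space
W = rng.standard_normal((D, K))

# Generate observations y_i = W z_i + noise
Y = Z @ W.T + 0.1 * rng.standard_normal((N, D))

# Although Y is D-dimensional, only K singular values are large;
# the rest sit at the noise level.
s = np.linalg.svd(Y, compute_uv=False)
```

The sharp drop after the third singular value is the data telling you its intrinsic dimensionality.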
1.2 Why This Matters
Consider a spectroscopic survey that measures flux at $D = 4000$ wavelength pixels for each star. If stellar spectra are controlled by three physical parameters (temperature, gravity, metallicity), then the $N \times 4000$ data matrix has an intrinsic dimensionality of roughly 3 — the 4000-dimensional data points lie near a 3-dimensional surface in pixel space. Discovering that surface, and the coordinates on it, is the goal of latent variable modelling.
The practical payoffs are:
- Compression: represent each spectrum by 3 numbers instead of 4000
- Visualisation: plot 3D structure that is invisible in any single pixel
- Denoising: project noisy observations onto the low-dimensional surface
- Anomaly detection: points far from the surface are unusual and worth investigating
2. Matrix Factorisation
2.1 The Basic Idea
The algebraic foundation of most linear dimensionality reduction methods is matrix factorisation: decomposing a data matrix into a product of smaller matrices.
Let $\mathbf{Y}$ be an $N \times D$ data matrix (rows are observations, columns are features). A rank-$K$ factorisation writes:
$$\mathbf{Y} \approx \mathbf{Z}\mathbf{W}^\top$$
where $\mathbf{Z}$ is $N \times K$ (the latent representations, or scores) and $\mathbf{W}$ is $D \times K$ (the loading matrix, or basis vectors). Each row of $\mathbf{Z}$ gives the $K$-dimensional coordinates of one data point; each column of $\mathbf{W}$ is a basis vector in the original $D$-dimensional space.
This is a latent variable model in disguise: the observation $\mathbf{y}_i$ is generated by taking a linear combination of $K$ basis vectors, with coefficients given by $\mathbf{z}_i$.
2.2 A Concrete Example: Movie Recommendations
To build intuition, consider the Netflix problem. You have an $N \times D$ matrix of user–movie ratings, where most entries are missing. A matrix factorisation model assumes each user $i$ has a latent preference vector $\mathbf{z}_i \in \mathbb{R}^K$ and each movie $j$ has a latent attribute vector $\mathbf{w}_j \in \mathbb{R}^K$, and the rating is:
$$r_{ij} \approx \mathbf{z}_i^\top \mathbf{w}_j$$
The latent dimensions might correspond to interpretable axes — "action vs. drama", "mainstream vs. indie" — but they are learned from the data, not defined in advance. Fitting the model means minimising a loss function like:
$$\min_{\mathbf{Z}, \mathbf{W}} \sum_{(i,j) \in \text{observed}} (r_{ij} - \mathbf{z}_i^\top \mathbf{w}_j)^2 + \lambda(\|\mathbf{Z}\|^2 + \|\mathbf{W}\|^2)$$
where $\lambda$ is a regularisation parameter that prevents overfitting to the sparse observations.
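One common way to minimise this loss is alternating least squares: hold $\mathbf{W}$ fixed and solve for each user's $\mathbf{z}_i$ (a small ridge regression over the movies they rated), then hold $\mathbf{Z}$ fixed and solve for each movie's $\mathbf{w}_j$. A minimal sketch on a synthetic ratings matrix; the sizes, the observation mask, and the $\lambda$ value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K, lam = 30, 20, 2, 0.1     # toy sizes and regularisation (hypothetical)

# Toy ratings from a planted rank-K model, with roughly half the entries observed
Z_true = rng.standard_normal((N, K))
W_true = rng.standard_normal((D, K))
R = Z_true @ W_true.T
mask = rng.random((N, D)) < 0.5

Z = rng.standard_normal((N, K))
W = rng.standard_normal((D, K))

# Alternating least squares: each subproblem is a ridge regression
for _ in range(20):
    for i in range(N):
        Wi = W[mask[i]]                    # movies rated by user i
        Z[i] = np.linalg.solve(Wi.T @ Wi + lam * np.eye(K),
                               Wi.T @ R[i, mask[i]])
    for j in range(D):
        Zj = Z[mask[:, j]]                 # users who rated movie j
        W[j] = np.linalg.solve(Zj.T @ Zj + lam * np.eye(K),
                               Zj.T @ R[mask[:, j], j])

# Reconstruction error on the observed entries
rmse = np.sqrt(np.mean((R - Z @ W.T)[mask] ** 2))
```

Each inner solve is $K \times K$, so the cost scales with the number of observed entries rather than $N \times D$.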
2.3 The Singular Value Decomposition
The optimal rank-$K$ factorisation (in the least-squares sense) is given by the singular value decomposition (SVD). Any $N \times D$ matrix $\mathbf{Y}$ can be written as:
$$\mathbf{Y} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$$
where $\mathbf{U}$ is $N \times N$ (left singular vectors), $\boldsymbol{\Sigma}$ is $N \times D$ (diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$), and $\mathbf{V}$ is $D \times D$ (right singular vectors). The best rank-$K$ approximation is obtained by keeping only the $K$ largest singular values:
$$\mathbf{Y} \approx \mathbf{U}_K \boldsymbol{\Sigma}_K \mathbf{V}_K^\top$$
This is the Eckart–Young theorem: no other rank-$K$ matrix is closer to $\mathbf{Y}$ in the Frobenius norm.
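The truncated SVD and the Eckart–Young property can be checked directly in NumPy (the matrix sizes below are illustrative): the Frobenius error of the rank-$K$ approximation equals the root sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((100, 50))

U, s, Vt = np.linalg.svd(Y, full_matrices=False)

K = 10
Y_K = U[:, :K] * s[:K] @ Vt[:K]    # best rank-K approximation

# Eckart-Young: the Frobenius error is exactly the discarded spectrum
err = np.linalg.norm(Y - Y_K)
assert np.isclose(err, np.sqrt(np.sum(s[K:] ** 2)))
```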
3. Principal Component Analysis
PCA is the most widely used dimensionality reduction method in the sciences, and it is nothing more than the SVD applied to mean-centred data.
3.1 The Algorithm
Given an $N \times D$ data matrix $\mathbf{Y}$:
- Centre the data: subtract the column means to get $\mathbf{X} = \mathbf{Y} - \bar{\mathbf{Y}}$
- Compute the covariance matrix: $\mathbf{C} = \frac{1}{N-1}\mathbf{X}^\top\mathbf{X}$
- Eigendecompose: $\mathbf{C} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top$, where $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0$
- Project: the principal component scores are $\mathbf{Z} = \mathbf{X}\mathbf{V}_K$, where $\mathbf{V}_K$ is the $D \times K$ matrix of the first $K$ eigenvectors
The eigenvectors $\mathbf{v}_k$ are the principal directions — orthogonal axes in the original space, ordered by the amount of variance they explain. The eigenvalues $\lambda_k$ give the variance along each direction. The fraction of variance explained by the first $K$ components is:
$$f_K = \frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{D} \lambda_k}$$
3.2 Connection to the SVD
The eigenvectors of the covariance matrix are the right singular vectors of $\mathbf{X}$, and the eigenvalues are related to the singular values by $\lambda_k = \sigma_k^2 / (N-1)$. In practice, most software computes PCA via the SVD rather than by forming and eigendecomposing the covariance matrix, because the SVD is numerically more stable.
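Both routes can be verified numerically. This sketch (on illustrative data) computes PCA via the SVD of the centred data, then confirms that the covariance eigenvalues satisfy $\lambda_k = \sigma_k^2 / (N-1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.standard_normal((200, 8))

# Centre, then take the SVD of the centred data
X = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of the covariance matrix, sorted descending
C = X.T @ X / (len(X) - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

# The two routes agree: lambda_k = sigma_k^2 / (N - 1)
assert np.allclose(eigvals, s ** 2 / (len(X) - 1))

# Principal component scores, directly from the SVD factors
Z = X @ Vt[:3].T    # identical to U[:, :3] * s[:3]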
3.3 What PCA Does and Does Not Do
PCA finds the directions of maximum variance — the axes along which the data spread out most. This is useful when variance corresponds to signal, which is often the case.
But PCA has important limitations:
- Linearity: PCA can only find linear subspaces. If the data lie on a curved manifold, PCA will miss the structure or require many more components than the true dimensionality.
- Variance $\neq$ importance: PCA ranks directions by variance, but the most variable direction is not always the most scientifically interesting. Noise can dominate signal in variance.
- Orthogonality constraint: PCA forces its components to be orthogonal. This is mathematically convenient but has no physical justification — temperature and metallicity need not be orthogonal in spectral space.
- Global method: PCA uses the same linear projection everywhere. It cannot adapt to local structure in different parts of the data.
3.4 Choosing $K$
How many components should you keep? Common strategies:
- Scree plot: plot $\lambda_k$ vs. $k$ and look for an "elbow" where the eigenvalues drop sharply. Components before the elbow capture structure; components after capture noise.
- Cumulative variance threshold: keep enough components to explain some fraction of the variance (e.g., $f_K > 0.95$).
- Cross-validation: hold out some data entries, fit PCA on the rest, and measure reconstruction error on the held-out entries. The optimal $K$ minimises the held-out error.
None of these is definitive. The scree plot requires subjective judgement; the variance threshold is arbitrary; cross-validation is expensive but principled. In practice, try several values and check that your downstream analysis is not sensitive to the choice.
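The cumulative-variance rule is simple to implement. This sketch plants a known 5-dimensional signal in noisy 40-dimensional data (all sizes and the noise level are illustrative) and recovers $K$ at a 95% threshold.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy data: 5 strong latent directions plus isotropic noise
X = (rng.standard_normal((300, 5)) @ rng.standard_normal((5, 40))
     + 0.3 * rng.standard_normal((300, 40)))
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
eigvals = s ** 2 / (len(X) - 1)

# Cumulative fraction of variance explained, f_K
f = np.cumsum(eigvals) / eigvals.sum()

# Smallest K with f_K above the (arbitrary) 95% threshold
K = int(np.searchsorted(f, 0.95)) + 1
```

On this toy problem the threshold recovers the planted dimensionality; on real data, remember that the 0.95 is a convention, not a law.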
3.5 Interactive Visualisation
The following interactive visualisation (from a previous course) demonstrates PCA on a 3D point cloud. You can rotate the view, toggle the principal component axes, and project the data onto 2D to see how much structure is preserved — and what is lost.
4. Beyond Linearity: t-SNE
PCA is linear. It cannot unfold a Swiss roll, separate interleaved spirals, or reveal clusters that are obvious in local distance but invisible in global variance. For these tasks, you need nonlinear methods. The most influential is t-distributed Stochastic Neighbour Embedding (t-SNE).
4.1 The Core Idea
t-SNE defines a probability distribution over pairs of points in the high-dimensional space, then finds a low-dimensional embedding whose pairwise distribution matches it as closely as possible.
Step 1: High-dimensional similarities. For each pair of points $(i, j)$ in the original $D$-dimensional space, define a conditional probability:
$$p_{j|i} = \frac{\exp(-\|\mathbf{y}_i - \mathbf{y}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{y}_i - \mathbf{y}_k\|^2 / 2\sigma_i^2)}$$
This says: if you were standing at point $i$, how likely would you be to pick point $j$ as a neighbour, where "likelihood" falls off as a Gaussian with width $\sigma_i$? The symmetrised version is $p_{ij} = (p_{j|i} + p_{i|j}) / (2N)$.
The bandwidth $\sigma_i$ is set so that the effective number of neighbours — the perplexity — matches a user-specified value. Perplexity is defined as $2^{H(P_i)}$ where $H$ is the Shannon entropy of the conditional distribution. A perplexity of 30 means each point effectively has about 30 neighbours.
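The bandwidth search can be sketched as a bisection on $\sigma_i$: perplexity increases monotonically with $\sigma_i$, so a simple search converges quickly. The helper names and toy distances below are my own illustrations, not a reference implementation.

```python
import numpy as np

def conditional_probs(d2, sigma):
    """p_{j|i} for one point, given squared distances d2 to the others."""
    logits = -d2 / (2 * sigma ** 2)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(d2, target, tol=1e-4):
    """Bisect the bandwidth so that 2^H(P_i) matches `target`."""
    lo, hi = 1e-10, 1e10
    for _ in range(100):
        sigma = np.sqrt(lo * hi)           # geometric midpoint
        p = conditional_probs(d2, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        if abs(2 ** H - target) < tol:
            break
        if 2 ** H > target:                # too many effective neighbours
            hi = sigma
        else:
            lo = sigma
    return sigma

# Toy example: squared distances from one query point to 50 others
rng = np.random.default_rng(5)
d2 = np.sum((rng.standard_normal((50, 3)) - rng.standard_normal(3)) ** 2, axis=1)
sigma = sigma_for_perplexity(d2, target=10.0)
```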
Step 2: Low-dimensional similarities. Place the points at positions $\mathbf{x}_1, \ldots, \mathbf{x}_N$ in 2D (or 3D), and define:
$$q_{ij} = \frac{(1 + \|\mathbf{x}_i - \mathbf{x}_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|\mathbf{x}_k - \mathbf{x}_l\|^2)^{-1}}$$
This uses a Student-$t$ distribution (with one degree of freedom, i.e., a Cauchy distribution) instead of a Gaussian. The heavy tails of the $t$-distribution give distant points in high dimensions more room to spread out in the embedding, alleviating the "crowding problem" that plagues earlier methods.
Step 3: Optimise. Find the positions $\mathbf{x}_i$ that minimise the Kullback–Leibler divergence between $P$ and $Q$:
$$\text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
This is done by gradient descent. Since KL divergence is asymmetric, minimising $\text{KL}(P \| Q)$ penalises cases where $p_{ij}$ is large but $q_{ij}$ is small (i.e., nearby points in high-$D$ that are far apart in the embedding) more than the reverse. The effect is that t-SNE preserves local structure (nearby points stay nearby) but is not reliable for global structure (the distances between well-separated clusters are not meaningful).
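Steps 2 and 3 involve nothing more exotic than pairwise distances and a normalised ratio. A minimal NumPy sketch of the $q_{ij}$ matrix and the KL objective (the function names are my own, not from any library):

```python
import numpy as np

def student_t_similarities(X):
    """q_{ij} from embedding positions X, Student-t with one d.o.f."""
    d2 = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)               # q_{ii} is excluded
    return w / w.sum()                     # normalise over all pairs

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over pairs with nonzero p_{ij}."""
    m = P > 0
    return np.sum(P[m] * np.log(P[m] / (Q[m] + eps)))

rng = np.random.default_rng(6)
X = rng.standard_normal((20, 2))           # a candidate 2D embedding
Q = student_t_similarities(X)
```

A full t-SNE implementation would compute the gradient of `kl_divergence` with respect to `X` and descend it; libraries do this with momentum and an early-exaggeration phase.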
4.2 The Perplexity Hyperparameter
Perplexity controls the balance between local and global structure, and it has a dramatic effect on the output:
- Low perplexity ($\sim 5$): each point only cares about its immediate neighbours. The embedding shows tight, fragmented clusters. False structure can appear.
- High perplexity ($\sim 100$): each point considers many neighbours. The embedding shows broader, more diffuse structure. True local clusters may merge.
- Very high perplexity ($\to N$): the embedding approaches a global method and begins to resemble PCA.
There is no "correct" perplexity. Best practice is to run t-SNE at several perplexity values and check which features are stable across runs.
4.3 Practical Considerations
- Non-deterministic: t-SNE uses random initialisation, so different runs produce different embeddings. Always check reproducibility across random seeds.
- No out-of-sample mapping: t-SNE does not define a function $f: \mathbb{R}^D \to \mathbb{R}^2$. You cannot project new data points without re-running the entire algorithm. This makes t-SNE a visualisation tool, not a compression tool.
- Computational cost: naive t-SNE is $O(N^2)$; the Barnes–Hut approximation reduces this to $O(N \log N)$ for large datasets.
- Preprocessing: it is common to run PCA first (reducing to 50–100 dimensions) and then apply t-SNE to the PCA scores. This speeds up computation and removes noise in high-variance directions.
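The PCA-then-t-SNE workflow described above might look like the following, assuming scikit-learn is available; the toy data and the parameter choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)
# Toy "spectra": three clusters in a 200-dimensional space
centres = 5 * rng.standard_normal((3, 200))
Y = np.vstack([c + rng.standard_normal((100, 200)) for c in centres])

# Step 1: PCA to 50 dimensions removes noise and speeds up t-SNE
scores = PCA(n_components=50).fit_transform(Y)

# Step 2: t-SNE on the PCA scores; fix the seed for reproducibility
emb = TSNE(n_components=2, perplexity=30,
           init="pca", random_state=0).fit_transform(scores)
```

Note that `fit_transform` is the only option: consistent with the no-out-of-sample caveat above, t-SNE has no `transform` for new points.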
5. UMAP: Uniform Manifold Approximation and Projection
UMAP is a more recent nonlinear embedding method that often produces similar visualisations to t-SNE but with several practical advantages:
- Speed: UMAP is significantly faster than t-SNE, especially for large datasets ($N > 10^5$).
- Global structure: UMAP tends to preserve more of the large-scale structure. Relative positions of clusters in a UMAP embedding are more meaningful than in t-SNE (though still not quantitatively reliable).
- Out-of-sample mapping: UMAP learns an approximate function that can project new points without re-running the full algorithm.
- Theoretical foundation: UMAP is motivated by Riemannian geometry and algebraic topology. It constructs a fuzzy simplicial complex (a weighted graph) from the high-dimensional data, then optimises a low-dimensional graph to match it. The practical effect is similar to t-SNE — preserve local neighbourhoods — but the mathematical framework is different.
5.1 Key Hyperparameters
UMAP's main hyperparameter is n_neighbors, which plays a role analogous to perplexity in t-SNE:
- Small n_neighbors ($\sim 5$): emphasises very local structure. Fine-grained clusters appear.
- Large n_neighbors ($\sim 200$): captures more global topology. Clusters merge and the embedding smooths out.
The min_dist parameter controls how tightly points can pack in the embedding:
- Small min_dist ($\sim 0.0$): points are allowed to clump tightly, producing dense clusters with clear separation.
- Large min_dist ($\sim 1.0$): points are spread more evenly, preserving the local manifold structure but reducing visual cluster separation.
5.2 UMAP vs. t-SNE
| | t-SNE | UMAP |
|---|---|---|
| Speed | $O(N \log N)$ with Barnes–Hut | Faster for large $N$ |
| Global structure | Poor | Better (but still approximate) |
| Out-of-sample | No | Yes |
| Deterministic | No | Approximately (with fixed seed) |
| Cluster sizes | Not meaningful | More meaningful |
In practice, both methods are used for exploratory visualisation, and the same caveats apply to both: always verify apparent structure with a generative model or statistical test. Do not publish a t-SNE or UMAP plot as evidence that clusters "exist" without further validation.
6. Choosing the Right Method
The methods in this lecture form a spectrum from interpretable to flexible:
| Method | Linear? | Preserves | Defines a mapping? | Use when... |
|---|---|---|---|---|
| PCA | Yes | Global variance | Yes | You need compression, denoising, or a first look at structure |
| t-SNE | No | Local distances | No | You want to visualise clusters or local neighbourhoods |
| UMAP | No | Local + some global | Yes | You need fast, scalable embedding with better global layout |
A common workflow is:
- Start with PCA to understand the gross structure, choose a reasonable number of components, and remove noise
- Apply t-SNE or UMAP to the PCA scores for visualisation
- Fit a generative model (e.g., a Gaussian mixture model from lecture 13) to the PCA scores or original data to test whether the apparent clusters are real
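The validation step in this workflow can be sketched with a Gaussian mixture and BIC-based model comparison, assuming scikit-learn; the toy data and the range of component counts are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
# Toy data with three well-separated clusters in 50 dimensions
centres = 4 * rng.standard_normal((3, 50))
Y = np.vstack([c + rng.standard_normal((150, 50)) for c in centres])

scores = PCA(n_components=5).fit_transform(Y)

# Fit mixtures with different component counts and compare by BIC
bics = {k: GaussianMixture(n_components=k, random_state=0)
             .fit(scores).bic(scores)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
```

If the BIC-preferred number of components matches the clusters you see in the embedding, that is evidence the structure is real rather than an artefact of the projection.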
The key principle is: use linear methods for measurement, nonlinear methods for visualisation, and generative models for inference.
7. Summary
- Latent variable models posit that observed high-dimensional data are generated by a smaller number of unobserved continuous variables.
- Matrix factorisation is the algebraic framework: decompose $\mathbf{Y} \approx \mathbf{Z}\mathbf{W}^\top$ to find low-rank structure.
- PCA finds the linear subspace of maximum variance via eigendecomposition of the covariance matrix (or equivalently, the SVD). It is fast, interpretable, and globally optimal for linear reconstruction — but cannot capture nonlinear structure.
- t-SNE preserves local neighbourhood structure through a probabilistic embedding that minimises KL divergence. It excels at revealing clusters but its output is not metrically faithful and depends on the perplexity hyperparameter.
- UMAP achieves similar goals to t-SNE with better speed, global structure preservation, and out-of-sample support.
- No embedding method replaces a generative model. Use dimensionality reduction for exploration and visualisation; use mixture models (lecture 13), hierarchical models (lecture 12), or Bayesian model comparison (lecture 15) for inference.