Gaussian processes

Data analysis and machine learning


 

In the last class we used the odds ratio (or Bayes factor, assuming prior odds of unity) to evaluate whether a second-order polynomial model was preferred over a straight line model, given some data. In other words: we were trying to decide which function was a better description for the data. But we only considered two different functions! In this class you're going to be introduced to Gaussian processes, which represent the limit of what happens if you consider an infinite number of possible functions.🤯 Here we will specifically focus on Gaussian process regression, where we are fitting a model to some data, but Gaussian processes are used in many other tasks (e.g., classification).

Gaussian processes sit at a very important intersection between data analysis and machine learning: they are very popular in both "fields" because they are extremely flexible, (somewhat) interpretable, and probabilistic! So you should learn about them! (The bible for Gaussian processes is Rasmussen & Williams, 2006.)

An important preface

There are several ways to think of Gaussian processes. And there are many ways that Gaussian processes will relate to other concepts that you have covered before, including those from basic signal processing, probabilities, and quantum mechanics, to name a few. You will probably find that some parts of Gaussian processes are easy to understand, and some parts are extremely difficult to understand. For this reason I am going to explain Gaussian processes in two different ways. I encourage you to accept that there will be one way that is easier for you to understand than the alternative. And in the interests of time, I encourage you to be mentally prepared to accept something to be true without going through the rigorous linear algebra proof! (Although these notes will take you most of the way there.)

The two views of Gaussian processes that we will discuss in this class will be the weight-space view and the function-space view. We will introduce the weight-space view first because it has more mathematical rigour, and then we'll introduce the function-space view to help you conceptualise this crazy idea!

In both views the end result will be the same. Instead of modelling the data in data space, a Gaussian process models the correlation between data points based on the distance between those points. In other words, the covariance matrix describes the data rather than a prescribed model \(f(x)\).

\[ \newcommand\vectheta{\boldsymbol{\theta}} \newcommand\vecdata{\mathbf{y}} \newcommand\model{\boldsymbol{\mathcal{M}}} \newcommand\vecx{\boldsymbol{x}} \newcommand{\transpose}{^{\scriptscriptstyle \top}} \renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\cov}{\mathbf{C}} \]

Weight-space view

Let us imagine that we have some data that are Gaussian-distributed such that the log probability is \[ \log p(\vecdata|\vectheta,\vec{\sigma}_y,\vec{x}) = -\frac{1}{2}\left(\vecdata - \vec{\mu}(\vectheta)\right)\transpose\cov^{-1}(\vecdata - \vec{\mu}(\vectheta)) -\frac{1}{2}\log{|2\pi\cov|} \quad . \] We will assume that \(\vecdata\) is a one-dimensional vector of \(N\) data points and \(\vec{\mu}(\vectheta)\) describes the mean model given some parameters \(\vectheta\). We will also assume that there are a number of systematic effects that we can't explicitly write down functions for, but we know that those systematic effects are present in the data. An example could be that if your data are some time series observations of the brightness of a star measured from a satellite in space (an example stolen from this lecture by David W. Hogg), then the systematics could include things like the telescope pointing stability over time, or the temperature of the spacecraft over time, or countless other things. We might know some of these things (e.g., they are recorded with time) but we don't know the impact that each one of those potential systematics has. That is: we don't know the functional form or the weight or importance of each of those systematics.

Because we don't know the importance of each systematic, we will assign each systematic a weight \(w\). For brevity we will use \(\vec{\psi} \equiv \{\vec{\sigma}_y,\vec{x},\vec{A}\}\). We then re-write our log probability as \[ \log p(\vecdata|\vectheta,\vec{\psi},\vec{w}) = -\frac{1}{2}\left(\vecdata - \vec{\mu}(\vectheta) - \vec{A}\vec{w}\right)\transpose\cov^{-1}(\vecdata - \vec{\mu}(\vectheta)- \vec{A}\vec{w}) -\frac{1}{2}\log{|2\pi\cov|} \] where \(\vec{w}\) is an \(M\)-dimensional vector of unknown weights, and \(\vec{A}\) is an \(N \times M\) matrix of possible systematic effects, which we will call the design matrix. The design matrix \(\vec{A}\) could include any number of things, including telescope temperature \(T\), or temperature squared \(T^2\), or the number of pieces of toast I had for breakfast each day while the telescope observed the star. (The astute among you will realise that \(\vec{A}\) might also contain entries like \(\vec{x}\), \(\vec{x}^2\), \(\vec{x}^3\), et cetera; this becomes a polynomial model but remains a linear problem because we only construct the design matrix once!) For unrelated systematics (e.g., toast) we would expect the corresponding weight \(w\) to be near zero.
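As a concrete sketch of what building such a design matrix might look like in practice (the housekeeping quantities and their names below are entirely hypothetical, chosen only for illustration), each recorded systematic simply becomes a column of \(\vec{A}\):

import numpy as np

# Hypothetical housekeeping data recorded alongside the N observations.
N = 1000
t = np.linspace(0, 80, N)                              # observation times (days)
temperature = 20 + 0.5 * np.sin(2 * np.pi * t / 30)    # spacecraft temperature
pointing = np.random.normal(0, 0.1, N)                 # telescope pointing offset
toast = np.random.poisson(2, size=N)                   # pieces of toast (we expect w ~ 0)

# Each column of the design matrix A is one possible systematic effect.
A = np.vstack([
    np.ones(N),       # constant offset
    temperature,
    temperature**2,
    pointing,
    toast,
]).T
print(A.shape)        # (N, M) = (1000, 5)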

In practice we don't really want to know the weights. We just want to marginalise over them! You will recall, though, that in order to marginalise over some parameter, we must introduce a prior on that parameter. We will assume a multivariate Gaussian prior on the weights that is centered at zero, \[ \vec{w} \sim \mathcal{N}\left(\vec{0},\vec{\Lambda}\right) \] with some covariance matrix \(\vec{\Lambda}\), such that \[ \log{p(\vec{w})} = -\frac{1}{2}\vec{w}\transpose\vec{\Lambda}^{-1}\vec{w} - \frac{1}{2}\log{|2\pi\vec{\Lambda}|} \quad . \]

Using the product rule and marginalisation we can integrate out \(\vec{w}\) \[ p(\vec{y}|\vec{\theta},\vec{\psi}) = \int_{-\infty}^{+\infty} p(\vec{w})\,p(\vec{y}|\vectheta,\vec{\psi},\vec{w})\, d\vec{w} \] which expands to become something nasty looking: \[ p(\vec{y}|\vec{\theta},\vec{\psi}) = \int_{-\infty}^{+\infty} \frac{1}{|2\pi\vec{\Lambda}|^\frac{1}{2}} \exp\left(-\frac{1}{2}\vec{w}\transpose\vec{\Lambda}^{-1}\vec{w}\right) \frac{1}{|2\pi\cov|^\frac{1}{2}} \exp\left(-\frac{1}{2}\left[\vec{r}-\vec{A}\vec{w}\right]\transpose{}\cov^{-1}\left[\vec{r}-\vec{A}\vec{w}\right]\right) d\vec{w} \] where things look a bit nicer if we take \(\vec{r} = \vec{y} - \vec{\mu}(\vectheta) \) and write \[ p(\vec{y}|\vec{\theta},\vec{\psi}) = \frac{1}{|2\pi\vec{\Lambda}|^\frac{1}{2}|2\pi\cov|^\frac{1}{2}} \int_{-\infty}^{+\infty} \exp\left(-\frac{1}{2}\vec{z}\right) d\vec{w} \] where \[ \vec{z} = \vec{w}\transpose\vec{\Lambda}^{-1}\vec{w} + (\vec{r} - \vec{Aw})\transpose\cov^{-1}(\vec{r}-\vec{Aw}) \quad . \] Since this integral is over \(d\vec{w}\) and we know the magical properties of Gaussians (like the fact that the convolution of a Gaussian with a Gaussian is a Gaussian), we should try to re-cast this problem as \[ \vec{z} = (\vec{w} - \vec{h})\transpose\vec{\Sigma}^{-1}(\vec{w} - \vec{h}) + k \quad . \] To do this we need to "complete the square". If you don't care for the derivation, it can be shown that \[ \begin{eqnarray} \vec{\Sigma}^{-1} &=& \vec{\Lambda}^{-1} + \vec{A}\transpose\cov^{-1}\vec{A} \\ \vec{h} &=& \vec{\Sigma}\vec{A}\transpose\cov^{-1}\vec{r} \\ k &=& \vec{r}\transpose \left(\cov^{-1} - \cov^{-1}\vec{A}\vec{\Sigma}\vec{A}\transpose\cov^{-1}\right)\vec{r} \end{eqnarray} \] and as such we can re-write our integral as \[ p(\vecdata|\vectheta,\vec{\psi}) = \frac{1}{|2\pi\vec{\Lambda}|^\frac{1}{2}|2\pi\cov|^\frac{1}{2}}\exp\left(-\frac{k}{2}\right) \int_{-\infty}^{+\infty} \exp\left(-\frac{1}{2}[\vec{w} - \vec{h}]\transpose\vec{\Sigma}^{-1}[\vec{w}-\vec{h}]\right) d\vec{w} \] where, because multivariate Gaussians have a closed-form expression for the integral (Gaussians are magical), \[ \int_{-\infty}^{+\infty} \exp\left(-\frac{1}{2}[\vec{w} - \vec{h}]\transpose\vec{\Sigma}^{-1}[\vec{w}-\vec{h}]\right) d\vec{w} = |2\pi\vec{\Sigma}|^\frac{1}{2} \] and thanks to the Matrix Determinant Lemma and the Woodbury Identity we can rewrite \(|\vec{\Sigma}|\) and \(k\) as \[ |\vec{\Sigma}| = |\vec{\Sigma}^{-1}|^{-1} = \frac{|\vec{\Lambda}||\cov|}{|\cov + \vec{A}\vec{\Lambda}\vec{A}\transpose|} \] and \[ k = \vec{r}\transpose\left(\cov + \vec{A}\vec{\Lambda}\vec{A}\transpose\right)^{-1}\vec{r} \] which gives the expression for \(p(\vec{y}|\vec{\theta},\vec{\psi})\): \[ p(\vec{y}|\vec{\theta},\vec{\psi}) = \frac{1}{|2\pi(\cov + \vec{A}\vec{\Lambda}\vec{A}\transpose)|^\frac{1}{2}}\exp\left(-\frac{1}{2}\left[\vec{y}-\vec{\mu}(\vec{\theta})\right]\transpose(\cov + \vec{A}\vec{\Lambda}\vec{A}\transpose)^{-1}\left[\vec{y}-\vec{\mu}(\vec{\theta})\right]\right) \quad . \] In other words: marginalising over the unknown weights is equivalent to inflating the noise covariance matrix from \(\cov\) to \(\cov + \vec{A}\vec{\Lambda}\vec{A}\transpose\).
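If you would rather not work through the Matrix Determinant Lemma and Woodbury Identity steps by hand, here is a minimal numerical sanity check (with small, made-up matrices) that the two expressions for \(k\) and for \(|\vec{\Sigma}|\) above agree:

import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3

# Make up a positive definite noise covariance C, prior covariance Lambda,
# a design matrix A, and a residual vector r.
C = np.diag(rng.uniform(0.5, 2.0, N))
Lam = np.diag(rng.uniform(0.5, 2.0, M))
A = rng.normal(size=(N, M))
r = rng.normal(size=N)

# Sigma^{-1} = Lambda^{-1} + A^T C^{-1} A  (from completing the square).
Cinv = np.linalg.inv(C)
Sigma = np.linalg.inv(np.linalg.inv(Lam) + A.T @ Cinv @ A)

# k written two ways: before and after applying the Woodbury Identity.
k_long = r @ (Cinv - Cinv @ A @ Sigma @ A.T @ Cinv) @ r
k_short = r @ np.linalg.inv(C + A @ Lam @ A.T) @ r
print(np.allclose(k_long, k_short))   # True

# |Sigma| written two ways (Matrix Determinant Lemma).
lhs = np.linalg.det(Sigma)
rhs = np.linalg.det(Lam) * np.linalg.det(C) / np.linalg.det(C + A @ Lam @ A.T)
print(np.allclose(lhs, rhs))          # True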

It is worthwhile stopping for a moment to consider what we have done. We have introduced a design matrix \(\vec{A}\), which could include any number of \(M\) potential features (e.g., spacecraft temperature). Importantly, the design matrix could include features like \(\vec{x}\), \(\vec{x}^2\), ..., and any number of other basis functions. This still remains a linear problem (even if the design matrix contains non-linear terms of \(\vec{x}\)) because the design matrix is only calculated once. If we were to include \(M\) features that are all increasing terms of \(\vec{x}\) to some power, you can see that the weights \(\vec{w}\) that we have marginalised over are actually the coefficients of those \(\vec{x}^m\) terms. For example if the design matrix contained the entries \[ \vec{A}\transpose = [\vec{1}, \vec{x}, \vec{x}^2, \vec{x}^3, \vec{x}^4, \vec{x}^5, \cdots] \] then the dot product \(\vec{Aw}\) expands to \[ w_0 + w_1\vec{x} + w_2\vec{x}^2 + w_3\vec{x}^3 + w_4\vec{x}^4 + w_5\vec{x}^5 + \cdots \] which you will recognise as a polynomial problem, yet it remains a linear problem because \(\vec{A}\) is just an array that is calculated once and never changes.

You may be wondering about \(\vec{\Lambda}\): why did we introduce it, and why does it not appear alongside \(\vec{\theta}\) and \(\vec{\psi}\), given that the probability of \(\vec{y}\) is actually conditioned on \(\vec{\Lambda}\)? We're about to explain why, and things are going to get funky.

The kernel trick

We get to arbitrarily set the number \(M\) of features that we put into our design matrix \(\vec{A}\). What happens as \(M \rightarrow \infty\)? The shape of \(\vec{A}\) is \(N \times M\) and the size of \(\vec{\Lambda}\) is \(M \times M\), so we would have some infinitely large matrices to take inner products of. But what would be in our infinite set of features, and why would we want to create infinitely large matrices?

It turns out this isn't such a crazy idea, for two reasons. The first reason relates to the complexity of the data. Imagine if the data have some highly non-linear relationship. If your design matrix only included linear terms (e.g., like \(\vec{x}\)) then your model won't be able to capture those non-linear relationships. And because we are thinking about the Gaussian process in the context of modelling unknown (potentially non-linear) systematics, we do want to capture those relationships! But if we included an infinite number of possible features (e.g., either \(\vec{x}^m\) terms or better yet \(\sin(\vec{x})\) and \(\cos(\vec{x})\) terms) then the model will be able to capture all kinds of non-linearities in the data. That is to say: even if there is no linear relation in the data space, there may be linear relations in a higher dimensional feature space (in other words: there may be some combination of the infinite features in the design matrix, with "infinite" being larger than the number of observations and hence higher dimensional, where there are linear relations we can make use of), which would make it easier for us to make non-linear predictions in data space.

The second reason relates to a special property of \(\vec{A}\vec{\Lambda}\vec{A}\transpose\). You will notice that this term contains only inner products. It turns out that if you have an infinite dimensional set of features that enter only through inner products, then you don't need to explicitly build these matrices or take their inner products! Instead you can re-cast the problem as a function where the element-wise entries of the matrix \(\vec{A}\vec{\Lambda}\vec{A}\transpose\) can be computed directly \[ \left[\vec{A}\vec{\Lambda}\vec{A}\transpose\right]_{nn'} \rightarrow k(\vec{x}_n,\vec{x}_{n'}) \] without needing to explicitly construct those infinite dimensional matrices! As long as the element-wise evaluations of that function (which we call the kernel function) produce a positive semi-definite matrix (strictly, the kernel matrix is allowed to have zero eigenvalues, because it is added to the existing noise covariance matrix \(\cov\); all we really need is for the sum of the two to be invertible), then we're in business! This is an incredible result that is frequently referred to as the "kernel trick" because it is a trick in many ways. Think about it: we are introducing an infinite dimensional matrix of features to capture non-linearity in our model, and then replacing an inner product of infinite dimensional matrices with a single function that depends only on the two points \(\vec{x}_n\) and \(\vec{x}_{n'}\). The kernel trick is closely related to the properties of a Hilbert space (and Mercer's theorem): any inner product of feature vectors can be replaced by a function, as long as that function produces positive semi-definite matrices.
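To make the kernel trick feel a little less magical, here is a minimal numerical sketch (taking \(\vec{\Lambda} = \vec{I}\) for simplicity, and using random Fourier features, which are not otherwise covered in these notes) showing that the inner product of an explicit, ever-larger feature matrix converges to a simple function of the two input points; in this case the squared-exponential kernel \(k(x_n, x_{n'}) = \exp\left(-\frac{(x_n - x_{n'})^2}{2\ell^2}\right)\):

import numpy as np

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0, 10, 50))    # N = 50 input locations
ell = 2.0                              # kernel length scale

def feature_matrix(x, M, ell, rng):
    # An explicit (N x M) design matrix of random Fourier features whose
    # inner product approximates the squared-exponential kernel.
    omega = rng.normal(0, 1.0 / ell, M)      # random frequencies
    phase = rng.uniform(0, 2 * np.pi, M)     # random phases
    return np.sqrt(2.0 / M) * np.cos(np.outer(x, omega) + phase)

# The kernel function computes the (n, n') entries directly...
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)

# ...whereas A @ A.T requires explicitly building an N x M feature matrix.
for M in (10, 1000, 100000):
    A = feature_matrix(x, M, ell, rng)
    print(M, np.abs(A @ A.T - K_exact).max())
# The maximum error shrinks as M grows: the kernel function is the
# M -> infinity limit of the explicit inner product.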

What this means is that we can implicitly include an \(N \times \infty\) shaped design matrix \(\vec{A}\) of features without ever having to construct it, and without ever computing the inner product \(\vec{A}\vec{\Lambda}\vec{A}\transpose\)! It also means that we can ignore the dependence on \(\vec{\Lambda}\) (and \(\vec{A}\)) because we will re-write our problem as a function \(k(\vec{x}_{n},\vec{x}_{n'})\) instead of an inner product of infinite dimensional matrices.🤯

Before we introduce the function-space view it may be useful to see how predictions are actually made. Let us define the kernel function as \(k(\vec{x},\vec{x}')\); after computing the element-wise entries you can construct a covariance matrix \(\vec{K}\). Now let us re-cast our probability \(p(\vec{y}|\vec{\theta},\vec{\psi})\) as \[ p(\vec{y}|\vec{\theta},\vec{\psi}) = \frac{1}{|2\pi(\cov + \vec{K})|^\frac{1}{2}}\exp\left(-\frac{1}{2}\left[\vec{y} - \vec{\mu}(\vec{\theta})\right]\transpose\left[\cov + \vec{K}\right]^{-1}\left[\vec{y}-\vec{\mu}(\vec{\theta})\right]\right) \quad . \] We will denote \(\vec{y_*}\) as the \(y\)-value(s) we want to predict at given \(\vec{x_*}\) locations. To clarify: \(\vec{x}\) and \(\vec{y}\) represent observations, and \(\vec{y_*}\) are the predictions we want to make for the values of \(y\) at \(x\)-positions \(\vec{x_*}\). That is, we want to compute the probability of \(\vec{y_*}\) given \(\vec{x_*}\) and the existing set of observations \(\{\vec{x},\vec{y},\vec{\sigma_y}\}\), \[ p(\vec{y_*}|\vec{x_*}, \vec{y}, \vec{x}, \vec{\sigma_y}, \vec{\theta}) = \mathcal{N}\left(\vec{y_*}|\vec{\mu_*},\vec{V_*}\right) \] where it can be shown that \[ \begin{eqnarray} \vec{\mu_*} &=& \vec{\mu}_*(\vec{\theta}) + \vec{K_*}\left[\cov + \vec{K}\right]^{-1}\left[\vec{y} - \vec{\mu}(\vec{\theta})\right] \\ \vec{V_*} &=& \vec{K_{**}} - \vec{K_*}\left[\cov + \vec{K}\right]^{-1}\vec{K_*}\transpose \end{eqnarray} \] where \(\vec{\mu}_*(\vec{\theta})\) is the mean model evaluated at the \(\vec{x_*}\) locations, \(\vec{K_{**}}\) is the covariance matrix constructed for just the \(\vec{x_*}\) points, and \(\vec{K_*}\) is the covariance matrix constructed between the new points and the observed points. You can think of this as \(\vec{K_{**}}\) being the Gaussian process prior for the variance in the limit of having no observations, while the \(\vec{K_*}\left[\cov + \vec{K}\right]^{-1}\vec{K_*}\transpose\) term incorporates the observations (and their uncertainties) and the correlation between the observed and new data points.
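Here is a minimal from-scratch sketch of those predictive equations (assuming a zero mean function so that \(\vec{\mu}(\vec{\theta}) = \vec{0}\), homoscedastic noise, and a squared-exponential kernel; no GP library involved):

import numpy as np

def sqexp_kernel(xa, xb, amp=1.0, ell=1.0):
    # Squared-exponential kernel evaluated element-wise between two sets of points.
    return amp**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / ell**2)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 20))               # observed locations
sigma_y = 0.2                                     # per-point noise
y = np.sin(x) + rng.normal(0, sigma_y, x.size)    # noisy observations
x_star = np.linspace(0, 10, 200)                  # locations to predict at

K = sqexp_kernel(x, x)                            # K:    observed vs observed
K_star = sqexp_kernel(x_star, x)                  # K_*:  new vs observed
K_star_star = sqexp_kernel(x_star, x_star)        # K_**: new vs new
C = sigma_y**2 * np.eye(x.size)                   # noise covariance matrix

# Predictive mean and covariance, conditioned on the observations.
inv = np.linalg.inv(K + C)
mu_star = K_star @ inv @ y
V_star = K_star_star - K_star @ inv @ K_star.T
std_star = np.sqrt(np.diag(V_star))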

Before we dive deep into how we construct the covariance matrices \(\vec{K}\), or even what set of kernel functions \(k(\vec{x},\vec{x}')\) are allowed, let's introduce the function-space view.

Function-space view

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. You can imagine this as if all your data were drawn from a huge multivariate Gaussian distribution, and the covariance matrix of that distribution describes the data.

Let us imagine that there are two axes \(x\) and \(y\). We know that we will collect some data of \(x\) and \(y\), but before we measure any data there are literally an infinite number of functions that could generate data from \(x\) to \(y\). It could be a straight line, or a polynomial with an infinite number of terms! Before taking any data we have no idea what the true function is.

Gaussian processes assign a prior probability to each of the infinite number of possible functions. When we measure some data, the predictions we make for other locations (in \(x\) and \(y\)) are conditioned on the observations that we have seen. The more data we observe, the more confident we become in the predictions elsewhere. Consider this excellent example below from Distill, where initially our Gaussian process is set to have some variance in the \(y\) direction (which could be larger or smaller; it is set by our chosen kernel function), and we have no idea what the functional relationship is between \(x\) and \(y\).

Here you can see that when we start off with no data there is some variance in \(y\) (that we have set) and the mean of the model is \(y = 0\). But as soon as we add a (noiseless) data point, the Gaussian process is forced (by definition) to go through that data point. And our predictions for nearby values have suddenly changed, because our predictions are conditioned on the data set (or training set) that we have observed. In other words: the observed data points restrict the set of possible functions for the Gaussian process!

Let me write that again. A Gaussian process assigns a prior probability to each of the infinite number of possible functions. The observations restrict the set of possible functions!

We are used to thinking about data being one dimensional in the \(x\)-direction and one dimensional in the \(y\)-direction. Instead, let us imagine that the \(y\) data points are drawn from one huge multivariate Gaussian distribution, where the dimensionality of the Gaussian is the same as the number of \(y\) data points you have. Then imagine that there is some covariance between those dimensions (those \(y\) data points) such that the value of one data point \(y_2\) depends on the value of other data points (e.g., \(y_1\)), and that it depends on how far away \(x_1\) and \(x_2\) are. In a sense, nearby data points could be more correlated than distant points. If you describe your covariance matrix in the right way, the covariance matrix (and a mean function) can fully describe your data without having to have an explicit functional form between \(x\) and \(y\) — because the Gaussian process already includes the infinite number of possible functional forms!

Consider this below for just two data points \(x_1\) and \(x_2\):

Here, we have a two-dimensional normal distribution. Each dimension \(x_i\) is assigned an index \(i \in \{1, 2\}\). You can drag the handles to see how a particular sample (left) corresponds to functional values (right). This representation also allows us to understand the connection between the covariance and the resulting values: the underlying Gaussian distribution has a positive covariance between \(x_1\) and \(x_2\) — this means that \(x_2\) will increase as \(x_1\) gets larger, and vice versa. You can also drag the handles in the figure to the right and observe the probability of such a configuration in the figure to the left.

With a correctly defined covariance matrix you can predict the expectation value and variance of \(x_2\) given some observation \(x_1\). It's a little counter-intuitive to think about how the covariance matrix should look for various trivial problems (e.g., a straight line) but we will get there!
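For the two-point picture above, that prediction comes straight from the standard formula for conditioning a multivariate Gaussian. A minimal sketch, with an arbitrarily chosen covariance matrix:

import numpy as np

# A zero-mean bivariate Gaussian over (x1, x2) with positive covariance.
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

x1_observed = 1.5

# Condition x2 on the observed x1:
#   E[x2 | x1]   = Sigma_21 / Sigma_11 * x1
#   Var[x2 | x1] = Sigma_22 - Sigma_21 * Sigma_12 / Sigma_11
mean_x2 = Sigma[1, 0] / Sigma[0, 0] * x1_observed
var_x2 = Sigma[1, 1] - Sigma[1, 0] * Sigma[0, 1] / Sigma[0, 0]
print(mean_x2, var_x2)   # 1.05, 0.51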

Kernels

We have shown that a data set can be described by the structure in the covariance matrix rather than an explicit functional relationship. But we have not described what kernel functions are available — only that they must produce a valid covariance matrix \(\vec{K}\). In general your kernel function only needs to produce a positive semi-definite covariance matrix, but in practice there are a few kernels that are commonly used (or combined) for different problems.

Kernels are broadly described as either stationary or non-stationary. Stationary kernels depend only on the separation between points (e.g., \(r = |\vec{x}_n - \vec{x}_{n'}|\)), whereas non-stationary kernels also depend on the values of the input coordinates themselves (e.g., to model an effect at a specific location in \(\vec{x}\)).

You will see that each kernel has parameters (or rather: hyperparameters) associated with it. We call them hyperparameters because they do not directly describe the data. Instead they describe the structure of the covariance matrix that describes the data. These hyperparameters control things like the magnitude of the variance, and the length scale over which points become uncorrelated. For periodic kernels, a period hyperparameter is obviously required. These hyperparameters become unknown variables of your model (like \(\vec{\theta}\)) that you can either optimise or sample. But you can see that just a few hyperparameters give you a lot of model flexibility!
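To see how much flexibility a couple of hyperparameters buy you, here is a minimal sketch (not tied to any particular package) that draws functions from a zero-mean Gaussian process prior using a squared-exponential kernel, \(k(x, x') = a^2\exp\left(-\frac{(x-x')^2}{2\ell^2}\right)\), for a few different length scales \(\ell\):

import numpy as np
import matplotlib.pyplot as plt

def sqexp_kernel(xa, xb, amp=1.0, ell=1.0):
    return amp**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / ell**2)

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)

fig, ax = plt.subplots()
for ell in (0.3, 1.0, 3.0):
    # Build the covariance matrix from the kernel (tiny jitter keeps it numerically stable).
    K = sqexp_kernel(x, x, amp=1.0, ell=ell) + 1e-10 * np.eye(x.size)
    # Draw one function from the GP prior: y ~ N(0, K).
    y = rng.multivariate_normal(np.zeros(x.size), K)
    ax.plot(x, y, label=f"$\\ell = {ell}$")
ax.legend()
ax.set_xlabel("x")
ax.set_ylabel("y")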

If you hadn't already thought about it, you should realise that even though the weight-space view showed us that we were incorporating a design matrix of infinite dimensions, a single Gaussian process kernel function will not give you infinite flexibility. For example, a stationary kernel assumes the same behaviour at every location, and a periodic kernel is explicitly necessary if you want to capture periodicity in the underlying systematic effects you are modelling. Thankfully, kernels can be added or multiplied to produce different effects.

An example: atmospheric carbon dioxide

Atmospheric CO2 levels are a common example used to show how Gaussian process kernels can be combined to describe non-linear data. And while there are many great packages available that let you easily use Gaussian processes, this gives us the opportunity to introduce you to the excellent george package. Here we will exactly follow the example from the george documentation on Hyperparameter optimization.

First let's plot the data.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.datasets import co2

data = co2.load_pandas().data
t = 2000 + (np.array(data.index.to_julian_date()) - 2451545.0) / 365.25
y = np.array(data.co2)
m = np.isfinite(t) & np.isfinite(y) & (t < 1996)
t, y = t[m][::4], y[m][::4]

fig, ax = plt.subplots(figsize=(4, 4))
ax.plot(t, y, ".k")
ax.set_xlim(t.min(), t.max())
ax.set_xlabel(r"year")
ax.set_ylabel(r"CO$_2$ in ppm")
fig.tight_layout()

It is clear that there is a quasi-periodic signal here that is increasing with time. Instead of trying to find some functional relationship between time and CO2 level, following other examples we will use a zero-mean function and describe the data using a combination of multiple kernels \[ k(r) = k_1(r) + k_2(r) + k_3(r) + k_4(r) \] where \[ \begin{eqnarray} k_1(r) &=& \theta_1^2 \exp\left(-\frac{r^2}{2\theta_2}\right) \\ k_2(r) &=& \theta_3^2 \exp\left(-\frac{r^2}{2\theta_4} - \theta_5\sin^2\left(\frac{\pi{}r}{\theta_6}\right)\right) \\ k_3(r) &=& \theta_7^2 \left[1 + \frac{r^2}{2\theta_8\theta_9}\right]^{-\theta_8} \\ k_4(r) &=& \theta_{10}^2 \exp\left(-\frac{r^2}{2\theta_{11}}\right) + \theta_{12}^2\delta_{ij} \end{eqnarray} \]
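Before reaching for a package, it can help to see that this composite kernel is just a function of the separation \(r = t_i - t_j\), plus a white-noise term that only acts on the diagonal. A minimal numpy sketch of the element-wise evaluation, with the twelve \(\theta\) values passed in as a vector (values not specified here):

import numpy as np

def co2_kernel(ti, tj, theta):
    # Element-wise evaluation of k1 + k2 + k3 + k4 above; theta is a 12-vector.
    r = ti[:, None] - tj[None, :]
    k1 = theta[0]**2 * np.exp(-r**2 / (2 * theta[1]))
    k2 = theta[2]**2 * np.exp(-r**2 / (2 * theta[3])
                              - theta[4] * np.sin(np.pi * r / theta[5])**2)
    k3 = theta[6]**2 * (1 + r**2 / (2 * theta[7] * theta[8]))**(-theta[7])
    k4 = theta[9]**2 * np.exp(-r**2 / (2 * theta[10]))
    K = k1 + k2 + k3 + k4
    if ti.shape == tj.shape and np.allclose(ti, tj):
        K = K + theta[11]**2 * np.eye(ti.size)   # the delta_ij white-noise term
    return K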

Continuing to follow the george example, which is replicated here in parts, we will construct the kernels and optimise the hyperparameters starting from the result in Rasmussen & Williams:

import george
from george import kernels

k1 = 66**2 * kernels.ExpSquaredKernel(metric=67**2)
k2 = 2.4**2 * kernels.ExpSquaredKernel(90**2) * kernels.ExpSine2Kernel(gamma=2/1.3**2, log_period=0.0)
k3 = 0.66**2 * kernels.RationalQuadraticKernel(log_alpha=np.log(0.78), metric=1.2**2)
k4 = 0.18**2 * kernels.ExpSquaredKernel(1.6**2)
kernel = k1 + k2 + k3 + k4

gp = george.GP(kernel, mean=np.mean(y), fit_mean=True,
               white_noise=np.log(0.19**2), fit_white_noise=True)
gp.compute(t)
print(gp.log_likelihood(y))
print(gp.grad_log_likelihood(y))

import scipy.optimize as op

# Define the objective function (negative log-likelihood in this case).
def nll(p):
    gp.set_parameter_vector(p)
    ll = gp.log_likelihood(y, quiet=True)
    return -ll if np.isfinite(ll) else 1e25

# And the gradient of the objective function.
def grad_nll(p):
    gp.set_parameter_vector(p)
    return -gp.grad_log_likelihood(y, quiet=True)

# You need to compute the GP once before starting the optimization.
gp.compute(t)

# Print the initial ln-likelihood.
print(gp.log_likelihood(y))

# Run the optimization routine.
p0 = gp.get_parameter_vector()
results = op.minimize(nll, p0, jac=grad_nll, method="L-BFGS-B")

# Update the kernel and print the final log-likelihood.
gp.set_parameter_vector(results.x)
print(gp.log_likelihood(y))

Now we will use the Gaussian process to make predictions for future observations, conditioned on what we have already observed, and conditioned on our optimised values for the hyperparameters:

x = np.linspace(max(t), 2025, 2000)
mu, var = gp.predict(y, x, return_var=True)
std = np.sqrt(var)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(t, y, s=1, c="k")
ax.fill_between(x, mu+std, mu-std, color="g", alpha=0.5)
ax.set_xlim(t.min(), 2025)
ax.set_xlabel(r"year")
ax.set_ylabel(r"CO$_2$ in ppm")
fig.tight_layout()

Looks pretty good to me! But these predictions are conditioned on our point estimate of the kernel hyperparameters. We actually want to sample those parameters and make predictions based on the posterior distribution of those parameters.

import emcee

def lnprob(p):
    # Trivial uniform prior.
    if np.any((-100 > p[1:]) + (p[1:] > 100)):
        return -np.inf

    # Update the kernel and compute the log-likelihood.
    gp.set_parameter_vector(p)
    return gp.log_likelihood(y, quiet=True)

gp.compute(t)

# Set up the sampler.
nwalkers, ndim = 36, len(gp)
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob)

# Initialize the walkers.
p0 = gp.get_parameter_vector() + 1e-4 * np.random.randn(nwalkers, ndim)

print("Running burn-in")
p0, _, _ = sampler.run_mcmc(p0, 200)

print("Running production chain")
sampler.run_mcmc(p0, 200)

fig, ax = plt.subplots(figsize=(6, 6))
x = np.linspace(max(t), 2025, 250)
for i in range(50):
    # Choose a random walker and step.
    w = np.random.randint(sampler.chain.shape[0])
    n = np.random.randint(sampler.chain.shape[1])
    gp.set_parameter_vector(sampler.chain[w, n])

    # Plot a single sample.
    ax.plot(x, gp.sample_conditional(y, x), "g", alpha=0.1)

ax.scatter(t, y, c="k", s=1)
ax.set_xlim(t.min(), 2025)
ax.set_xlabel(r"year")
ax.set_ylabel(r"CO$_2$ in ppm")
fig.tight_layout()

Well the predictions look sensible, albeit mostly depressing.

Summary

A Gaussian process assigns a prior probability to each of an infinite number of possible functions. Every observation restricts the set of possible functions. Gaussian processes have important applications in machine learning and data analysis more broadly, because they are probabilistic (e.g., they preserve uncertainty), somewhat interpretable, and extremely flexible!

In the next class we will continue with advanced modelling techniques and introduce you to the concepts of hierarchical models and probabilistic graphical models.

 


Contributions

The visualisations shown here come from the Distill article A Visual Exploration of Gaussian Processes by Jochen Görtler, Rebecca Kehlbeck, and Oliver Deussen.

The code for the CO2 example comes from the george documentation written by Dan Foreman-Mackey (Flatiron).