Fitting a model to data III

Data analysis and machine learning

In the previous class we introduced many concepts, including

Even though we did not go into much depth in each of these topics, we covered a lot of breadth.

Our model has much more flexibility than the model we introduced in lesson one. But is it flexible enough? Suppose you had some domain expertise that suggests that some of the measurements in the data table are actually outliers and are not drawn from a straight line relationship between \(x\) and \(y\)? In many practical settings someone might arbitrarily decide on a threshold of what 'is' or 'is not' an outlier, and then exclude those data points from being fit. Let's imagine that you are known for making arbitrary decisions, but you are a little more judicious than most, so you are going to call this threshold '\(5\sigma\)'. That is to say that any data points that are at least '\(5\sigma\)' away from the straight line are actually outliers and should be removed from the fit. Physicists use the term '\(5\sigma\)' a lot, particularly when claiming new discoveries that challenge our understanding of the Universe, and so you feel reasonably justified in taking this arbitrary threshold.

Now let's consider the practical consequences of your chosen threshold. What, exactly, constitutes \(5\sigma\) here? Do you take your least-squares fit from the first class and then calculate some metric \[ r_i = \frac{|y_i - mx_i - b|}{\sigma_{yi}} \] where \(|x|\) means the absolute value of \(x\), then do you just remove points where \(r_i > 5\)? What of the \(x\) uncertainties? What of the orthogonal distance from the line to the point (i. e., the total distance away from the line)? What of the intrinsic scatter in your model? Even if you do all these things and take the model from the previous class, when do you make the outlier removal? After your initial guess of model parameters? After optimisation? Do you do outlier removal just once or do you so iteratively (i. e., the initialisation-optimisation-remove outliers-optimisation process multiple times)? What about doing outlier removal just once after sampling the model parameters? What's the Right Thing™ to do?

We should first remind ourselves what an outlier is, exactly. An outlier is a data point that was not drawn from the model we are primarily interested in. So if we have ten data points that are described by a straight line and another two data points where we didn't actually measure the \(x\) and \(y\) values properly -- say we bumped the apparatus or accidentally wrote down some different measure \(z\) instead of \(y\) -- then those two data points are not drawn from our straight line model. They are not just 'significant deviations' from our straight line model; they simply are not drawn from our model.

All suggested approaches so far are wrong!

The intuitive reason is that we don't know with certainty as to whether any individual data point is an outlier or not. No matter when we were to decide to intervene and remove data points. We simply don't know whether a data point was truly drawn from the model we care about, or not. In practice every data point can be assigned some probability as to whether it is an outlier or not. Since we don't know what those probabilities are the Right Thing™ to do is to introduce model parameters that describe our ignorance, and marginalise them out.

Naively this could mean introducing \(N\) extra parameters to our model: one for each data point that describes whether it is an outlier or not. Including the line parameters \(m\) and \(b\), and the intrinsic scatter \(\lambda\), that totals 23 model parameters to fit a straight line to 20 data points.You may have to trust me here when I say that this is not as crazy as it appears.

Mixture models

Let's consider that model. We will introduce a model parameter \(q_i\) for every data point that will describe whether it is an outlier or not. Including the line parameters \(m\) and \(b\), and the intrinsic scatter \(\lambda\), we already have \(N + 3\) model parameters just to fit a straight line to \(N\) data points.If this seems crazy, just trust me temporarily. Let \(q_i\) be an integer flag where 0 indicates that the \(i\)-th data point is drawn from a straight line model, and 1 indicates that it was not (and so is therefore an outlier).

For simplicity we will first assume no uncertainties in the \(x\)-direction, and no intrinsic scatter. If the \(i\)-th data point is drawn from a straight line model then the expected frequency distribution is already familiar to us: \[ p(y_i|x_i,\sigma_{yi},m,b,q_i=0) = \frac{1}{\sqrt{2\pi\sigma_{yi}^2}}\exp\left(-\frac{\left[y_i - mx_i - b\right]^2}{2\sigma_{yi}^2}\right) \] But if \(q_i = 0\), which model are the outliers drawn from? This is a difficult question to answer pedagogically, but in practice your results won't be very sensitive to the choice of outlier model: you can usually just use a Gaussian with some large variance.Or some prior that forces the outlier model to have larger variance than your model of interest. In that case, \[ p(y_i|x_i,\sigma_{yi},\mu_{o},V_{o},q_i = 1) = \frac{1}{\sqrt{2\pi\left[\sigma_{yi}^2 + V_o\right]}}\exp\left(-\frac{\left[y_i - \mu_o\right]^2}{2\left[\sigma_{yi}^2 + V_{o}\right]}\right) \] where we have introduced the parameter \(\mu_o\) to describe the mean of the outlier model, and \(V_o\) as the variance in the outlier model.

If we (continue to) assume that all data points are independent, then under this model the full likelihood becomes \[ \newcommand\vectheta{\boldsymbol{\theta}} \newcommand\vecdata{\mathbf{y}} \newcommand\model{\boldsymbol{\mathcal{M}}} \newcommand\vecx{\boldsymbol{x}} \renewcommand{\vec}[1]{\boldsymbol{#1}} p(\vec{y}|\vec{x},\vec{\sigma_{y}},\vectheta,\vec{q}) = \prod_{i=1}^{N}p(y_i|x_i,\sigma_{yi},\vectheta, q_i) \] where \(\vectheta = \{m, b, \mu_o, V_o\}\). You can imagine that for each data point, the correct model (and likelihood function) is chosen depending on whether the value of \(q_i\) is 0 or 1.

In practice you will find this problem to be very difficult to optimise and sample. The number of possible combinations of integer parameters \(\vec{q}\) grows factorially, and integer optimisation is hard to begin with! And the number of model parameters scales as \(N + 4\), where \(N\) is the number of data points. That doesn't make things impossible, but it does mean that you're probably not going to have a good time.

An aside on model parameterisation

One model can be effectively parameterised in many different ways. The choice of parameterisation can have a significant effect on how difficult it is to do inference. The wrong parameterisation can make a model intractable, and the right parameterisation can make the exact same effective model trivially solvable with linear algebra. You should always think carefully about your model parameterisation.

The marginalised likelihood

Ideally we would like to marginalise out all our integer model parameters \(\vec{q}\). Recall that \[ p(\phi) = \int p(\phi|\theta)p(\theta) d\theta \] such that for a single data point \[ p(y_i|x_i,\sigma_{yi},\vectheta) = \int p(y_i|x_i,\sigma_{yi},\vectheta,q_i)p(q_i) dq_i \] the probability just becomes the summation \[ p(y_i|x_i,\sigma_{yi},\vectheta) = \sum_{q_i} p(y_i|x_i,\sigma_{yi},\vectheta,q_i)p(q_i) \] and for all data points: \[ p(\vec{y}|\vec{x},\vec{\sigma_{y}},\vectheta) = \prod_{i=1}^{N} \sum_{q_i} p(y_i|x_i,\sigma_{yi},\vectheta,q_i)p(q_i) \quad . \] But we still need a prior on \(\vec{q}\) to finish the marginalisation. We will assume the same prior on all data points, \[ p(q_i) = \begin{cases} Q & \textrm{if } q_i = 0 \\ 1 - Q & \textrm{if } q_i = 1 \end{cases} \] so that if a data point is drawn from the straight line then it has a prior probability of \(Q\), and if it is an outlier then it has a prior probability of \(1 - Q\), where \(Q \in [0, 1]\).

The likelihood then becomes \[ p(\vec{y}|\vec{x},\vec{\sigma_{y}},\vectheta) = \prod_{i=1}^{N}\left[p(q_i=0)p\left(y_i|x_i,\sigma_{yi},\vectheta, q_i=0\right) + p(q_i=1)p\left(y_i|x_i,\sigma_{yi},\vectheta, q_i=1\right)\right] \] \[ p(\vec{y}|\vec{x},\vec{\sigma_{y}},\vectheta) = \prod_{i=1}^{N}\left[\frac{Q}{\sqrt{2\pi\sigma_{yi}^2}}\exp\left(-\frac{\left[y_i - mx_i - b\right]^2}{2\sigma_{yi}^2}\right) + \frac{1-Q}{\sqrt{2\pi\left[\sigma_{yi}^2 + V_o\right]}}\exp\left(-\frac{\left[y_i - \mu_o\right]^2}{2\left[\sigma_{yi}^2 + V_o\right]}\right)\right] \] and voilĂ ! We have marginalised over all \(N\) of our \(\vec{q}\) parameters and replaced them with a single parameter, \(Q\). We've just reduced the number of parameters from \(N + 4\) to \(5\)!

Let's code up this model and see how our inferences change.

import numpy as np import scipy.optimize as op np.random.seed(0) _, x, y, y_err, x_err, rho_xy = np.array([ [1, 201, 592, 61, 9, -0.84], [2, 244, 401, 25, 4, +0.31], [3, 47, 583, 38, 11, +0.64], [4, 287, 402, 15, 7, -0.27], [5, 203, 495, 21, 5, -0.33], [6, 58, 173, 15, 9, +0.67], [7, 210, 479, 27, 4, -0.02], [8, 202, 504, 14, 4, -0.05], [9, 198, 510, 30, 11, -0.84], [10, 158, 416, 16, 7, -0.69], [11, 165, 393, 14, 5, +0.30], [12, 201, 442, 25, 5, -0.46], [13, 157, 317, 52, 5, -0.03], [14, 131, 311, 16, 6, +0.50], [15, 166, 400, 34, 6, +0.73], [16, 160, 337, 31, 5, -0.52], [17, 186, 423, 42, 9, +0.90], [18, 125, 334, 26, 8, +0.40], [19, 218, 533, 16, 6, -0.78], [20, 146, 344, 22, 5, -0.56], ]).T Y = np.atleast_2d(y).T A = np.vstack([np.ones_like(x), x]).T C = np.diag(y_err * y_err) C_inv = np.linalg.inv(C) G = np.linalg.inv(A.T @ C_inv @ A) X = G @ (A.T @ C_inv @ Y) def ln_prior(theta): b, m, Q, mu_o, ln_sigma_v = theta if not (1 > Q > 0) \ or not (700 > mu_o > 0) \ or not (5 > ln_sigma_v > -5): return -np.inf return -3/2 * np.log(1 + m**2) # We will describe the straight line model as 'foreground'. def ln_likelihood_fg(theta, x, y, y_err): b, m, *_ = theta # Note the extra (constant) term here that we previously have omitted. # Why is it important now? return -0.5 * ((y - m * x - b)**2 / y_err**2) - np.log(y_err) # And the outlier model as 'background'. def ln_likelihood_bg(theta, x, y, y_err): *_, Q, mu_o, ln_sigma_v = theta total_variance = y_err**2 + np.exp(ln_sigma_v)**2 return -0.5 * ((y - mu_o)**2 / total_variance) - 0.5 * np.log(total_variance) def ln_probability(theta, x, y, y_err): lp = ln_prior(theta) if not np.isfinite(lp): return (-np.inf, np.nan * np.ones((2, x.size))) b, m, Q, mu_o, ln_sigma_v = theta # Compute weighted foreground likelihoods for each data point. # (They are weighted by the log(Q) term) ll_fg = np.log(Q) + ln_likelihood_fg(theta, x, y, y_err) # Compute weighted background likelihoods for each data point. ll_bg = np.log(1 - Q) + ln_likelihood_bg(theta, x, y, y_err) # Sum the log of the sum of the exponents of the log likelihoods. ll = np.sum(np.logaddexp(ll_fg, ll_bg)) return (lp + ll, np.vstack([ll_fg, ll_bg])) # Optimize! args = (x, y, y_err) initial_theta = np.array([250, 1, 0.5, 400, -3]) result = op.minimize( lambda *args: -ln_probability(*args)[0], initial_theta, args=args, method="L-BFGS-B", bounds=[ (None, None), (None, None), (0, 1), (0, 700), (-5, 5) ] )

Even though we have a point estimate for the model parameters, we are getting to the stage of model complexity where it doesn't really give a lot of intuition to just plot the prediction from one point estimate. That's because there are many parameters with complex relationships between them, yet still few enough parameters where it does not cost us much to sample them. Let's go ahead and sample.

import emcee from corner import corner # Sample! ndim, nwalkers = (result.x.size, 64) p0 = [result.x + 1e-5 * np.random.randn(ndim) for k in range(nwalkers)] sampler = emcee.EnsembleSampler( nwalkers, ndim, ln_probability, args=args ) # Run the burn-in. pos, *_ = sampler.run_mcmc(p0, 1000) sampler.reset() # Run production. sampler.run_mcmc(pos, 2000) # Make a corner plot just showing m, b, Q # (the others are nuisance parameters to us) chain = sampler.chain.reshape((-1, ndim)) fig = corner( chain[:, :3], labels=(r"$b$", r"$m$", r"$Q$"), )

Where here we are only showing a `corner` plot for three of our parameters: \(m\), \(b\), and \(Q\). The parameters of the outlier distribution are so-called nuisance parameters, and we really only care about the marginalised likelihood of the line parameters \(m\) and \(b\), and to a lesser extent, the fraction of non-outliers \(Q\).

Note the scale of these axes. Using linear algebra in the first class the best-fitting model parameters were \(m = 1.07 \pm 0.08\) and \(b = 213.3 \pm 14.4\). The axes of the corner plot above do not even contain those values! Meaning that these posteriors are very different from the solution that we found through linear algebra.

Component membership probabilities

These inferences look sensible, even though they differ dramatically from what we found in an earlier class. The fraction of 'good points' seems to be at around \(Q \approx 0.75\), which seems plausible. So it seems that things look sensible. But what if we want to make some statements about whether an individual point is likely to be an outlier? Let's comment on that before we plot the posterior predictions.

What we actually want is the posterior probability that a data point belongs to the straight line model. That is, we want the probability of \(q_k=0\), given the data: \[ p(q_i|\vec{y},\vec{\sigma_{y}},\vec{x}) = \int p(q_i,\vectheta|\vecdata,\vec{\sigma_{y}},\vec{x})d\vectheta \] \[ p(q_i|\vec{y},\vec{\sigma_{y}},\vec{x}) = \int p(q_i|\vectheta,\vec{y},\vec{\sigma_{y}},\vec{x})p(\vectheta|\vec{y},\vec{\sigma_y},\vec{x})d\vectheta \] This looks a little nasty, but it's not really. In fact, we already have everything we need to calculate this. Before we get ahead of ourself, consider the conditional probability \[ p(q_i|\vectheta,\vecdata,\vec{\sigma_y},\vec{x}) = \frac{p(q_i)p(\vecdata|\vectheta,\vec{\sigma_y},\vec{x},q_i)}{\sum_k p(q_i=k)p(\vecdata|\vectheta,\vec{\sigma_y},\vec{x},q_i=k)} \] which only includes terms that we have already calculated. The numerator describes the posterior for the straight line model, and the denominator is the sum of the posterior probability of both components. But that is given some \(\vectheta\) values, and so we still need to integrate over \(d\vectheta\) \[ p(q_i|\vecdata,\vec{\sigma_y},\vec{x}) = \int p(q_i|\vecdata,\vec{\sigma_y},\vec{x},\vectheta)p(\vectheta|\vecdata,\vec{\sigma_y},\vec{x})d\vectheta \quad . \] However, we have already approximated this integral because we have \(M\) samples from the probability distribution \(p(\vectheta|\vecdata,\vec{\sigma_y},\vec{x})\), so we can just average the probabilities from those samples \[ p(q_i|\vecdata,\vec{\sigma_y},\vec{x}) = \int p(q_i|\vecdata,\vec{\sigma_y},\vec{x},\vectheta)p(\vectheta|\vecdata,\vec{\sigma_y},\vec{x})d\vectheta \approx \frac{1}{M}\sum_{m=1}^{M} p(q_i|\vecdata,\vec{\sigma_y},\vec{x},\vectheta^{(m)}) \] such that \[ p(q_i|\vecdata,\vec{\sigma_y},\vec{x}) \approx \frac{1}{M}\sum_{m=1}^{M} \frac{p(q_i)p(\vecdata|\vectheta^{(m)},\vec{\sigma_y},\vec{x},q_i)}{\sum_{k}p(q_i=k)p(\vecdata|\vectheta^{(m)},\vec{\sigma_y},\vec{x},q_i=k)} \quad . \]

Let's code this up: # Calculate posterior probability estimates for each data point. q = np.zeros_like(x) for i in range(sampler.chain.shape[1]): for j in range(sampler.chain.shape[0]): ll_fg, ll_bg = sampler.blobs[i][j] q += np.exp(ll_fg - np.logaddexp(ll_fg, ll_bg)) q /=[:2])

Now we have the \(\vec{q}\) values for each data point, except these are not integers: they are floats that vary from 0 (outlier) to 1 (not an outlier). Let's use these values to shade individual points by their probability of being drawn from the straight-line model, as well as our posterior predictions of the model.

from matplotlib.ticker import MaxNLocator # Make posterior predictions. fig, ax = plt.subplots(figsize=(4, 4)) ax.scatter(x, y, c=q, edgecolor="k", lw=1, cmap="Greys", s=20) ax.errorbar(x, y, yerr=y_err, fmt='none', lw=1, c="k", zorder=-1) xlim = np.array([0, 300]) ax.set_xlabel(r"$x$") ax.set_ylabel(r"$y$") ax.xaxis.set_major_locator(MaxNLocator(6)) ax.yaxis.set_major_locator(MaxNLocator(7)) # Plot draws of the posterior. for index in np.random.choice(chain.shape[0], size=100): b, m = chain[index, :2] ax.plot( xlim, m * xlim + b, "-", c="tab:orange", alpha=0.2, lw=0.5, zorder=-100 ) ax.set_xlim(*xlim) ax.set_ylim(0, 700) fig.tight_layout()

Finally. That looks much more pleasing to the eye.And our inference is quite objective; there is very little arbitrariness! This comment of looking 'pleasing to the eye' is not an encouragement to do '\(\chi\)-by-eye' fitting. Never do that. And in this model there is much that we haven't done. For example, this model only includes errors in the \(y\)-direction, although we know there are uncertainties in the \(x\)-direction. (We are, by extension, also ignoring any covariance between the \(x\)- and \(y\)-direction uncertainties. And we're not modelling any intrinsic scatter: we are still assuming that the data are drawn from an infinitely narrow straight line, and any deviations from that line are due to the \(y\)-direction uncertainty only, or, perhaps that data point is drawn from the outlier distribution. You can see that we could make this model more complicated, even though the inference we see immediately above looks reasonable (unlike many of the other model variations we tried!)

The process that we went through this week is a typical one for fitting a model to data. Although we have used the most simplistic model (a straight line) as a baseline example, we have explored how complexity can be added. We've covered Bayesian statistics, priors, marginalisation, optimisation, sampling, and posterior predictions.

In the next few classes we will explore optimisation and sampling in detail. These are important topics that are easily abused.


← Previous class
Fitting a model to data II
Next class →


The example data table was taken from Hogg et al. (2010).

The mixture models blog post written by Dan Foreman-Mackey (Flatiron) was extremely useful for clarifying terminology and nomenclature.