Fitting a model to data II

Data analysis and machine learning

In the previous class we introduced the simplest problem of fitting a model to data: one where you have precisely known \(x\) values (with zero uncertainty), associated \(y\) values with reliable \(y\)-direction uncertainties, and a belief that the \(y\) values are truly drawn from a narrow straight line.

Assuming that the uncertainties in the \(y\)-direction are Gaussian distributed, we derived the frequency distribution for the \(y\) values, and showed how -- in this example -- maximizing the log likelihood \(\log\mathcal{L}\) is equivalent to minimizing the \(\chi^2\) value, which is familiar to many of us. However, this is not the complete picture.
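If it helps to convince yourself of that equivalence, here is a minimal numerical sketch. The data below are simulated and the variable names are mine (not part of the example we will use later): for Gaussian uncertainties the log likelihood is \(-\frac{1}{2}\chi^2\) plus a constant, so the slope that maximizes one minimizes the other.

import numpy as np

rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 20)
y_err_demo = np.ones_like(x_demo)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(0, y_err_demo)

b = 1.0  # hold the intercept fixed and scan the slope
m_grid = np.linspace(1.5, 2.5, 1001)

chi2 = np.array([np.sum(((y_demo - m * x_demo - b) / y_err_demo)**2) for m in m_grid])
ln_L = -0.5 * chi2 - np.sum(np.log(np.sqrt(2 * np.pi) * y_err_demo))

# The slope that maximizes the log likelihood minimizes chi^2.
assert m_grid[np.argmax(ln_L)] == m_grid[np.argmin(chi2)]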

What if the \(x\) values have some uncertainty to them? It means that we don't know what the true \(x\) values are, and when we were previously calculating our \(y\) values we were assuming that we knew the \(x\) values precisely! What if we have some prior belief about the model parameters \(m\) and \(b\)?

How do we extend our model to incorporate more complexity? How do we incorporate our prior beliefs in a rigorous way? How do we account for uncertainties in the \(x\) and \(y\) directions?

Bayes' theorem

First we will introduce the Bayesian generalisation of this problem. In Bayesian inference the log likelihood \(\log\mathcal{L}\), which quantifies the likelihood of the data given the model, is only one part of the problem. Before evaluating any likelihood we often have some idea, or prior information, about what reasonable values our parameters can have. Better yet, we may have good physical reasons for how they should be distributed, or how they should vary with each other. In Bayesian inference we incorporate that prior information in a justified, rigorous way.

Bayes' theorem is nothing complex, yet it is a miraculous equation with counter-intuitive consequences. Bayes' theorem follows directly from the rules of conditional probability, and it states that \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \] the probability of an event \(A\) occurring given event \(B\) occurring \(P(A|B)\), is equal to the probability of event \(B\) occurring given event \(A\), \(P(B|A)\), multiplied by the probability that event \(A\) will occur \(P(A)\), all divided by the probability that event \(B\) will occur \(P(B)\).

Why is this special?

\(A\) and \(B\) don't have to be events per se. It might be easier to see how important this is if we re-cast these probabilities in terms of our unknown model parameters \[ \newcommand\vectheta{\boldsymbol{\theta}} \newcommand\vecdata{\mathbf{y}} \newcommand\model{\boldsymbol{\mathcal{M}}} \newcommand\vecx{\boldsymbol{x}} \vectheta = \{m, b\} \] and the data \(\mathbf{y}\), \[ P(\vectheta|\vecdata,\model) = \frac{P(\vecdata|\vectheta,\model)P(\vectheta|\model)}{P(\vecdata|\model)} \] where \(\model\) describes our knowledge of \(\vecx\), \(\mathbf{\sigma_y}\), and everything else about the model. (This is often denoted simply as \(\vecx,\mathbf{\sigma_{y}}\), but it is useful to remember that when we are calculating conditional probabilities, those probabilities are conditioned on everything we believe about the model. That includes all our implicit and explicit assumptions about the model, and the Universe. When we get into discussing model selection you will see that if you are comparing classes of models (e.g., Gaussian mixture models) then your probabilities are conditioned on everything you assume about those models.) You may find it useful to literally say the equation above out loud (recall that the \(|\) symbol means "given"), \[ P(\textrm{model}|\textrm{data}) = \frac{P(\textrm{data}|\textrm{model})P(\textrm{model})}{P(\textrm{data})} \quad . \]

Now you can see that we have a justified way of calculating the probability of a model, given some data, that can include some prior beliefs about that model.

Using our shorthand nomenclature for \(\vectheta\) and \(\model\), you should recognise the \(P(\vecdata|\vectheta,\model)\) term from the previous class, where we introduced the frequency distribution \(p(y_i|x_i,\sigma_{yi},m,b)\) for a single data point \(y_i\) \[ p(y_i|x_i,\sigma_{yi},m,b) = \frac{1}{\sqrt{2\pi\sigma_{yi}^2}}\exp\left(-\frac{\left[y_i-mx_i - b\right]^2}{2\sigma_{yi}^2}\right) \quad . \] Because we assume that each data point is independent, the \(P(\vecdata|\vectheta,\model)\) term is the product of individual probabilities: \[ P(\vecdata|\vectheta,\model) = \prod_{i=1}^{N} p(y_i|x_i,\sigma_{yi},\vectheta) \quad . \]
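As a concrete sketch of that product (assuming arrays x, y, y_err and parameters m, b like the ones we will load later; the function name here is just for illustration), the log of the product over data points is simply a sum of Gaussian log densities:

import numpy as np
from scipy import stats

def ln_likelihood_sketch(m, b, x, y, y_err):
    # The log of the product over data points is the sum of the individual
    # Gaussian log densities, evaluated at the model prediction m*x + b.
    return np.sum(stats.norm.logpdf(y, loc=m * x + b, scale=y_err))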

But what of the other terms?

The left-hand side of Bayes' theorem is usually what you want, because you want to know the probability of the model, given some data. The denominator on the right-hand side of Bayes' theorem, \(P(\vecdata|\model)\) (or just \(P(\vecdata)\) for short), has many names. (The most common term is also the most misleading: the evidence. People like this term because the quotient \(\frac{P(\textrm{data}|\textrm{model})}{P(\textrm{data})}\) represents the support that the data provides for the model. People hate this term in part because it's not specific terminology, and there are many ways to confuse oneself.) Throughout this course we will describe it as the fully marginalised likelihood. You will remember it as the fully marginalised likelihood, or FML, because that is exactly how you will feel if you have to calculate it (see Contributions below).

The fully marginalised likelihood is a normalisation factor that ensures your posterior probability density integrates to unity. It can be extremely expensive (or intractable) to calculate, and in most applications you don't need it. If you only want to know the best model parameters, and perhaps what the uncertainties on those model parameters look like, then you don't need the fully marginalised likelihood. And most times when people think they need the fully marginalised likelihood, they don't.

So, for now, we ignore it. We will come back to the fully marginalised likelihood in later classes. But how, exactly, do we get away with ignoring it? We saw this in the previous lecture when we considered what would happen if we multiplied many near-zero probabilities together. If we calculated that on a computer we would quickly reach arithmetic underflow. Instead of multiplying many near-zero values together, we sum the logarithms of those numbers \[ \log{P(\vectheta|\vecdata,\model)} = \log{P(\vecdata|\vectheta,\model)} + \log{P(\vectheta|\model)} - \log{P(\vecdata|\model)} \quad . \] We can see that \(\log{P(\vecdata|\model)}\) is a constant that does not change with \(\vectheta\), so we can re-write this as \[ \begin{array}{ccccc} \log{P(\vectheta|\vecdata,\model)} & \propto & \log{P(\vecdata|\vectheta,\model)} & + & \log{P(\vectheta|\model)} \quad . \\ \\ \uparrow && \uparrow && \uparrow \\ \textrm{log posterior} && \textrm{log likelihood} && \textrm{log prior} \\ \textrm{probability} && && \textrm{probability} \end{array} \]

So you see that what we actually want is the log posterior probability, which is the sum of the log likelihood (as we introduced last class) and the log prior probability. If you had no prior information then the log posterior probability is exactly equivalent to the log likelihood. And usually the prior is weak relative to the likelihood. But we almost always have prior information that we can include. How do we include those prior beliefs?

Priors

To introduce a prior on some parameter we must describe what we believe the probability density looks like for that parameter. For example, the most non-informative (or un-inform-ative, uniform-ative, uniform) prior is a uniform prior between two values. If we were to say the prior on some parameter \(\phi\) is uniform between \(a\) and \(b\) \[ \phi \sim \mathcal{U}(a, b) \] then we are saying that \(\phi\) is drawn from a uniform distribution between \(a\) and \(b\). It also means that \(\phi\) cannot be less than \(a\) or more than \(b\). The probability density function for a uniform distribution is \(\frac{1}{b-a}\) for \(\phi \in [a, b]\), or 0 otherwise, such that \(\log{P(\phi)} = -\log{(b-a)}\) (i.e., a constant value) if \(\phi \in [a, b]\) and \(\log{P(\phi)} = -\infty\) otherwise.
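In code, that log prior might look something like this minimal sketch (the function name is mine, for illustration only):

import numpy as np

def ln_uniform_prior(phi, a, b):
    # log(1/(b - a)) inside the support; -infinity outside it.
    if a <= phi <= b:
        return -np.log(b - a)
    return -np.inf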

Uniform priors are not always a good thing. You may instead believe that some parameter is 'of order X' and describe the prior as a Gaussian that is centered on X with some large variance. That may not be a good prior, either. It's important to think about what your choice of prior means, and the impact it has on your model.

Let's do that. This process is called a prior predictive check, because we are going to check that our prior predictions are sensible (and match what we expect). Let us first assume that we have a uniform prior on the slope \(m\) \[ m \sim \mathcal{U}(0, 10) \] and now we will sample that distribution and plot the density in \((x, y)\) space.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

x = np.linspace(-1, 1)

fig, ax = plt.subplots(figsize=(4, 4))
for m in np.linspace(0, 10, 100):
    ax.plot(x, m * x, "-", c="k", lw=0.5, alpha=0.5)

ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
ax.xaxis.set_major_locator(MaxNLocator(4))
ax.yaxis.set_major_locator(MaxNLocator(4))
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
fig.tight_layout()

The lines on this figure have evenly spaced differences in slope \(m\), but they bunch up together as the slope angle approaches \(\pm\frac{\pi}{2}\). So even though our uniform prior gives equal probability to each one of these lines, it's clear that a flat prior in \(m\) ends up favouring very steep slopes! Who ordered that?!

For good reasons outside the scope of this class, we will use a joint prior on \(m\) and \(b\) of \[ p(m,b) \propto (1 + m^2)^{-3/2} \] which is equivalent to stating that \(b\) is uniformly distributed, and \(m\) is uniformly distributed in \(\sin\theta\) where \(\theta = \tan^{-1}m\).
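To see what this prior implies, we can repeat the prior predictive check from above (a sketch, re-using the same plotting setup): sampling \(\sin\theta\) uniformly and transforming to slopes is equivalent to drawing \(m\) from \(p(m) \propto (1 + m^2)^{-3/2}\), and the resulting lines no longer pile up at steep slopes.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

x = np.linspace(-1, 1)

fig, ax = plt.subplots(figsize=(4, 4))
# Sample sin(theta) uniformly, then transform to slopes m = tan(theta).
# This is equivalent to drawing m from p(m) proportional to (1 + m^2)^(-3/2).
for s in np.random.uniform(-1, 1, 100):
    m = s / np.sqrt(1 - s**2)
    ax.plot(x, m * x, "-", c="k", lw=0.5, alpha=0.5)

ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
ax.xaxis.set_major_locator(MaxNLocator(4))
ax.yaxis.set_major_locator(MaxNLocator(4))
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
fig.tight_layout()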

One thing you should always remember is that your priors need to be defined before you look at the data. You cannot look at the data -- or worse, see what kind of model parameters make a reasonable fit -- and then decide what kind of prior you want to include. Otherwise you will have evidence from the sample data coming through both the likelihood and the prior. (There are subtleties to this. On Earth the first detected gravitational wave had a very high signal-to-noise ratio. There was little to no doubt that gravitational waves had been detected. Let's hypothesise that no gravitational waves were detected in the year following. Then, nearly a year later, a very weak gravitational wave is detected. By the time the weak signal is detected we have a strong prior that gravitational waves do exist, so if we make a marginal detection, we can be a little more confident that it is indeed real.)

Now consider a parallel universe where the loud gravitational wave signal came second, and the weak one came first. When the weak one arrived we may have flustered over its significance for a year before telling anyone, because our prior was that gravitational waves should exist, but we didn't know for certain.

The \(x\) values now have uncertainties, too!

Let's make things a little more complicated.

The data now have uncertainties in the \(x\)-direction and in the \(y\)-direction. For now we will assume that the uncertainties in each direction are independent of (uncorrelated with) each other. How do we include these new uncertainties?

Let's plot this example, where we have uncorrelated uncertainties in \(x\) and \(y\), using the same data set as the previous class.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from matplotlib.patches import Ellipse

# Load the data.
_, x, y, y_err, x_err, rho_xy = data = np.array([
    [1, 201, 592, 61, 9, -0.84],
    [2, 244, 401, 25, 4, +0.31],
    [3, 47, 583, 38, 11, +0.64],
    [4, 287, 402, 15, 7, -0.27],
    [5, 203, 495, 21, 5, -0.33],
    [6, 58, 173, 15, 9, +0.67],
    [7, 210, 479, 27, 4, -0.02],
    [8, 202, 504, 14, 4, -0.05],
    [9, 198, 510, 30, 11, -0.84],
    [10, 158, 416, 16, 7, -0.69],
    [11, 165, 393, 14, 5, +0.30],
    [12, 201, 442, 25, 5, -0.46],
    [13, 157, 317, 52, 5, -0.03],
    [14, 131, 311, 16, 6, +0.50],
    [15, 166, 400, 34, 6, +0.73],
    [16, 160, 337, 31, 5, -0.52],
    [17, 186, 423, 42, 9, +0.90],
    [18, 125, 334, 26, 8, +0.40],
    [19, 218, 533, 16, 6, -0.78],
    [20, 146, 344, 22, 5, -0.56],
]).T

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x, y, c="k", s=10)
ax.errorbar(x, y, xerr=x_err, yerr=y_err, fmt="o", lw=1, c="k")
ax.set_xlim(0, 300)
ax.set_ylim(0, 700)
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(7))
fig.tight_layout()

If we assume that the uncertainties in the \(x\)-direction are Gaussian distributed then the \(x_i\) value is drawn from a normal distribution (denoted with \(\mathcal{N}\)) that is centered on the true, unknown value \(\hat{x}_i\), with some standard deviation \(\sigma_{xi}\) \[ x_i \sim \mathcal{N}(\hat{x}_i, \sigma_{xi}) \quad . \]

If we were only considering the \(x\) values then the frequency distribution is already familiar to us, \[ p(x_i|\hat{x}_i,\sigma_{xi}) = \frac{1}{\sqrt{2\pi\sigma_{xi}^2}}\exp{\left(-\frac{\left[x_i - \hat{x}_i\right]^2}{2\sigma_{xi}^2}\right)} \quad . \] Because we assume the uncertainties in the \(x\)-direction and the \(y\)-direction are uncorrelated, we can calculate the joint probability of \(P(y_{i},x_{i}|\model)\) using the probability chain rule \[ P(A,B|C) = P(A|B,C)P(B|C) \] which in this context implies \[ p(y_i,x_i|\model) = p(y_i|x_i,\model)p(x_i|\model) \] and for the \(i\)-th data point, \[ p(y_i,x_i|\hat{x}_i,\sigma_{xi},\sigma_{yi},\vectheta) = p(y_i|\hat{x}_i,\sigma_{yi},\vectheta)p(x_i|\hat{x}_i,\sigma_{xi}) \quad . \]

There are a few comments worth making here. Note that the unknown true values \(\hat{x}_i\) have been lumped in with our unknown model parameters \(\vectheta\). And unlike the \(\sigma_{yi}\) values (which are known), we don't know what the \(\hat{x}_i\) values are! We have just effectively introduced \(N\) unknown parameters to represent the true values \(\hat{\vecx}\). That seems totally crazy because we have increased the number of unknown parameters from \(2\) to \(2 + N\), and all we want to do is fit a line! Now we have more unknown model parameters than we have data points! But actually, this is the simplest way to incorporate our uncertainty in the \(x\) positions, because we will marginalise over the true values \(\hat{x}_i\) (i.e., we will integrate them out) so that our constraints on \(\vectheta\) are marginalised over all possible \(\hat{\vecx}\) values. (In other words: it doesn't matter what the true \(\hat{\vecx}\) values are; we have considered the probability for all \(\hat{x}_i\) values between \(-\infty\) and \(+\infty\).) Don't worry if this part isn't clear to you yet. It will be.

Let's consider a single data point \((x_i, y_i)\) and marginalise over \(\hat{x}_i\) for that point. Recall that \[ p(\phi) = \int p(\phi|\theta)p(\theta) d\theta \] and for a single data point this integral brings us back to the two-dimensional case \[ p(y_i,x_i|\sigma_{xi},\sigma_{yi},\vectheta) = \int p(y_i,x_i|\hat{x}_i,\sigma_{xi},\sigma_{yi},\vectheta) p(\hat{x}_i) d\hat{x}_i \] where \(p(\hat{x}_i)\) is our prior on the true value \(\hat{x}_i\). We will assume an improper uniform prior (an improper prior is one that does not have a finite integral) going from \(-\infty\) to \(+\infty\), but the same conclusion can be drawn by assuming a Gaussian prior on \(\hat{x}_i\) with a very large variance.

The resulting integral is a magical consequence of the fact that the convolution of a Gaussian with a Gaussian is a Gaussian (you'd be surprised how useful it is to remember this fact) \[ p(y_i,x_i|\sigma_{xi},\sigma_{yi},\vectheta) \propto \frac{1}{\sqrt{2\pi(m^2\sigma_{xi}^2 + \sigma_{yi}^2)}}\exp{\left(-\frac{(y_i - mx_i - b)^2}{2(m^2\sigma_{xi}^2 + \sigma_{yi}^2)}\right)} \] and now we have an expected frequency distribution for \((x_i,y_i)\) where all the true unknown parameters \(\hat{x}_i\) have been marginalised away! Note that the slope projects the \(x\) uncertainty into the \(y\) direction, so the effective variance for each data point is \(m^2\sigma_{xi}^2 + \sigma_{yi}^2\), which is consistent with the general result we derive below for correlated uncertainties. Including our priors on \(p(m,b)\) from earlier, the log posterior probability becomes \[ \log{p(\vectheta|\vecdata,\vecx,\boldsymbol{\sigma_{x}},\boldsymbol{\sigma_{y}})} \propto -\frac{1}{2}\sum_{i=1}^{N} \left[\frac{(y_i - mx_i - b)^2}{m^2\sigma_{xi}^2 + \sigma_{yi}^2} + \log{\left(m^2\sigma_{xi}^2 + \sigma_{yi}^2\right)}\right] - \frac{3}{2}\log{(1 + m^2)}\quad . \]
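If you want to convince yourself of this marginalisation, here is a quick numerical sanity check (the parameter values and data point below are made up for illustration): integrate the product of the two Gaussians over \(\hat{x}_i\) with scipy and compare it to the closed-form Gaussian with effective variance \(m^2\sigma_{xi}^2 + \sigma_{yi}^2\).

import numpy as np
from scipy import stats
from scipy.integrate import quad

# One made-up data point and made-up parameter values.
m, b = 1.5, 10.0
x_i, y_i, x_err_i, y_err_i = 2.0, 14.0, 0.3, 0.8

def integrand(x_true):
    # p(y_i | x_true, sigma_yi, theta) * p(x_i | x_true, sigma_xi)
    return (stats.norm.pdf(y_i, loc=m * x_true + b, scale=y_err_i)
            * stats.norm.pdf(x_i, loc=x_true, scale=x_err_i))

# With a flat prior on x_true, integrate it out numerically.
# (The integrand is negligible outside this range for these numbers.)
numerical, _ = quad(integrand, 0, 5)

# The closed-form Gaussian with effective variance m^2 sigma_x^2 + sigma_y^2.
analytic = stats.norm.pdf(
    y_i, loc=m * x_i + b, scale=np.sqrt(m**2 * x_err_i**2 + y_err_i**2))

assert np.isclose(numerical, analytic)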

Incredible, right!? Even with our prior, the \(x\) uncertainties enter very simply into our log probability function, but this is only because the \(x\) and \(y\) uncertainties are uncorrelated! But how would we find the best-fitting model parameters? We can't solve for them using linear algebra like we did in the previous class. Instead we will minimise the negative log posterior probability. (You might think: why don't we just maximize the log probability instead of minimizing its negative? The answer is incredibly boring: most optimization algorithms have as their objective to minimize some value, rather than maximize some value.) We will discuss optimisation in more detail in a later class, so for now I will give you an example of how this is done in Python.

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op
from matplotlib.ticker import MaxNLocator
from matplotlib.patches import Ellipse

_, x, y, y_err, x_err, rho_xy = np.array([
    [1, 201, 592, 61, 9, -0.84],
    [2, 244, 401, 25, 4, +0.31],
    [3, 47, 583, 38, 11, +0.64],
    [4, 287, 402, 15, 7, -0.27],
    [5, 203, 495, 21, 5, -0.33],
    [6, 58, 173, 15, 9, +0.67],
    [7, 210, 479, 27, 4, -0.02],
    [8, 202, 504, 14, 4, -0.05],
    [9, 198, 510, 30, 11, -0.84],
    [10, 158, 416, 16, 7, -0.69],
    [11, 165, 393, 14, 5, +0.30],
    [12, 201, 442, 25, 5, -0.46],
    [13, 157, 317, 52, 5, -0.03],
    [14, 131, 311, 16, 6, +0.50],
    [15, 166, 400, 34, 6, +0.73],
    [16, 160, 337, 31, 5, -0.52],
    [17, 186, 423, 42, 9, +0.90],
    [18, 125, 334, 26, 8, +0.40],
    [19, 218, 533, 16, 6, -0.78],
    [20, 146, 344, 22, 5, -0.56],
]).T

# Pre-compute the variances in each direction because we will use them often.
# (Because the effective variance depends on the slope m, we cannot
#  pre-compute a single combined inverse variance.)
x_var = x_err**2
y_var = y_err**2

def ln_prior(theta):
    b, m = theta
    return -3/2 * np.log(1 + m**2)

def ln_likelihood(theta, x, y, x_var, y_var):
    b, m = theta
    # The slope projects the x uncertainty into the y direction,
    # giving an effective variance of m^2 * sigma_x^2 + sigma_y^2.
    sigma2 = m**2 * x_var + y_var
    # Dropping constant terms.
    return -0.5 * np.sum((y - m * x - b)**2 / sigma2 + np.log(sigma2))

def ln_probability(theta, x, y, x_var, y_var):
    return ln_prior(theta) + ln_likelihood(theta, x, y, x_var, y_var)

nlp = lambda *args: -ln_probability(*args)

# This is a simple model so initialisation is not that important,
# but let's be kind to computers and initialise from the linear
# algebra solution from the previous class.
Y = np.atleast_2d(y).T
A = np.vstack([np.ones_like(x), x]).T
C = np.diag(y_err * y_err)
C_inv = np.linalg.inv(C)
G = np.linalg.inv(A.T @ C_inv @ A)
X = G @ (A.T @ C_inv @ Y)

initial_theta = X.T[0]
args = (x, y, x_var, y_var)

# Optimize!
# Remember: we want to minimize the negative log probability.
result = op.minimize(nlp, initial_theta, args=args, method="L-BFGS-B")

Ok, let's plot the optimised solution and see how it compares to our linear algebra result from the previous class.

# Plot the data.
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x, y, c="k", s=10)
ax.errorbar(x, y, xerr=x_err, yerr=y_err, fmt="o", lw=1, c="k")
ax.set_xlim(0, 300)
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(7))

# Plot the initial guess.
xlim = np.array(ax.get_xlim())
ax.plot(xlim, initial_theta[1] * xlim + initial_theta[0],
        "-", ls=":", c="#666666", lw=1, zorder=-1)

# Plot the optimized value.
ax.plot(xlim, result.x[1] * xlim + result.x[0],
        "-", c="tab:red", lw=2, zorder=-2)

ax.set_xlim(*xlim)
ax.set_ylim(0, 700)
fig.tight_layout()

It would seem that our prior and the \(x\) uncertainties have made very little difference to our point estimate of the line parameters. The linear algebra solution from lesson one is shown in grey and our new estimate is shown in red. But this is just a point estimate: a single value estimate of the model parameters. It's possible that our uncertainties on the model parameters have now changed since including the prior on \(m\) and the \(x\) uncertainties. The prior we have adopted means we can't estimate the uncertainties from linear algebra, but we can estimate the uncertainties using Markov Chain Monte Carlo (MCMC) sampling. We will cover MCMC in a later class, so for now you can just think of it as 'an expensive way to estimate uncertainties.'

import emcee
from corner import corner

# Let's sample the posterior to estimate our uncertainties on the model parameters.
ndim, nwalkers = (result.x.size, 32)
p0 = [result.x + 1e-5 * np.random.randn(ndim) for k in range(nwalkers)]

sampler = emcee.EnsembleSampler(nwalkers, ndim, ln_probability, args=args)

# Run the burn-in.
pos, *_ = sampler.run_mcmc(p0, 500)
sampler.reset()

# Run production.
sampler.run_mcmc(pos, 1000)

# Make a corner plot.
chain = sampler.chain.reshape((-1, ndim))
fig = corner(chain, labels=(r"$b$", r"$m$"))

# Add the point estimate we found from optimisation as a red dot.
ax = fig.axes[2]
ax.scatter(*result.x, s=20, facecolor="tab:red", zorder=100)

def _ellipse(x, y, cov, scale=2, facecolor="none", **kwargs):
    vals, vecs = np.linalg.eig(cov)
    theta = np.degrees(np.arctan2(*vecs[::-1, 0]))
    w, h = scale * np.sqrt(vals)
    kwds = dict(lw=0.5)
    kwds.update(**kwargs)
    ellipse = Ellipse(xy=[x, y], width=w, height=h, angle=theta, **kwds)
    ellipse.set_facecolor(facecolor)
    return ellipse

# Add our linear algebra estimate as a blue point, with contours.
b, m = X.T[0]
ax.scatter([b], [m], s=20, facecolor="tab:blue", zorder=100)
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=0, alpha=0.5,
                       facecolor="tab:blue", color="tab:blue", zorder=10))
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=2,
                       facecolor="none", color="tab:blue", zorder=60))

Our uncertainties have changed! Introducing the \(x\)-direction uncertainties has clearly had a large effect on our inferences about the line parameters \(m\) and \(b\). (Ask yourself: why are we so confident that it was the \(x\)-direction uncertainties that inflated our parameter uncertainties? Could the prior on \(m\) have had the same effect? Could you calculate the effect?) The two estimates are still very consistent, but it's clear that our newer uncertainties are perhaps a bit more honest. That's good!

Let's plot draws from the posterior.

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x, y, c="k", s=10)
ax.errorbar(x, y, xerr=x_err, yerr=y_err, fmt="o", lw=1, c="k")

xlim = np.array([0, 300])
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(7))

# Plot draws of the posterior.
for index in np.random.choice(chain.shape[0], size=100):
    b, m = chain[index]
    ax.plot(xlim, m * xlim + b, "-", c="tab:red", alpha=0.1, lw=0.5, zorder=-1)

ax.set_xlim(*xlim)
ax.set_ylim(0, 700)
fig.tight_layout()

What do you think about this? Has including a prior on \(m\) and the uncertainties in the \(x\) direction made for a better model? Is it flexible enough? Probably not, since we know that the \(x\) and \(y\) uncertainties truly aren't independent of each other.

Arbitrary two-dimensional uncertainties

Now we will consider the situation where the uncertainties in \(x\) and \(y\) are correlated. For a single data point \((x_i, y_i)\) with \(x\)-direction uncertainties \(\sigma_{xi}\), \(y\)-direction uncertainties \(\sigma_{yi}\), and a correlation coefficient \(\rho_{xy}\), the full covariance matrix is \[ \mathbf{C}_i \,=\, \left[\begin{array}{cc} \sigma_{xi}^2 & \rho_{xy,i}\sigma_{xi}\sigma_{yi} \\ \rho_{xy,i}\sigma_{xi}\sigma_{yi} & \sigma_{yi}^2 \end{array} \right] \quad . \] There are a few ways you could address this problem. We will start by introducing the residual vector \(\mathbf{R}_i\) for the \(i\)-th data point \[ \mathbf{R}_i \,=\, \left[ \begin{array}{c} x_i - \hat{x}_i \\ y_i - m\hat{x}_i - b \end{array} \right] \] such that the probability density function is \[ \newcommand{\transpose}{^{\scriptscriptstyle \top}} p(\vecdata,\vecx|m,b,\hat{\vecx},\mathbf{C}) = \prod_{i=1}^{N}\frac{1}{2\pi\sqrt{\det{\left(\mathbf{C}_i\right)}}}\exp{\left(-\frac{1}{2}{\mathbf{R}_i}\transpose{\mathbf{C}_i}^{-1}\mathbf{R}_i\right)} \] where we regrettably overload our nomenclature by describing \(\mathbf{C}\) as the set of all covariance matrices, whereas \(\mathbf{C}_i\) describes the \(i\)-th covariance matrix.

Like earlier, we integrate over the true \(\hat{x}_i\) values for a single data point \[ p(y_{i},x_{i}|m,b,\mathbf{C}_i) = \int_{-\infty}^{+\infty} p(\hat{x}_{i})p(y_i,x_i|m,b,\hat{x}_i,\mathbf{C}_i)d\hat{x}_i \] using an (improper) uniform prior \[ p(y_i,x_i|m,b,\mathbf{C}_i) = \int_{-\infty}^{+\infty}\frac{1}{2\pi\sqrt{\det{\left(\mathbf{C}_i\right)}}}\exp\left(-\frac{1}{2}{\mathbf{R}_i}\transpose{\mathbf{C}_i}^{-1}\mathbf{R}_{i}\right)d\hat{x}_i \] such that the marginalised probability density is \[ p(y_i,x_i|m,b,\mathbf{C}_i) \propto \frac{1}{\sqrt{2\pi{\Sigma_i}^2}}\exp{\left(-\frac{{\Delta_i}^2}{2{\Sigma_i}^2}\right)} \] where \[ \begin{array}{ccl} \Delta_i &=& y_i - mx_i - b\\ {\Sigma_i}^2 &=& \left(\sigma_{xi}m\right)^2 - 2m\rho_{xy,i}\sigma_{xi}\sigma_{yi} + \sigma_{yi}^2 \end{array} \] or \[ {\Sigma_i}^2 \,=\, \mathbf{V}\transpose\mathbf{C}_i\mathbf{V} \] where \(\mathbf{V}\transpose = \left[-m \,\,\,1\right]\) is a projection vector.
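Here is a quick numerical sanity check of that projection identity (the slope, uncertainties, and correlation coefficient below are arbitrary): the projected variance \(\mathbf{V}\transpose\mathbf{C}_i\mathbf{V}\) matches the expanded expression for \({\Sigma_i}^2\).

import numpy as np

# Arbitrary values for one data point.
m = 1.5
sigma_x, sigma_y, rho = 0.3, 0.8, 0.4

C_i = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                [rho * sigma_x * sigma_y, sigma_y**2]])
V = np.array([-m, 1.0])

projected = V @ C_i @ V
expanded = (sigma_x * m)**2 - 2 * m * rho * sigma_x * sigma_y + sigma_y**2

assert np.isclose(projected, expanded)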

Let's use the same prior on \(p(m,b)\) as earlier and sample the marginalised posterior probability (marginalised over \(\hat{x}_i\) values).

# Construct the covariance matrices that we'll need.
covs = np.array([
    [[x_e**2, x_e * y_e * rho],
     [x_e * y_e * rho, y_e**2]]
    for y_e, x_e, rho in zip(y_err, x_err, rho_xy)
])

# Same prior as before, but re-define for fun.
def ln_prior(theta):
    b, m = theta
    return -3/2 * np.log(1 + m**2)

def ln_likelihood(theta, x, y, C):
    b, m = theta

    # Define projection vector.
    V = np.array([[-m, 1]]).T

    Delta = y - m * x - b
    # Sigma2 is the projected variance, Sigma_i^2 = V^T C_i V.
    Sigma2 = (V.T @ C @ V).flatten()

    # Dropping constant terms.
    return np.sum(-0.5 * np.log(Sigma2) - 0.5 * Delta**2 / Sigma2)

def ln_probability(theta, x, y, C):
    return ln_prior(theta) + ln_likelihood(theta, x, y, C)

# Optimize! (from the linear algebra solution again)
args = (x, y, covs)
result = op.minimize(
    lambda *args: -ln_probability(*args),
    X.T[0],
    args=args,
    method="L-BFGS-B"
)

# Sample!
ndim, nwalkers = (result.x.size, 32)
p0 = [result.x + 1e-5 * np.random.randn(ndim) for k in range(nwalkers)]

sampler = emcee.EnsembleSampler(nwalkers, ndim, ln_probability, args=args)

# Run the burn-in.
pos, *_ = sampler.run_mcmc(p0, 500)
sampler.reset()

# Run production.
sampler.run_mcmc(pos, 1000)

# Make a corner plot.
chain = sampler.chain.reshape((-1, ndim))
fig = corner(chain, labels=(r"$b$", r"$m$"))

ax = fig.axes[2]
ax.scatter(*X.T[0], s=20, facecolor="tab:blue", zorder=100)
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=0, alpha=0.5,
                       facecolor="tab:blue", color="tab:blue", zorder=10))
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=2,
                       facecolor="none", color="tab:blue", zorder=60))

fig.tight_layout()

Wow! Compare the posteriors we get on \(m\) and \(b\) by including arbitrary uncertainties (i.e., the correlations between \(x\) and \(y\)) with the uncertainties we found in lesson one using linear algebra (blue). Now the two sets are barely consistent with each other!

Let's make posterior predictions of the line by drawing samples from the posterior and plotting the model predictions.

# Make posterior predictions.
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x, y, c="k", s=10)

# We are using arbitrary covariance matrices, so we should show
# error ellipses instead of error bars.
for xi, yi, cov in zip(x, y, covs):
    ax.add_artist(_ellipse(xi, yi, cov, scale=2, color="k"))

xlim = np.array([0, 300])
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(7))

# Plot draws of the posterior.
for index in np.random.choice(chain.shape[0], size=100):
    b, m = chain[index]
    ax.plot(xlim, m * xlim + b, "-", c="tab:green", alpha=0.2, lw=0.5, zorder=-1)

ax.set_xlim(*xlim)
ax.set_ylim(0, 700)
fig.tight_layout()

How else do these posteriors differ from what we found in the last class using linear algebra? How do they compare to the case where we did not include correlations between the \(x\) and \(y\) uncertainties? Intuitively, what would you expect when you include, or do not include, correlations between uncertainties?

Intrinsic scatter

So far we have shown how to incorporate uncertainties in the \(x\) and \(y\) directions in different cases. The simpler case is when the uncertainties in \(x\) and \(y\) are uncorrelated; in the more complex case they are correlated. But in both cases we are still assuming that the data are generated from an infinitely narrow straight line, where any deviations from that line are entirely due to the uncertainties in the measurements.

What if that is not the case? What if the \(y\) values are described by a straight line relationship that depends on \(x\), but there is some intrinsic scatter, or jitter, to that relationship, such that deviations from the line are not only dependent on the measurement uncertainties?

Consider this possibility carefully. We're about to go through the math for this model, but first you should ask: How would I put a prior on the intrinsic scatter? Do you think you could place some prior on \(p(m,b)\) that would result in a small intrinsic scatter along the line? Why or why not? Write down that prior.

At this point many students find it useful to reconsider what they think of the term uncertainty. Uncertainty and error are not the same thing. Errors are mistakes. Data have uncertainties, not errors. If you appreciate that data are uncertain then you will feel more comfortable learning that models can also be uncertain. What does this mean? I don't mean that model parameters are uncertain (although they are); what I mean is that you can include parameters in your model that describe uncertainty (or variance). And thanks to Bayes, we can introduce additional parameters to our model and marginalise them out! Better yet, here we use the fact that the convolution of a Gaussian with a Gaussian is a Gaussian, and the second Gaussian here is zero-centered, with a covariance matrix built from our model variance!

Let's do that. We will introduce a parameter that describes some intrinsic variance to the straight line relationship. Which direction should this intrinsic variance project to? The \(x\) direction? \(y\)? The answer is both: we want the intrinsic variance to be projected orthogonal to the line.

There are a few ways to parameterise this model. I like Dan Foreman-Mackey's nomenclature because it more closely matches my intuition (and because it's formally correct). The original version of Hogg, Bovy, & Lang (2010) contained an error in their Eq. 32 such that in the limit of zero noise in the \(x\) direction it did not reduce to \(\chi^2\). The GitHub repository has been updated, and a new version will probably be posted to the arXiv soon.

We will state that the measured \(x\) and \(y\) values are drawn from a two-dimensional multivariate normal distribution such that \[ \left(\begin{array}{c} x_i \\ y_i \end{array} \right) \sim \mathcal{N}\left(\left[\begin{array}{c} \hat{x}_i \\ m\hat{x}_i + b \end{array}\right],\mathbf{\Lambda}\right) \] where \(\mathbf{\Lambda}\) is our two-dimensional covariance matrix that we will define later. Thanks to the sheer magic of Gaussians, the expected frequency distribution is \[ p(y_i,x_i|m,b,\hat{x}_i,\mathbf{C}_i,\mathbf{\Lambda}) = \frac{1}{2\pi\sqrt{\det{\left(\mathbf{C}_i + \mathbf{\Lambda}\right)}}}\exp\left(-\frac{1}{2}{\mathbf{R}_i}\transpose\left(\mathbf{C}_i + \mathbf{\Lambda}\right)^{-1}\mathbf{R}_i\right) \quad . \]

Integrating out \(\hat{x}_i\) as before leaves \[ p(y_i,x_i|m,b,\mathbf{C}_i,\mathbf{\Lambda}) \propto \frac{1}{\sqrt{2\pi{\Sigma_i}^2}}\exp\left(-\frac{{\Delta_i}^2}{2{\Sigma_i}^2}\right) \] where \(\Delta_i = y_i - mx_i - b\) as before but \[ {\Sigma_i}^2 \,=\, \mathbf{V}\transpose\left(\mathbf{C}_i + \mathbf{\Lambda}\right)\mathbf{V} \quad . \]

We still need to define \(\mathbf{\Lambda}\). (I intentionally left this part until after we had defined our log likelihood function because it is the part that requires you to conceptualise the projection orthogonal to the line.) Consider first that the intrinsic variance is only in the \(x\) direction: \[ \mathbf{\Lambda} = \lambda^2\left[\begin{array}{cc} 1 & 0 \\ 0 & 0 \end{array}\right] \] Similarly, if the intrinsic variance projected only in the \(y\) direction: \[ \mathbf{\Lambda} = \lambda^2\left[\begin{array}{cc} 0 & 0 \\ 0 & 1 \end{array}\right] \] What we actually want is the orthogonal projection to the \(m\hat{x} + b\) line. You should prove this for yourself (numerically and mathematically), but the matrix we want to have the intrinsic scatter \(\lambda\) perpendicular to the line \(m\hat{x} + b\) is \[ \mathbf{\Lambda} = \frac{\lambda^2}{1 + m^2} \left[\begin{array}{cc} m^2 & -m \\ -m & 1 \end{array}\right] \quad . \]
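Here is a quick numerical sketch of the property you are asked to prove (the slope and scatter values below are arbitrary): \(\mathbf{\Lambda}\) has zero variance along the line direction and variance \(\lambda^2\) orthogonal to it.

import numpy as np

# Arbitrary slope and intrinsic scatter.
m, lam = 1.5, 0.7

Lambda = (lam**2 / (1 + m**2)) * np.array([[m**2, -m],
                                           [-m, 1.0]])

# Unit vectors along the line and orthogonal to it.
along = np.array([1.0, m]) / np.sqrt(1 + m**2)
across = np.array([-m, 1.0]) / np.sqrt(1 + m**2)

# No variance along the line; variance lambda^2 across it.
assert np.isclose(along @ Lambda @ along, 0.0)
assert np.isclose(across @ Lambda @ across, lam**2)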

The intrinsic scatter \(\lambda\) will be projected orthogonal to the line, but we still need to choose a prior on \(\lambda\). It's common to use a uniform prior in log-space (also referred to as a Jeffreys prior) for scale parameters, because it is invariant under changes of scale. The prior we adopt is \[ \ln\lambda \sim \mathcal{U}\left(-10, 10\right) \quad . \]

Ok, let's code it up.

# We will parameterize our model as:
#   theta = [b, m, ln_lambda]
def ln_prior(theta):
    b, m, ln_lambda = theta
    if ln_lambda > 10 or ln_lambda < -10:
        return -np.inf
    return -3/2 * np.log(1 + m**2)

def ln_likelihood(theta, x, y, C):
    b, m, ln_lambda = theta

    # Define projection vector.
    V = np.array([[-m, 1]]).T

    # Define orthogonal projection matrix.
    intrinsic_variance = np.exp(ln_lambda)**2
    Lambda = (intrinsic_variance / (1 + m**2)) * np.array([
        [m**2, -m],
        [-m, 1]
    ])

    Delta = y - m * x - b
    # Sigma2 is the projected variance, Sigma_i^2 = V^T (C_i + Lambda) V.
    Sigma2 = (V.T @ (C + Lambda) @ V).flatten()

    # Dropping constant terms.
    return np.sum(-0.5 * np.log(Sigma2) - 0.5 * Delta**2 / Sigma2)

def ln_probability(theta, x, y, C):
    lp = ln_prior(theta)
    if not np.isfinite(lp):
        return lp
    return lp + ln_likelihood(theta, x, y, C)

# Optimize from the linear algebra solution again,
# except we will assume ln_lambda = -3 for initialisation.
args = (x, y, covs)
initial_theta = np.hstack([X.T[0], -3])

# Optimize!
result = op.minimize(
    lambda *args: -ln_probability(*args),
    initial_theta,
    args=args,
    method="L-BFGS-B",
    bounds=[(None, None), (None, None), (-10, 10)]
)

# Sample!
ndim, nwalkers = (result.x.size, 32)
p0 = [result.x + 1e-5 * np.random.randn(ndim) for k in range(nwalkers)]

sampler = emcee.EnsembleSampler(nwalkers, ndim, ln_probability, args=args)

# Run the burn-in.
pos, *_ = sampler.run_mcmc(p0, 500)
sampler.reset()

# Run production.
sampler.run_mcmc(pos, 1000)

# Make a corner plot.
chain = sampler.chain.reshape((-1, ndim))
fig = corner(chain, labels=(r"$b$", r"$m$", r"$\log{\lambda}$"))

# Show the linear algebra solution in blue for comparison.
ax = fig.axes[3]
ax.scatter(*X.T[0], s=20, facecolor="tab:blue", zorder=100)
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=0, alpha=0.5,
                       facecolor="tab:blue", color="tab:blue", zorder=10))
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=2,
                       facecolor="none", color="tab:blue", zorder=60))

# Make posterior predictions.
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x, y, c="k", s=10)
for xi, yi, cov in zip(x, y, covs):
    ax.add_artist(_ellipse(xi, yi, cov, scale=2, color="k"))

xlim = np.array([0, 300])
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(7))

# Plot draws of the posterior.
for index in np.random.choice(chain.shape[0], size=100):
    b, m, ln_lambda = chain[index]
    ax.plot(xlim, m * xlim + b, "-", c="tab:purple", alpha=0.2, lw=0.5, zorder=-1)

ax.set_xlim(*xlim)
ax.set_ylim(0, 700)
fig.tight_layout()

How does this result compare to our previous results, where we did not consider uncertainties in the \(x\) direction, and where we did not include intrinsic scatter? Does the fit look better or worse to you? Does the model uncertainty look reasonable, given the data? Is the model uncertainty realistic? Are our priors reasonable, or is there a lot of posterior support at the edges? What changes would you make to this model so that it is more realistic, or more in line with your expectations?

In the next class we will discuss the concept of outliers, and how to objectively treat them using a mixture model.



Contributions

The example data table was taken from Hogg et al. (2010).

Dan Foreman-Mackey (Flatiron) first posted a blog post about fitting a plane to data, which was extremely useful for clarifying the formalism in this material.