Data analysis and machine learning

Consider that you have been given some data set consisting of \(x\) and \(y\) values. You are asked to provide a model that describes these data. Each data point has uncertainties in the \(x\)- and \(y\)-direction, and there is some covariance between the uncertainties in each direction. Where do you start? What will you assume?

Here are the data:

ID | \(x\) | \(y\) | \(\sigma_y\) | \(\sigma_x\) | \(\rho_{xy}\) |
---|---|---|---|---|---|

1 | 201 | 592 | 61 | 9 | -0.84 |

2 | 244 | 401 | 25 | 4 | 0.31 |

3 | 47 | 583 | 38 | 11 | 0.64 |

4 | 287 | 402 | 15 | 7 | -0.27 |

5 | 203 | 495 | 21 | 5 | -0.33 |

6 | 58 | 173 | 15 | 9 | 0.67 |

7 | 210 | 479 | 27 | 4 | -0.02 |

8 | 202 | 504 | 14 | 4 | -0.05 |

9 | 198 | 510 | 30 | 11 | -0.84 |

10 | 158 | 416 | 16 | 7 | -0.69 |

11 | 165 | 393 | 14 | 5 | 0.30 |

12 | 201 | 442 | 25 | 5 | -0.46 |

13 | 157 | 317 | 52 | 5 | -0.03 |

14 | 131 | 311 | 16 | 6 | 0.50 |

15 | 166 | 400 | 34 | 6 | 0.73 |

16 | 160 | 337 | 31 | 5 | -0.52 |

17 | 186 | 423 | 42 | 9 | 0.90 |

18 | 125 | 334 | 26 | 8 | 0.40 |

19 | 218 | 533 | 16 | 6 | -0.78 |

20 | 146 | 344 | 22 | 5 | -0.56 |

Always plot the data first.

Or better yet: how do you think *other people* might (*incorrectly*) address this problem? You might assume that there is a straight line that relates \(x\) and \(y\). *Someone* might then assume that the point near \((x, y) = (50, 600)\) is not actually part of the data: it's a mistake, or an error, because it is an outlier. But what of some of the other points that are far away? When do you stop cutting outliers? And how do you account for the errors in each data point? Some have large errors in \(y\) but small errors in \(x\), do they count the same as large errors in \(x\) and small errors in \(y\)? Is a straight line good enough? Do we need some higher order polynomial to properly capture all the data points?

Let us assume that the data can be modelled by a straight line, the simplest model we can think of. We could imagine more exotic models, but for good reasons you will learn later, we won't. Note that often we do not know if the model that we have adopted is the *true model* or not.

Here a model — or more specifically — a generative model is one that can describe how measured data points were generated. You can imagine that a generative model is one where, if you *"turned the crank"*, then it would generate more data points for you (perhaps given some inputs \(x\)). When data can be modelled statistically then there is very little arbitrariness: if the choice of model or method is not consistent, then that will become apparent.

*A generative model is a parameterised, quantitative description of a statistical procedure that could reasonably have generated the data set you have.*

The standard practice for fitting a line to data is one where you have a set of \(N > 2\) points \((x_i, y_i)\), with no uncertainty in the \(x\) measurements (*i.e.*, perfect knowledge), and correctly known Gaussian uncertainties \(\sigma_{yi}\) in the \(y\) direction. You want to find the function

\[f(x) = mx + b\]

where \(m\) is the slope and \(b\) is the intercept that best fits the data.*best* means later on.

We need to describe an objective function*objective*, where here it is to find \(m\) and \(b\) that best fits the data.

Let us imagine that the data *truly* do come from a line with the form \(y = f(x) = mx + b\). We will further assume that there are no uncertainties in the \(x\) direction, or that those uncertainties are infinitesimally small. Then the only reason why any data point \(y\) deviates from this perfect narrow straight line is because each true \(y\) value has a small \(y\)-direction offset that has been added, where the offset was *drawn from**'drawn from'*
will appear a lot. In mathematical nomenclature you can also state that if \(x \sim \mathcal{N}\left(0, 1\right)\) then \(x\) is drawn from a normal (Gaussian) distribution of zero mean and unit variance. The \(\mathcal{N}\) symbol is merely shorthand for normal.
*e.g.*, that it is a photon counting device) then you would be more justified in stating that the data are drawn from a Poisson distribution. In either case, the distribution is important because it gives us a *frequency distribution* for the data.

Under this model if you are given an independent position \(x_i\), an uncertainty in the \(y\)-direction \(\sigma_{yi}\), a slope \(m\), and an intercept \(b\), then the frequency distribution \(p(y_i|x_i,\sigma_{yi},m,b)\) for \(y_i\) is \[p(y_i|x_i,\sigma_{yi},m,b) = \frac{1}{\sqrt{2\pi\sigma_{yi}^2}}\exp\left(-\frac{\left[y_i-mx_i - b\right]^2}{2\sigma_{yi}^2}\right)\]

which naturally peaks at \(mx_i + b\) and describes the expected frequency (or probability) of getting a value in the range \([y_i, y_i + dy]\) per unit \(dy\). Given many data points, we seek to find — or rather, optimise — the parameters \(m\) and \(b\) that maximise the probability (or likelihood) of the observed data given the model. Under the adopted model we assume that the data points are drawn independently, which means the total likelihood \(\mathcal{L}\) is the product of conditional probabilities of each data point \[\mathcal{L} = \prod_{i=1}^{N}p(y_i|x_i,\sigma_{yi},m,b) \quad .\]You can imagine this if you assumed some obviously incorrect set of parameters \((m,b)\) and evaluated the expected frequency for just a few data points. Let's consider that even though your set of parameters \((m,b)\) are wrong, by chance they almost exactly produce the first data point you observe \(y_1\). So much so that the difference between \(y_1\) and \(mx_1 + b\) is infinitesimally small compared to the \(y-\)direction uncertainty \(\sigma_{y1}\) on that data point. The probability of getting that value \(y_1\) given our incorrect \((m,b)\) parameters will be \(\approx 1\). Now consider the second data point \(y_2\), where the difference between \(y_2\) and \(mx_2 + b\) is enormous compared to the \(y\)-direction uncertainty \(\sigma_{y2}\), leading to a probability \(\approx 0\). Imagine the third point \(y_3\) is similarly discrepant, leading to another product of \(\approx 0\) By taking the product of conditional probabilities these values will \(\approx 0\) for some incorrect set of \((m,b)\), even if one data point happened to be consistent with the model. Unsurprisingly, our inference on the line parameters \((m,b)\) will only improve (asymptotically) with more data points.

When we are faced with the situation of taking the product of (potentially many) values close to zero, it's a good idea to take the logarithm and sum the values

In doing so, you will see this is not just a numerical convenience. Since the \(\sigma_{yi}\) values are constant, we can simplify this to \[\log{\mathcal{L}} = -\sum_{i=1}^{N}\frac{\left[y_i-mx_i -b\right]^2}{2\sigma_{yi}^2} + K\] \[\log{\mathcal{L}} = -\frac{1}{2}\chi^2 + K\] where \(K\) is some constant.

You may have heard of minimising \(\chi^2\) as a means to find the *"best fit"* parameters for some model, or you may have heard of the \(\chi^2\) value measure of the *"goodness of fit"* for some data set. Here you can see — under strict assumptions that we have not explicitly listed — exactly why that is so. Thanks to the (negative) \(-\frac{1}{2}\) coefficient shown above, in order to maximise the likelihood (or probability) we need to minimise the \(\chi^2\) value. Under this model, if you are minimising the \(\chi^2\) value you are implicitly maximising the probability.
\[\textrm{if }\chi^2 \approx 1000 \textrm{ then } \log\mathcal{L} \approx -500\]
\[\textrm{if }\chi^2 \approx 1 \textrm{ then } \log\mathcal{L} \approx -0.5\]

It is useful to show how this relation emerges to understand when it is valid. Consider our model \(f(x) = mx + b\), except we will introduce the matrices \[\mathbf{Y} = \left[\begin{array}{c} y_{1} \\ y_{2} \\ \cdots \\ y_N \end{array}\right]\] \[\mathbf{A} = \left[\begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ 1 & \cdots \\ 1 & x_N \end{array}\right]\] \[\mathbf{C} = \left[\begin{array}{cccc} \sigma_{y1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{y2}^2 & \cdots & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & \cdots & \sigma_{yN}^2 \end{array}\right]\]

where \(\mathbf{A}\) is the *design matrix* and \(\mathbf{C}\) is the covariance matrix, where we have ignored the off-diagonal components.

If we introduce the column vector \(\mathbf{X}\) have as entries the line parameters \(m\) and \(b\) \[ \mathbf{X} = \left[\begin{array}{c} b \\ m \end{array}\right] \] then we can re-write our line equation in matrix form \[ \mathbf{Y} = \mathbf{A}\mathbf{X} \quad . \]

But you can't solve that equation because it is overconstrained. Instead, we must consider (or *weight*) the importance of each data point given the covariance matrix \(\mathbf{C}\)
\[
\newcommand{\transpose}{^{\scriptscriptstyle \top}}
\begin{array}{rcl}
\mathbf{Y} &=& \mathbf{AX} \\
\mathbf{C}^{-1}\mathbf{Y} &=& \mathbf{C}^{-1}\mathbf{AX} \\
\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{Y} &=& \mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{AX} \\
\left[\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{Y}\right] &=& \left[\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{A}\right]\mathbf{X}
\end{array}
\]
such that the best-fit values \(\mathbf{X}\) are given by
\[
\newcommand{\transpose}{^{\scriptscriptstyle \top}}
\mathbf{X} = \left[\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{A}\right]^{-1}\left[\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{Y}\right] \quad .
\]

When we weighted each data point by its corresponding diagonal entry in the inverse covariance matrix

Perhaps now you can appreciate that maximising the likelihood is identical to \(\chi^2\) minimisation, and that the vector \(\mathbf{X}\) contains as entries the line parameters \(b\) and \(m\) that produce the lowest \(\chi^2\) value, or the maximum \(\log\mathcal{L}\) value. Normally things are not so simple; usually we cannot solve for unknown parameters using linear algebra, forcing us to use optimisation algorithms to find the best fitting model parameters.

Let's find and plot the least-squares solution:

If the model we have adopted is correct (*i.e.*, the data are truly drawn from a narrow straight line with no errors in the \(x\)-direction), and the reported uncertainties on the \(y\) values are correct, and there is no rejection or clipping of outliers, then we already have uncertainties on the line parameters \(m\) and \(b\)
\[
\newcommand{\transpose}{^{\scriptscriptstyle \top}}
\left[
\begin{array}{cc}
\sigma_{b}^2 & \sigma_{mb} \\
\sigma_{mb} & \sigma_{m}^2
\end{array}
\right] = \left[\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{A}\right]^{-1}
\]
which you will recall from earlier equations.

You should take a time to appreciate how rare it is to meet all of these assumptions in the real world. Rarely is the generative model we adopt the *true* model for the data, and rarely do the reported variances correctly describe the uncertainty in the measurements. And there are a host of other assumptions that must hold, which we have not listed here! For these reasons, it is likely that in most practical problems these uncertainties will be underestimated.

A nice way to visualise the uncertainty (while properly accounting for the covariances) is to draw from this multi-dimensional distribution, and plot the projections of those draws:

Does this look like a good fit? What's good about it? What's bad about it?

Sometimes you may be asked to show the \(\chi^2\) contours across some parameter space to give some intuition for how well the model describes the data. In these situation is it preferable to plot \(\exp\left(-\frac{1}{2}\chi^2\right)\) or \(\chi^2\)? Both will show us the model parameters where the likelihood is maximised, so does it actually matter what we show?

Let's plot the contours of \(\chi^2\) first, since that is a common thing people do.

Well, that looks kind of reasonable. The minimum \(\chi^2\) value appears at about the point where we have plotted the result from linear algebra, and the ellipse showing the covariance matrix \(\left[\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{A}\right]^{-1}\) seems aligned with the valley of low \(\chi^2\) values. Are we done? What's unusual here?

One thing to consider might be that valley, we just described. The valley in \(\chi^2\) seems to extend well beyond the plotting axes. If we naively plotted \(\chi^2\) contours like this, and if we didn't know that our formal uncertainties are given by \(\left[\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{A}\right]^{-1}\), we might *think* that the uncertainties in \(m\) and \(b\) are *enormous*! And from looking at the draws from the model, we might believe the two figures are consistent because the overall fit to the data is pretty poor. In practice, we are plotting slightly different things.

If we are wanting to visualise formal uncertainty in the model parameters, it's clear that there's some discrepancy between the \(\chi^2\) contours and \(\left[\mathbf{A}\transpose\mathbf{C}^{-1}\mathbf{A}\right]^{-1}\). Let's plot the alternative, \(\exp\left(-\frac{1}{2}\chi^2\right)\).

Now we're cooking! Why is this the right thing to visualise when we want to understand the uncertainty contours? The simple answer is because \(\exp\left(-\frac{1}{2}\chi^2\right)\) is proportional to our likelihood, whereas \(\chi^2\) is proportional to the *log* likelihood! Recall
\[
\log{\mathcal{L}} = -\frac{1}{2}\chi^2 + K
\]
such that the total likelihood (the product of conditional probabilities of each data point) is
\[
\mathcal{L} \propto \exp\left(-\frac{1}{2}\chi^2\right) \quad .
\]
And we can see that quite nicely here. The \(3\sigma\) ellipse exactly overlaps where we see support for \(\exp(-\frac{1}{2}\chi^2)\). And even if the scale of \(\exp(-\frac{1}{2}\chi^2)\) would be challenging to optimise numerically, it is only *because* the likelihood drops off so quickly! We won't say much more here other than to say that you should be wary of over-interpreting \(\chi^2\) (and \(\chi_r^2\)) values

In the next class we will explore the Bayesian generalisation of this problem, and how to incorporate uncertainty in both the \(x\) and \(y\) directions.

The example data table was taken from Hogg *et al.* (2010).*et al*. took the time to write that pedagogical note, which has been a continuously useful reference for me and many colleagues.

The idea to clarify the difference between visualising \(\chi^2\) and \(\exp(-\frac{1}{2}\chi^2)\) came from Boris Leistedt (Imperial College London) through his excellent write-up of fitting a line to data that he presented at the 2017 AstroHackWeek.