Problem Set 1 Solutions

Data analysis and machine learning

\[ \newcommand{\transpose}{^{\scriptscriptstyle \top}} \newcommand{\vec}[1]{\mathbf{#1}} \]

Question 1

(Total 10 marks available)

Let us assume a simple model of the form \[y_i \sim \mathcal{N}(mx_i+b,\sigma_{y_i}) \quad .\]

Cast this problem in matrix form and show how the expected frequency distribution is related to the \(\chi^2\) statistic.

Let \[ \mathbf{A} = \left[\begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{array}\right] \quad , \] \[ \mathbf{X} = \left[\begin{array}{c} b \\ m \end{array}\right] \quad , \] and \[ \mathbf{Y} = \left[\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_N \end{array}\right] \] such that \[ \vec{Y} = \vec{AX} \quad . \]

The data are drawn from a normal distribution \[ \vec{Y} \sim \mathcal{N}\left(\vec{AX},\vec{C}\right) \] where \[ \mathbf{C} = \left[\begin{array}{cccc} \sigma_{y_1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{y_2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{y_N}^2 \end{array}\right] \quad . \]

The probability density function for a normal distribution in \(D\) dimensions is \[ p(\vec{Y}) = \frac{1}{(2\pi)^{D/2}|{\vec{C}}|^{1/2}}\exp{\left(-\frac{1}{2}\left[\vec{Y}-\vec{AX}\right]\transpose\vec{C}^{-1}\left[\vec{Y}-\vec{AX}\right]\right)} \quad , \] so the log probability density function is \[ \log{}p(\vec{Y}) = -\frac{1}{2}\left[\vec{Y}-\vec{AX}\right]\transpose\vec{C}^{-1}\left[\vec{Y}-\vec{AX}\right] - \frac{D}{2}\log{2\pi} -\frac{1}{2}\log{|\vec{C}|} \] \[ \log{}p(\vec{Y}) = -\frac{1}{2}\left[\vec{Y}-\vec{AX}\right]\transpose\vec{C}^{-1}\left[\vec{Y}-\vec{AX}\right] + \mathrm{constant~terms} \] \[ \log{}p(\vec{Y}) = -\frac{1}{2}\chi^2 + \mathrm{constant~terms} \] since \[ \chi^2 = \left[\vec{Y}-\vec{AX}\right]\transpose\vec{C}^{-1}\left[\vec{Y}-\vec{AX}\right] \quad . \] That is, maximising the likelihood is equivalent to minimising \(\chi^2\).
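As a quick numerical illustration (made-up data, for illustration only), the hypothetical sketch below builds \(\vec{A}\), \(\vec{C}\), and \(\vec{Y}\) for a few points and checks that the quadratic form above agrees with the familiar per-point \(\chi^2\) sum.

import numpy as np

# Made-up example data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
y_err = np.array([0.2, 0.3, 0.2, 0.4])

# Trial parameters X = [b, m].
b, m = 0.1, 2.0
X = np.array([b, m])

A = np.vstack([np.ones_like(x), x]).T   # design matrix
C = np.diag(y_err**2)                   # diagonal covariance matrix

# chi^2 written as a quadratic form ...
residual = y - A @ X
chi2_matrix = residual @ np.linalg.inv(C) @ residual

# ... equals the usual per-point sum.
chi2_sum = np.sum((y - (m * x + b))**2 / y_err**2)
assert np.isclose(chi2_matrix, chi2_sum)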

Question 2

(Total 10 marks available)

Write the definition of \(\chi^2\) in matrix form, take its first derivative with respect to \(\vec{X}\) (the vector of unknown model parameters), and show that the solution to \(\vec{Y} = \vec{A}\vec{X}\), with a covariance matrix \(\vec{C}\) and numerous implicit assumptions, is \[\vec{X} = \left(\vec{A}\transpose\vec{C}^{-1}\vec{A}\right)^{-1}\left(\vec{A}\transpose\vec{C}^{-1}\vec{Y}\right) \quad .\]

The definition of \(\chi^2\) in matrix form is: \[ \chi^2 = \left[\vec{Y}-\vec{AX}\right]\transpose\vec{C}^{-1}\left[\vec{Y}-\vec{AX}\right] \] This can be re-cast as \(\chi^2 = \vec{G}\vec{H}\), where both \(\vec{G}\) and \(\vec{H}\) are functions of \(\vec{X}\). Specifically, \[ \vec{G}(\vec{X}) = \left[\vec{Y}-\vec{AX}\right]\transpose \] and \[ \vec{H}(\vec{X}) = \vec{C}^{-1}\left[\vec{Y}-\vec{AX}\right] \quad . \] Using the product rule, \[ \frac{\partial\chi^2}{\partial\vec{X}} = \frac{\partial\vec{G}}{\partial\vec{X}}\vec{H} + \vec{G}\frac{\partial\vec{H}}{\partial\vec{X}} \] and expanding terms leads to \[ \frac{\partial\chi^2}{\partial\vec{X}} = -\vec{A}\transpose\vec{C}^{-1}\left[\vec{Y}-\vec{A}\vec{X}\right] + \left[\vec{Y}-\vec{A}\vec{X}\right]\transpose\vec{C}^{-1}\left(-\vec{A}\right) \] \[ \frac{\partial\chi^2}{\partial\vec{X}} = -\vec{A}\transpose\vec{C}^{-1}\vec{Y} + \vec{A}\transpose\vec{C}^{-1}\vec{AX}-\vec{Y}\transpose\vec{C}^{-1}\vec{A} + \left(\vec{AX}\right)\transpose\vec{C}^{-1}\vec{A} \] Now, \[ \vec{A}\transpose\vec{C}^{-1}\vec{Y} \equiv \left(\vec{Y}\transpose\vec{C}^{-1}\vec{A}\right)\transpose \quad , \] so the first and third terms are transposes of one another and combine once every term is written as a column vector. Similarly, let \(\vec{Z} = \vec{A}\transpose\vec{C}^{-1}\vec{A}\), which is symmetric, so that \(\vec{Z}\vec{X} + \left(\vec{X}\transpose\vec{Z}\right)\transpose = 2\vec{Z}\vec{X}\).

This reduces to \[ \frac{\partial\chi^2}{\partial\vec{X}} = -2\vec{A}\transpose\vec{C}^{-1}\vec{Y} + 2\vec{A}\transpose\vec{C}^{-1}\vec{A}\vec{X} \quad . \] Setting \[ \frac{\partial\chi^2}{\partial\vec{X}} = 0 \] at the minimum gives \[ 2\vec{A}\transpose\vec{C}^{-1}\vec{A}\hat{\vec{X}} = 2\vec{A}\transpose\vec{C}^{-1}\vec{Y} \] \[ \hat{\vec{X}} = \left(\vec{A}\transpose\vec{C}^{-1}\vec{A}\right)^{-1}\left(\vec{A}\transpose\vec{C}^{-1}\vec{Y}\right) \] as required.
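As a hypothetical numerical check (made-up data, illustration only), the sketch below evaluates \(\hat{\vec{X}}\) from the expression above and verifies that the gradient of \(\chi^2\) vanishes there.

import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.8, 6.1, 8.0, 9.9])
C = np.diag(np.array([0.2, 0.3, 0.2, 0.4, 0.3])**2)

A = np.vstack([np.ones_like(x), x]).T
C_inv = np.linalg.inv(C)

# X_hat = (A^T C^-1 A)^-1 (A^T C^-1 Y), computed via a linear solve for stability.
X_hat = np.linalg.solve(A.T @ C_inv @ A, A.T @ C_inv @ y)

# The gradient -2 A^T C^-1 (Y - A X) should vanish at X_hat.
gradient = -2 * A.T @ C_inv @ (y - A @ X_hat)
assert np.allclose(gradient, 0)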

Question 3

(Total 10 marks available)

Show that the matrix to project intrinsic scatter \(\lambda\) perpendicular to the line \(y = mx + b\) is \[ \vec{\Lambda} = \frac{\lambda^2}{1 + m^2}\left[\begin{array}{cc} m^2 & -m \\ -m & 1 \end{array}\right] \quad . \]
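A quick way to see this: the unit vector orthogonal to the line \(y = mx + b\) is \(\hat{\vec{v}} = \left[-m,\, 1\right]\transpose/\sqrt{1+m^2}\), and projecting a variance \(\lambda^2\) onto that direction gives \(\vec{\Lambda} = \lambda^2\,\hat{\vec{v}}\hat{\vec{v}}\transpose\), which expands to the matrix above. The hypothetical sketch below (illustration only, with an arbitrary slope) checks this numerically.

import numpy as np

# Illustration only: check Lambda = lambda^2 * v_hat v_hat^T for one arbitrary slope.
m, intrinsic_scatter = 1.7, 0.3

# Unit vector orthogonal to the line y = m x + b.
v_hat = np.array([[-m], [1.0]]) / np.sqrt(1 + m**2)

Lambda_outer = intrinsic_scatter**2 * (v_hat @ v_hat.T)
Lambda_stated = (intrinsic_scatter**2 / (1 + m**2)) * np.array([
    [m**2, -m],
    [-m, 1.0],
])
assert np.allclose(Lambda_outer, Lambda_stated)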

Question 4

(Total 10 marks available)

Use the example data given above. Assume that the \(y\) values are generated by a straight line model where there is some intrinsic scatter to the line, and there are uncertainties in the \(x\)- and \(y\)-direction, but there is no correlation between the \(x\)- and \(y\)-uncertainties.

You need to fully specify this model. That includes specifying the model parameters, the priors on those parameters, any marginalisations, the log likelihood function, and the log prior function.
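For reference, here is a sketch of one possible specification, chosen to be broadly consistent with the implementation given for Question 5 (whose parameter names and prior bounds are adopted); it is not the only valid choice. The model parameters are \(\theta = (b, m, \ln\lambda)\), where \(\lambda\) is the intrinsic scatter orthogonal to the line. A flat (improper) prior is taken on \(b\), and the unknown true position of each datum along the line is marginalised out by projecting each point and its covariance matrix onto the unit vector orthogonal to the line, \(\hat{\vec{v}} = \left[-m,\, 1\right]\transpose/\sqrt{1+m^2}\):

\[ \vec{S}_i = \left[\begin{array}{cc} \sigma_{x_i}^2 & 0 \\ 0 & \sigma_{y_i}^2 \end{array}\right] , \qquad \Delta_i = \frac{y_i - m x_i - b}{\sqrt{1+m^2}} , \qquad \Sigma_i^2 = \hat{\vec{v}}\transpose\vec{S}_i\hat{\vec{v}} + \lambda^2 \] \[ \log\mathcal{L}(\theta) = -\frac{1}{2}\sum_{i=1}^{N}\left[\log\left(2\pi\Sigma_i^2\right) + \frac{\Delta_i^2}{\Sigma_i^2}\right] \] \[ \log p(\theta) = -\frac{3}{2}\log\left(1+m^2\right) + \mathrm{constant} \quad \mathrm{for}~{-10} < \ln\lambda < 10 ,~\mathrm{and}~{-\infty}~\mathrm{otherwise} \] \[ \log p(\theta\,|\,\vec{Y}) = \log p(\theta) + \log\mathcal{L}(\theta) + \mathrm{constant} \quad . \]

In words: the prior is flat in \(b\), proportional to \((1+m^2)^{-3/2}\) in \(m\), and uniform in \(\ln\lambda\) on \((-10, 10)\); the likelihood is Gaussian in the orthogonal displacement \(\Delta_i\) with variance \(\Sigma_i^2\).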

Question 5

(Total 10 marks available)

Implement the model specified in Question 4 in a programming language of your choice.

Specify a sensible initial guess for the model parameters, and then optimise those parameters using an optimisation algorithm of your choice. Use an off-the-shelf Markov-chain Monte Carlo package to sample the model posteriors.

Plot the posterior distributions of the model parameters, and plot the posterior predictions of the model compared to the data.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from matplotlib.patches import Ellipse
from scipy import optimize as op
import emcee
from corner import corner

# For reproducibility
np.random.seed(0)

# Load the data.
_, x, y, y_err, x_err, rho_xy = data = np.array([
    [1, 201, 592, 61, 9, -0.84],
    [2, 244, 401, 25, 4, +0.31],
    [3, 47, 583, 38, 11, +0.64],
    [4, 287, 402, 15, 7, -0.27],
    [5, 203, 495, 21, 5, -0.33],
    [6, 58, 173, 15, 9, +0.67],
    [7, 210, 479, 27, 4, -0.02],
    [8, 202, 504, 14, 4, -0.05],
    [9, 198, 510, 30, 11, -0.84],
    [10, 158, 416, 16, 7, -0.69],
    [11, 165, 393, 14, 5, +0.30],
    [12, 201, 442, 25, 5, -0.46],
    [13, 157, 317, 52, 5, -0.03],
    [14, 131, 311, 16, 6, +0.50],
    [15, 166, 400, 34, 6, +0.73],
    [16, 160, 337, 31, 5, -0.52],
    [17, 186, 423, 42, 9, +0.90],
    [18, 125, 334, 26, 8, +0.40],
    [19, 218, 533, 16, 6, -0.78],
    [20, 146, 344, 22, 5, -0.56],
]).T

# Helpful function.
def _ellipse(x, y, cov, scale=2, **kwargs):
    vals, vecs = np.linalg.eig(cov)
    theta = np.degrees(np.arctan2(*vecs[::-1, 0]))
    w, h = scale * np.sqrt(vals)
    kwds = dict(lw=0.5)
    kwds.update(**kwargs)
    ellipse = Ellipse(xy=[x, y], width=w, height=h, angle=theta, **kwds)
    ellipse.set_facecolor("none")
    return ellipse

# Question 5.
def ln_prior(theta):
    b, m, ln_lambda = theta
    if ln_lambda > 10 or ln_lambda < -10:
        return -np.inf
    return -3/2 * np.log(1 + m**2)

def ln_likelihood(theta, x, y, C):
    b, m, ln_lambda = theta
    # Define projection vector.
    V = np.array([[-m, 1]]).T
    # Define orthogonal projection matrix.
    intrinsic_variance = np.exp(ln_lambda)**2
    Lambda = (intrinsic_variance / (1 + m**2)) * np.array([
        [m**2, -m],
        [-m, 1]
    ])
    Delta = (y - m * x - b)
    Sigma = (V.T @ (C + Lambda) @ V).flatten()
    # Dropping constant terms.
    return np.sum(-np.log(Sigma) - 0.5 * Delta**2 / Sigma)

def ln_probability(theta, x, y, C):
    lp = ln_prior(theta)
    if not np.isfinite(lp):
        return lp
    return lp + ln_likelihood(theta, x, y, C)

# Optimize from linear algebra solution again,
# except we will assume ln_lambda = -3 for initialisation.
# No correlations.
covs = np.array([[[x_e**2, 0], [0, y_e**2]]
                 for y_e, x_e, rho in zip(*data[3:])])
args = (x, y, covs)

Y = np.atleast_2d(y).T
A = np.vstack([np.ones_like(x), x]).T
C = np.diag(y_err**2 + x_err**2)
C_inv = np.linalg.inv(C)
G = np.linalg.inv(A.T @ C_inv @ A)
X = G @ (A.T @ C_inv @ Y)

initial_theta = np.hstack([X.T[0], -3])

# Optimize!
result = op.minimize(
    lambda *args: -ln_probability(*args),
    initial_theta,
    args=args,
    method="L-BFGS-B",
    bounds=[(None, None), (None, None), (-10, 10)]
)

# Sample!
ndim, nwalkers = (result.x.size, 32)
p0 = [result.x + 1e-5 * np.random.randn(ndim) for k in range(nwalkers)]
sampler = emcee.EnsembleSampler(
    nwalkers, ndim, ln_probability, args=args
)

# Run the burn-in.
pos, *_ = sampler.run_mcmc(p0, 500)
sampler.reset()

# Run production.
sampler.run_mcmc(pos, 1000)

# Make a corner plot.
chain = sampler.chain.reshape((-1, ndim))
fig = corner(
    chain,
    labels=(r"$b$", r"$m$", r"$\log{\lambda}$")
)

# Show the linear algebra solution in blue for comparison.
ax = fig.axes[3]
ax.scatter(*X.T[0], s=20, facecolor="tab:blue", zorder=100)
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=0, alpha=0.5,
                       facecolor="tab:blue", color="tab:blue", zorder=10))
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=2, facecolor="none",
                       color="tab:blue", zorder=60))
fig.savefig("ps1-q5-solution-1.png", dpi=300)

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x, y, c="k", s=10)
ax.errorbar(x, y, xerr=x_err, yerr=y_err, fmt="o", lw=1, c="k")
xlim = np.array([0, 300])
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(7))

# Plot draws of the posterior.
for index in np.random.choice(chain.shape[0], size=100):
    b, m, ln_lambda = chain[index]
    ax.plot(
        xlim, m * xlim + b,
        "-", c="tab:red", alpha=0.1, lw=0.5, zorder=-1
    )

ax.set_xlim(*xlim)
ax.set_ylim(0, 700)
fig.tight_layout()
fig.savefig("ps1-q5-solution-2.png", dpi=300)

This produces two figures: the corner plot of the posterior samples (ps1-q5-solution-1.png) and the posterior draws compared to the data (ps1-q5-solution-2.png).

Question 6

(Total 10 marks available)

Re-run your model from Question 5, but this time let us assume that the errors in the \(x\)- and \(y\)-direction were incorrectly under-estimated!

The real errors are twice as large as the values given in the table above. Re-run your inferences with the real error values, and comment on changes to what you found in Question 5.

# Question 6.
x_err *= 2
y_err *= 2

covs = np.array([[[x_e**2, 0], [0, y_e**2]]
                 for y_e, x_e in zip(y_err, x_err)])
args = (x, y, covs)

Y = np.atleast_2d(y).T
A = np.vstack([np.ones_like(x), x]).T
C = np.diag(y_err**2 + x_err**2)
C_inv = np.linalg.inv(C)
G = np.linalg.inv(A.T @ C_inv @ A)
X = G @ (A.T @ C_inv @ Y)

initial_theta = np.hstack([X.T[0], -3])

# Optimize!
result = op.minimize(
    lambda *args: -ln_probability(*args),
    initial_theta,
    args=args,
    method="L-BFGS-B",
    bounds=[(None, None), (None, None), (-10, 10)]
)

# Sample!
ndim, nwalkers = (result.x.size, 32)
p0 = [result.x + 1e-5 * np.random.randn(ndim) for k in range(nwalkers)]
sampler = emcee.EnsembleSampler(
    nwalkers, ndim, ln_probability, args=args
)

# Run the burn-in.
pos, *_ = sampler.run_mcmc(p0, 500)
sampler.reset()

# Run production.
sampler.run_mcmc(pos, 1000)

# Make a corner plot.
chain = sampler.chain.reshape((-1, ndim))
fig = corner(
    chain,
    labels=(r"$b$", r"$m$", r"$\log{\lambda}$")
)

# Show the linear algebra solution in blue for comparison.
ax = fig.axes[3]
ax.scatter(*X.T[0], s=20, facecolor="tab:blue", zorder=100)
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=0, alpha=0.5,
                       facecolor="tab:blue", color="tab:blue", zorder=10))
ax.add_artist(_ellipse(*X.T[0], G, scale=3, lw=2, facecolor="none",
                       color="tab:blue", zorder=60))
fig.savefig("ps1-q6-solution-1.png", dpi=300)

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x, y, c="k", s=10)
ax.errorbar(x, y, xerr=x_err, yerr=y_err, fmt="o", lw=1, c="k")
xlim = np.array([0, 300])
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
ax.xaxis.set_major_locator(MaxNLocator(6))
ax.yaxis.set_major_locator(MaxNLocator(7))

# Plot draws of the posterior.
for index in np.random.choice(chain.shape[0], size=100):
    b, m, ln_lambda = chain[index]
    ax.plot(
        xlim, m * xlim + b,
        "-", c="tab:red", alpha=0.1, lw=0.5, zorder=-1
    )

ax.set_xlim(*xlim)
ax.set_ylim(0, 700)
fig.tight_layout()
fig.savefig("ps1-q6-solution-2.png", dpi=300)

Question 7

(Total 10 marks available)

Your collaborator who recorded the data has suddenly remembered that there were arbitrary correlations between the uncertainties in individual data points, in addition to there being intrinsic scatter in the model.

They have also just remembered that some of the data points may be errors, and were not actually generated by the straight line model.

You will need to fully specify a model that accounts for intrinsic scatter, uncertainties in the \(x\)- and \(y\)-direction, and correlations between the \(x\)- and \(y\)-uncertainties, and that accounts for the erroneous measurements (the outliers).

Fully specify this model: the model parameters, the priors on those parameters, any marginalisations, the log prior, the log likelihood, and the log posterior.
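For reference, here is a sketch of one possible specification, written to be broadly consistent with the implementation given for Question 8 (whose parameter names and prior bounds are adopted); it is not the only valid choice. The parameters are \(\theta = (b, m, \ln\lambda, Q, \mu_o, \ln\sigma_v)\): the line parameters and intrinsic scatter as before, a mixing fraction \(Q\) of points generated by the line, and the mean \(\mu_o\) and excess scatter \(\sigma_v\) of a Gaussian "background" model for the outliers in \(y\). The per-point covariance matrices now carry the correlation coefficients \(\rho_i\), each datum is projected onto \(\hat{\vec{v}} = \left[-m,\, 1\right]\transpose/\sqrt{1+m^2}\) as in Question 4, and the binary outlier label of each point is marginalised out analytically, which yields a mixture likelihood:

\[ \vec{S}_i = \left[\begin{array}{cc} \sigma_{x_i}^2 & \rho_i\sigma_{x_i}\sigma_{y_i} \\ \rho_i\sigma_{x_i}\sigma_{y_i} & \sigma_{y_i}^2 \end{array}\right] , \qquad \Delta_i = \frac{y_i - m x_i - b}{\sqrt{1+m^2}} , \qquad \Sigma_i^2 = \hat{\vec{v}}\transpose\vec{S}_i\hat{\vec{v}} + \lambda^2 \] \[ \mathcal{L}_{\mathrm{fg},i} = \frac{1}{\sqrt{2\pi\Sigma_i^2}}\exp\left(-\frac{\Delta_i^2}{2\Sigma_i^2}\right) , \qquad \mathcal{L}_{\mathrm{bg},i} = \frac{1}{\sqrt{2\pi\left(\sigma_v^2+\sigma_{y_i}^2\right)}}\exp\left(-\frac{\left(y_i-\mu_o\right)^2}{2\left(\sigma_v^2+\sigma_{y_i}^2\right)}\right) \] \[ \log\mathcal{L}(\theta) = \sum_{i=1}^{N}\log\left[Q\,\mathcal{L}_{\mathrm{fg},i} + \left(1-Q\right)\mathcal{L}_{\mathrm{bg},i}\right] \] \[ \log p(\theta) = -\frac{3}{2}\log\left(1+m^2\right) + \mathrm{constant} \quad \mathrm{for}~0<Q<1,~{-10}<\ln\lambda<10,~{-10}<\ln\sigma_v<10 ,~\mathrm{and}~{-\infty}~\mathrm{otherwise} \] \[ \log p(\theta\,|\,\vec{Y}) = \log p(\theta) + \log\mathcal{L}(\theta) + \mathrm{constant} \quad . \]

Flat (improper) priors are taken on \(b\) and \(\mu_o\), and the log posterior is the sum of the log prior and the log likelihood, up to an unknown normalising constant.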

Question 8

(Total 10 marks available)

Implement the model from Question 7 in your programming language of choice.

Define the log likelihood function, the log prior function, and the log posterior probability function.

Calculate the log posterior probability on a grid in \(\mathbf{\theta}\), using sensible step sizes in each dimension, and plot the log posterior probability in all dimensions. Indicate the grid point with the highest log probability.

_, x, y, y_err, x_err, rho_xy = data = np.array([
    [1, 201, 592, 61, 9, -0.84],
    [2, 244, 401, 25, 4, +0.31],
    [3, 47, 583, 38, 11, +0.64],
    [4, 287, 402, 15, 7, -0.27],
    [5, 203, 495, 21, 5, -0.33],
    [6, 58, 173, 15, 9, +0.67],
    [7, 210, 479, 27, 4, -0.02],
    [8, 202, 504, 14, 4, -0.05],
    [9, 198, 510, 30, 11, -0.84],
    [10, 158, 416, 16, 7, -0.69],
    [11, 165, 393, 14, 5, +0.30],
    [12, 201, 442, 25, 5, -0.46],
    [13, 157, 317, 52, 5, -0.03],
    [14, 131, 311, 16, 6, +0.50],
    [15, 166, 400, 34, 6, +0.73],
    [16, 160, 337, 31, 5, -0.52],
    [17, 186, 423, 42, 9, +0.90],
    [18, 125, 334, 26, 8, +0.40],
    [19, 218, 533, 16, 6, -0.78],
    [20, 146, 344, 22, 5, -0.56],
]).T

def ln_prior(theta):
    b, m, ln_lambda, Q, mu_o, ln_sigma_v = theta
    in_bounds = (1 > Q) * (Q > 0) \
        * (10 > ln_sigma_v) * (ln_sigma_v > -10) \
        * (10 > ln_lambda) * (ln_lambda > -10)
    if not in_bounds:
        return -np.inf
    return -3/2 * np.log(1 + m**2)

def ln_likelihood_fg(theta, x, y, C):
    b, m, ln_lambda, Q, mu_o, ln_sigma_v = theta
    # Define projection vector.
    V = np.array([[-m, 1]]).T
    # Define orthogonal projection matrix.
    intrinsic_variance = np.exp(ln_lambda)**2
    Lambda = (intrinsic_variance / (1 + m**2)) * np.array([
        [m**2, -m],
        [-m, 1]
    ])
    Delta = (y - m * x - b)
    Sigma = (V.T @ (C + Lambda) @ V).flatten()
    # Dropping constant terms.
    return (-np.log(Sigma) - 0.5 * Delta**2 / Sigma)

# And the outlier model as 'background'.
def ln_likelihood_bg(theta, x, y, C):
    *_, Q, mu_o, ln_sigma_v = theta
    # Total variance is y_err^2 + sigma_v^2.
    Sigma = np.exp(ln_sigma_v)**2 + C[:, 1, 1]
    return (-0.5 * (y - mu_o)**2 / Sigma - np.log(Sigma))

def ln_probability(theta, x, y, C, full_output=True):
    lp = ln_prior(theta)
    if not np.isfinite(lp):
        if full_output:
            return (-np.inf, np.nan * np.ones((2, x.size)))
        return -np.inf
    b, m, ln_lambda, Q, mu_o, ln_sigma_v = theta
    # Compute weighted foreground likelihoods for each data point.
    # (They are weighted by the log(Q) term)
    ll_fg = np.log(Q) + ln_likelihood_fg(theta, x, y, C)
    # Compute weighted background likelihoods for each data point.
    ll_bg = np.log(1 - Q) + ln_likelihood_bg(theta, x, y, C)
    # Sum the log of the sum of the exponents of the log likelihoods.
    ll = np.sum(np.logaddexp(ll_fg, ll_bg))
    if full_output:
        return (lp + ll, np.vstack([ll_fg, ll_bg]))
    else:
        return lp + ll

covs = np.array([[[x_e**2, x_e * y_e * rho], [x_e * y_e * rho, y_e**2]]
                 for y_e, x_e, rho in zip(y_err, x_err, rho_xy)])
args = (x, y, covs)

initial_theta = np.array([250, 1, 1, 0.5, 400, -3])
lp, lls = ln_probability(initial_theta, x, y, covs)

import itertools

n_bins = 10
grid_params = [
    np.linspace(-200, 200, n_bins),  # b
    np.linspace(0, 4, n_bins),       # m
    np.linspace(-5, 5, n_bins),      # ln_lambda
    np.linspace(0.1, 0.9, n_bins),   # Q
    np.linspace(1, 700, n_bins),     # mu_o
    np.linspace(-5, 5, n_bins),      # ln_sigma_v
]
K = len(grid_params)

import operator, functools
from tqdm import tqdm

# Calculate the number of products.
P = functools.reduce(operator.mul, map(len, grid_params), 1)

grid = np.empty((P, K))
lps = np.empty(P)
for p, point in enumerate(tqdm(itertools.product(*grid_params), total=P)):
    lps[p] = ln_probability(point, x, y, covs, full_output=False)
    grid[p] = point

# Now make figures.
vmax = np.max(lps)
fig, axes = plt.subplots(K - 1, K - 1, figsize=(8.5, 8.5))
labels = [r"$b$", r"$m$", r"$\log{\lambda}$", r"$Q$", r"$\mu_o$", r"$\log{\sigma_v}$"]

for i in range(K):
    for j in range(K):
        try:
            ax = axes[j, i]
        except IndexError:
            continue
        if i > j:
            ax.set_visible(False)
            continue

        x_idx = i
        y_idx = j + 1
        x_points = grid_params[x_idx]
        y_points = grid_params[y_idx]

        # Get the values.
        unique_points = np.unique(grid[:, [x_idx, y_idx]], axis=0)
        max_lp = np.zeros(unique_points.shape[0])

        # Get max log probability at each point.
        for k, unique_point in enumerate(unique_points):
            mask = np.all(unique_point == grid[:, [x_idx, y_idx]], axis=1)
            max_lp[k] = np.max(lps[mask])

        x_bounds = (np.min(x_points), np.max(x_points))
        y_bounds = (np.min(y_points), np.max(y_points))
        imshow_kwds = dict(
            origin="lower",
            extent=(*x_bounds, *y_bounds),
            aspect=np.ptp(x_bounds)/np.ptp(y_bounds),
            vmax=vmax
        )
        Z = max_lp.reshape((np.unique(x_points).size, -1)).T
        ax.imshow(
            Z,
            cmap="Blues",
            **imshow_kwds
        )
        xi, yi = np.unravel_index(np.argmax(Z), Z.shape)
        ax.scatter([x_points[yi]], [y_points[xi]], c="#000000", zorder=10)

        if j == K - 2:  # last row of panels
            ax.set_xlabel(labels[x_idx])
        else:
            ax.set_xticklabels([])
        if i == 0:  # first column of panels
            ax.set_ylabel(labels[y_idx])
        else:
            ax.set_yticklabels([])
        ax.set_xlim(*x_bounds)
        ax.set_ylim(*y_bounds)

fig.tight_layout()
fig.savefig("ps1-q8-solution.png", dpi=300)