Data analysis and machine learning

**Released:** 2022-03-22 14:00 (Melbourne time)

**Due date:** 2022-03-04 17:00 (Melbourne time) unless by prior arrangement

Your submission should be a PDF that includes all relevant figures. The PDF can be compiled from \(\LaTeX\), exported from a Jupyter notebook, or produced by similar means. You must also submit the code that reproduces your work in full.

Marks will depend on the results/figures that you produce, **and** the clarity and depth of your accompanying interpretation. Don't just submit figures and code! You must demonstrate understanding and justification of what you have submitted. Please ensure figures have appropriate axes, that you have adopted sensible mathematical nomenclature, *et cetera*.

There are 80 marks allocated to this problem set, and the problem set accounts for 12% of your final grade. For example, if you receive 72 marks on this problem set, it will contribute \(72/80 \times 12\% = 10.8\%\) to your final grade. In this problem set there are 8 questions that are each assigned 10 marks, but the mark distribution will differ for future problem sets.

You will need the following data set for this problem set. You can find a copy of this data in Python format in the course notes from week 1.

| ID | \(x\) | \(y\) | \(\sigma_y\) | \(\sigma_x\) | \(\rho_{xy}\) |
|---|---|---|---|---|---|
| 1 | 201 | 592 | 61 | 9 | -0.84 |
| 2 | 244 | 401 | 25 | 4 | 0.31 |
| 3 | 47 | 583 | 38 | 11 | 0.64 |
| 4 | 287 | 402 | 15 | 7 | -0.27 |
| 5 | 203 | 495 | 21 | 5 | -0.33 |
| 6 | 58 | 173 | 15 | 9 | 0.67 |
| 7 | 210 | 479 | 27 | 4 | -0.02 |
| 8 | 202 | 504 | 14 | 4 | -0.05 |
| 9 | 198 | 510 | 30 | 11 | -0.84 |
| 10 | 158 | 416 | 16 | 7 | -0.69 |
| 11 | 165 | 393 | 14 | 5 | 0.30 |
| 12 | 201 | 442 | 25 | 5 | -0.46 |
| 13 | 157 | 317 | 52 | 5 | -0.03 |
| 14 | 131 | 311 | 16 | 6 | 0.50 |
| 15 | 166 | 400 | 34 | 6 | 0.73 |
| 16 | 160 | 337 | 31 | 5 | -0.52 |
| 17 | 186 | 423 | 42 | 9 | 0.90 |
| 18 | 125 | 334 | 26 | 8 | 0.40 |
| 19 | 218 | 533 | 16 | 6 | -0.78 |
| 20 | 146 | 344 | 22 | 5 | -0.56 |
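For convenience, the table can be transcribed into NumPy arrays as in the minimal sketch below. The variable names (`data`, `x`, `y`, `sigma_y`, `sigma_x`, `rho_xy`) are illustrative choices and are not necessarily those used in the course notes.

```python
import numpy as np

# Columns follow the table above: ID, x, y, sigma_y, sigma_x, rho_xy.
data = np.array([
    [1, 201, 592, 61, 9, -0.84],
    [2, 244, 401, 25, 4, 0.31],
    [3, 47, 583, 38, 11, 0.64],
    [4, 287, 402, 15, 7, -0.27],
    [5, 203, 495, 21, 5, -0.33],
    [6, 58, 173, 15, 9, 0.67],
    [7, 210, 479, 27, 4, -0.02],
    [8, 202, 504, 14, 4, -0.05],
    [9, 198, 510, 30, 11, -0.84],
    [10, 158, 416, 16, 7, -0.69],
    [11, 165, 393, 14, 5, 0.30],
    [12, 201, 442, 25, 5, -0.46],
    [13, 157, 317, 52, 5, -0.03],
    [14, 131, 311, 16, 6, 0.50],
    [15, 166, 400, 34, 6, 0.73],
    [16, 160, 337, 31, 5, -0.52],
    [17, 186, 423, 42, 9, 0.90],
    [18, 125, 334, 26, 8, 0.40],
    [19, 218, 533, 16, 6, -0.78],
    [20, 146, 344, 22, 5, -0.56],
])

# Drop the ID column and unpack the remaining columns in table order.
x, y, sigma_y, sigma_x, rho_xy = data[:, 1:].T
```

Note that \(\sigma_y\) precedes \(\sigma_x\) in the table, so the unpacking order matters.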

\[ \newcommand{\transpose}{^{\scriptscriptstyle \top}} \newcommand{\vec}[1]{\mathbf{#1}} \]

**Question 1** *(Total 10 marks available)*

Let us assume a simple model of the form \[y_i \sim \mathcal{N}(mx_i+b,\sigma_{y_i}) \quad .\]

Cast this problem in matrix form and show how the expected frequency distribution is related to the \(\chi^2\) statistic.

**Question 2** *(Total 10 marks available)*

Write the definition of \(\chi^2\) in matrix form and take its first derivative with respect to \(\vec{X}\) (the vector of unknown model parameters). Hence show that the solution to \(\vec{Y} = \vec{A}\vec{X}\), with a covariance matrix \(\vec{C}\) and numerous implicit assumptions, is \[\vec{X} = \left(\vec{A}\transpose\vec{C}^{-1}\vec{A}\right)^{-1}\left(\vec{A}\transpose\vec{C}^{-1}\vec{Y}\right) \quad .\]
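The closed-form expression can be checked numerically once it has been derived. The sketch below is one possible check, assuming a diagonal covariance matrix built from \(\sigma_y\) only and a design matrix with columns \([1, x]\), so that \(\vec{X} = [b, m]\). These choices and all variable names are assumptions made for illustration.

```python
import numpy as np

x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146], dtype=float)
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344], dtype=float)
sigma_y = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                    14, 25, 52, 16, 34, 31, 42, 26, 16, 22], dtype=float)

# Design matrix A with columns [1, x], so the parameter vector is X = [b, m].
A = np.vander(x, 2, increasing=True)

# Inverse of the (assumed diagonal) covariance matrix C.
C_inv = np.diag(1.0 / sigma_y**2)

ATCinvA = A.T @ C_inv @ A
ATCinvY = A.T @ C_inv @ y

# Solve the linear system rather than forming the explicit inverse.
X_hat = np.linalg.solve(ATCinvA, ATCinvY)
b_hat, m_hat = X_hat

# Under the same assumptions, the inverse of A^T C^-1 A is the
# covariance matrix of the estimated parameters.
X_cov = np.linalg.inv(ATCinvA)
```

Using `np.linalg.solve` is a numerical design choice. It evaluates the same expression as the formula above but is generally better conditioned than multiplying by an explicit inverse.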

**Question 3** *(Total 10 marks available)*

Show that the matrix to project intrinsic scatter \(\lambda\) perpendicular to the line \(y = mx + b\) is \[ \vec{\Lambda} = \frac{\lambda^2}{1 + m^2}\left[\begin{array}{cc} m^2 & -m \\ -m & 1 \end{array}\right] \quad . \]
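A quick numerical sanity check can complement the algebra. The sketch below compares the stated matrix against \(\lambda^2 \hat{v}\hat{v}^{\top}\), where \(\hat{v}\) is the unit vector perpendicular to the line. The values of `m` and `lam` are arbitrary illustrative numbers, and the function name is hypothetical.

```python
import numpy as np


def intrinsic_scatter_matrix(m, lam):
    """Closed-form matrix from the expression above."""
    return lam**2 / (1.0 + m**2) * np.array([[m**2, -m], [-m, 1.0]])


m, lam = 2.0, 3.0  # arbitrary illustrative values, not fitted to any data

# Unit vector perpendicular to the line y = m x + b.
v = np.array([-m, 1.0]) / np.sqrt(1.0 + m**2)
# Unit vector along the line.
u = np.array([1.0, m]) / np.sqrt(1.0 + m**2)

# A variance of lam^2 along v, built as an outer product.
Lambda = lam**2 * np.outer(v, v)
assert np.allclose(Lambda, intrinsic_scatter_matrix(m, lam))

# Variance is lam^2 perpendicular to the line and zero along it.
perpendicular_variance = v @ Lambda @ v
parallel_variance = u @ Lambda @ u
```

A numerical check like this does not replace the derivation, but it catches sign errors in the off-diagonal terms.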

**Question 4** *(Total 10 marks available)*

Use the example data given above. Assume that the \(y\) values are generated by a straight line model with some intrinsic scatter about the line, and that there are uncertainties in the \(x\)- and \(y\)-directions, but no correlation between the \(x\)- and \(y\)-uncertainties.

You need to fully specify this model. That includes specifying the model parameters, the priors on those parameters, any marginalisations, the log likelihood function, and the log prior function.
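As an illustration of what a full specification translates to in code, here is a minimal sketch of one possible parameterisation. It is not the only valid choice. It assumes the line is described by an angle `theta` (with \(m = \tan\theta\)) and a perpendicular offset `b_perp` (with \(b = b_\perp / \cos\theta\)), that the true position along the line is marginalised under a flat prior so that only the perpendicular distance matters, and that the priors are flat within hand-picked bounds. The bounds and all function names are assumptions.

```python
import numpy as np


def log_prior(params):
    theta, b_perp, ln_lambda = params
    # Assumed flat priors in angle, perpendicular offset, and ln(lambda).
    if not (-np.pi / 2 < theta < np.pi / 2):
        return -np.inf
    if not (-1000 < b_perp < 1000 and -5 < ln_lambda < 7):
        return -np.inf
    return 0.0


def log_likelihood(params, x, y, sigma_x, sigma_y):
    theta, b_perp, ln_lambda = params
    s, c = np.sin(theta), np.cos(theta)
    # Perpendicular distance of each point from the line.
    delta = -s * x + c * y - b_perp
    # Measurement variance projected onto the perpendicular direction
    # (no x-y correlation), plus the intrinsic scatter variance lambda^2.
    variance = s**2 * sigma_x**2 + c**2 * sigma_y**2 + np.exp(2 * ln_lambda)
    return -0.5 * np.sum(np.log(2 * np.pi * variance) + delta**2 / variance)


def log_posterior(params, x, y, sigma_x, sigma_y):
    lp = log_prior(params)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(params, x, y, sigma_x, sigma_y)


x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146], dtype=float)
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344], dtype=float)
sigma_y = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                    14, 25, 52, 16, 34, 31, 42, 26, 16, 22], dtype=float)
sigma_x = np.array([9, 4, 11, 7, 5, 9, 4, 4, 11, 7,
                    5, 5, 5, 6, 6, 5, 9, 8, 6, 5], dtype=float)

# Example evaluation at an arbitrary point in parameter space.
example = log_posterior([np.arctan(2.0), 10.0, np.log(5.0)],
                        x, y, sigma_x, sigma_y)
```

Working in `theta` rather than `m` avoids a flat prior on the slope, which favours steep lines. Whichever parameterisation you adopt, the written specification should state it and justify the priors.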

**Question 5** *(Total 10 marks available)*

Implement the model specified in Question 4 in a programming language of your choice.

Specify a sensible initial guess for the model parameters, and then optimise those parameters using an optimisation algorithm of your choice. Use an off-the-shelf Markov-chain Monte Carlo package to sample the model posteriors.
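The initialise-then-optimise step might look like the sketch below. It assumes the angle and perpendicular-offset parameterisation (`theta`, `b_perp`, `ln_lambda`) with flat priors inside hand-picked bounds, takes the initial guess from a weighted straight-line fit that ignores the \(x\)-uncertainties, and uses SciPy's Nelder-Mead method. All of these are illustrative choices, not requirements.

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146], dtype=float)
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344], dtype=float)
sigma_y = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                    14, 25, 52, 16, 34, 31, 42, 26, 16, 22], dtype=float)
sigma_x = np.array([9, 4, 11, 7, 5, 9, 4, 4, 11, 7,
                    5, 5, 5, 6, 6, 5, 9, 8, 6, 5], dtype=float)


def log_posterior(params):
    theta, b_perp, ln_lambda = params
    # Assumed flat priors within hand-picked bounds.
    if not (-np.pi / 2 < theta < np.pi / 2):
        return -np.inf
    if not (-1000 < b_perp < 1000 and -5 < ln_lambda < 7):
        return -np.inf
    s, c = np.sin(theta), np.cos(theta)
    delta = -s * x + c * y - b_perp
    variance = s**2 * sigma_x**2 + c**2 * sigma_y**2 + np.exp(2 * ln_lambda)
    return -0.5 * np.sum(np.log(2 * np.pi * variance) + delta**2 / variance)


def negative_log_posterior(params):
    return -log_posterior(params)


# Initial guess: weighted linear fit ignoring x-uncertainties, converted
# to (theta, b_perp); the starting intrinsic scatter of 30 is a guess.
m0, b0 = np.polyfit(x, y, 1, w=1.0 / sigma_y)
theta0 = np.arctan(m0)
initial = np.array([theta0, b0 * np.cos(theta0), np.log(30.0)])

result = minimize(negative_log_posterior, initial, method="Nelder-Mead")

theta_opt, b_perp_opt, ln_lambda_opt = result.x
m_opt = np.tan(theta_opt)
b_opt = b_perp_opt / np.cos(theta_opt)
```

The optimised point is a natural place to initialise MCMC walkers. A log-posterior callable of this shape can be passed to an off-the-shelf sampler (for example, emcee's `EnsembleSampler` accepts a log-probability function), but the sampler settings and convergence checks are left to you.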

Plot the posterior distributions of the model parameters, and plot the posterior predictions of the model compared to the data.

**Question 6** *(Total 10 marks available)*

Re-run your model from Question 5, but this time assume that the uncertainties in the \(x\)- and \(y\)-directions were under-estimated.

The real errors are twice as large as the values given in the table above. Re-run your inferences with the real error values, and comment on changes to what you found in Question 5.

**Question 7** *(Total 10 marks available)*

Your collaborator who recorded the data has suddenly remembered that there were arbitrary correlations between the \(x\)- and \(y\)-uncertainties of individual data points (the \(\rho_{xy}\) column in the table above), in addition to there being intrinsic scatter in the model.

They have also just remembered that some of the data points may be errors, and were not actually generated by the straight line model.

You will need a model that accounts for intrinsic scatter, uncertainties in the \(x\)- and \(y\)-directions, correlations between the \(x\)- and \(y\)-uncertainties, and the erroneous measurements (the outliers).

Fully specify this model: the model parameters, the priors on those parameters, any marginalisations, the log prior, the log likelihood, and the log posterior.
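One way such a specification could translate to code is the two-component mixture sketched below. It rests on several assumptions that you would need to justify or replace. The line is parameterised by `theta` and `b_perp`. Each point's full \(2\times2\) covariance matrix is projected onto the direction perpendicular to the line. Outliers are modelled as a broad Gaussian in that same perpendicular coordinate, with mean `mu_o` and extra scatter `exp(ln_sigma_o)`. The mixing fraction `q` is the probability that a point belongs to the line. All names and prior bounds are hypothetical.

```python
import numpy as np


def projected_variance(theta, sigma_x, sigma_y, rho_xy):
    """v^T S_i v for v = (-sin(theta), cos(theta)) and full covariance S_i."""
    s, c = np.sin(theta), np.cos(theta)
    return (s**2 * sigma_x**2
            - 2 * s * c * rho_xy * sigma_x * sigma_y
            + c**2 * sigma_y**2)


def log_prior(params):
    theta, b_perp, ln_lambda, q, mu_o, ln_sigma_o = params
    # Assumed flat priors within hand-picked bounds.
    inside = (-np.pi / 2 < theta < np.pi / 2 and -1000 < b_perp < 1000
              and -5 < ln_lambda < 7 and 0 < q < 1
              and -1000 < mu_o < 1000 and -5 < ln_sigma_o < 10)
    return 0.0 if inside else -np.inf


def log_likelihood(params, x, y, sigma_x, sigma_y, rho_xy):
    theta, b_perp, ln_lambda, q, mu_o, ln_sigma_o = params
    delta = -np.sin(theta) * x + np.cos(theta) * y - b_perp
    projected = projected_variance(theta, sigma_x, sigma_y, rho_xy)
    var_line = projected + np.exp(2 * ln_lambda)
    var_out = projected + np.exp(2 * ln_sigma_o)
    # Log density of each component, weighted by its mixing fraction.
    ln_line = np.log(q) - 0.5 * (np.log(2 * np.pi * var_line)
                                 + delta**2 / var_line)
    ln_out = np.log1p(-q) - 0.5 * (np.log(2 * np.pi * var_out)
                                   + (delta - mu_o)**2 / var_out)
    # Marginalise over component membership for each point.
    return np.sum(np.logaddexp(ln_line, ln_out))


def log_posterior(params, x, y, sigma_x, sigma_y, rho_xy):
    lp = log_prior(params)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(params, x, y, sigma_x, sigma_y, rho_xy)


x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146], dtype=float)
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344], dtype=float)
sigma_y = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                    14, 25, 52, 16, 34, 31, 42, 26, 16, 22], dtype=float)
sigma_x = np.array([9, 4, 11, 7, 5, 9, 4, 4, 11, 7,
                    5, 5, 5, 6, 6, 5, 9, 8, 6, 5], dtype=float)
rho_xy = np.array([-0.84, 0.31, 0.64, -0.27, -0.33, 0.67, -0.02, -0.05,
                   -0.84, -0.69, 0.30, -0.46, -0.03, 0.50, 0.73, -0.52,
                   0.90, 0.40, -0.78, -0.56])

# Example evaluation at an arbitrary point in parameter space.
example = log_posterior(
    [np.arctan(2.2), 15.0, np.log(10.0), 0.8, 0.0, np.log(200.0)],
    x, y, sigma_x, sigma_y, rho_xy)
```

Summing over the two components with `np.logaddexp` marginalises the per-point outlier membership analytically, so no per-point indicator parameters are needed. Other outlier distributions are equally defensible.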

**Question 8** *(Total 10 marks available)*

Implement the model from Question 7 in your programming language of choice.

Define the log likelihood function, the log prior function, and the log posterior probability function.

Calculate the log posterior probability on a grid in the model parameters \(\boldsymbol{\theta}\), using sensible step sizes in each dimension, and plot the log posterior probability in all dimensions. Indicate the grid point with the highest log probability.
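The grid evaluation itself can be sketched as follows. To keep the example short, it uses a stand-in two-parameter log posterior (a straight line with \(y\)-uncertainties only and flat priors) rather than the full model. The grid ranges and step counts are arbitrary assumptions. The loop works unchanged for any number of dimensions.

```python
import numpy as np

x = np.array([201, 244, 47, 287, 203, 58, 210, 202, 198, 158,
              165, 201, 157, 131, 166, 160, 186, 125, 218, 146], dtype=float)
y = np.array([592, 401, 583, 402, 495, 173, 479, 504, 510, 416,
              393, 442, 317, 311, 400, 337, 423, 334, 533, 344], dtype=float)
sigma_y = np.array([61, 25, 38, 15, 21, 15, 27, 14, 30, 16,
                    14, 25, 52, 16, 34, 31, 42, 26, 16, 22], dtype=float)


def log_posterior(theta):
    # Stand-in model (assumption): theta = (m, b), flat priors in a box,
    # Gaussian likelihood using the y-uncertainties only.
    m, b = theta
    if not (0 < m < 5 and -200 < b < 400):
        return -np.inf
    return -0.5 * np.sum(((y - m * x - b) / sigma_y) ** 2)


# One 1-D array of grid values per parameter (assumed ranges and steps).
axes = [np.linspace(0.5, 3.0, 51), np.linspace(-100.0, 300.0, 81)]
shape = tuple(len(axis) for axis in axes)

# Evaluate the log posterior at every grid point.
log_prob = np.empty(shape)
for index in np.ndindex(*shape):
    theta = [axis[i] for axis, i in zip(axes, index)]
    log_prob[index] = log_posterior(theta)

# Locate the grid point with the highest log probability.
best_index = np.unravel_index(np.argmax(log_prob), shape)
best_theta = [axis[i] for axis, i in zip(axes, best_index)]
```

For models with more than two parameters, one option for plotting is to reduce `log_prob` over all but two axes (for example with `log_prob.max(axis=...)`) and draw one panel per pair of parameters. Keep in mind that the number of evaluations grows as the product of the grid sizes, so the step size in each dimension needs some thought.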