Problem Set 3

Data analysis and machine learning

Due date: 2021-03-18 17:00 (Melbourne time) unless by prior arrangement

Your submission should be in the form of a PDF that includes relevant figures. The PDF can be compiled from $\LaTeX$, exported from a Jupyter notebook, or similar. You must also submit code scripts that reproduce your work in full.

Marks will depend on the results/figures that you produce, and the clarity and depth of your accompanying interpretation. Don't just submit figures and code! You must demonstrate understanding and justification of what you have submitted. Please ensure figures have appropriate axes, that you have adopted sensible mathematical nomenclature, et cetera.

\[ \newcommand{\transpose}{^{\scriptscriptstyle \top}} \newcommand{\vec}[1]{\mathbf{#1}} \]

In total there are 3 questions in this problem set, with a total of 60 marks available.

Question 1

(Total 10 marks available)

Draw a probabilistic graphical model for the model you specified in Question 7 of Problem Set 1.

Question 2

(Total 20 marks available)

Generate $N = 1000$ data points that are drawn from a mixture of $K = 5$ one-dimensional Gaussians. You can either set the model parameters $\vec{\theta} = \{\vec{\mu},\vec{\sigma},\vec{\pi}\}$ yourself, or set them randomly.

If you set them randomly, remember to seed the random number generator directly before creating the random values.
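A minimal Python sketch of seeded data generation, using only the standard-library `random` module. The helper name `sample_mixture` and the parameter values are illustrative assumptions, not values prescribed by this problem set.

```python
import random


def sample_mixture(n, mu, sigma, pi, seed=0):
    """Draw n points from a one-dimensional Gaussian mixture."""
    # Seed a dedicated generator directly before drawing any random values.
    rng = random.Random(seed)
    # Choose a component index for each point with probability pi[k].
    components = rng.choices(range(len(pi)), weights=pi, k=n)
    # Draw each point from the Gaussian of its chosen component.
    return [rng.gauss(mu[k], sigma[k]) for k in components]


# Illustrative (assumed) parameters for K = 5 components.
mu = [-6.0, -2.0, 0.5, 3.0, 7.0]
sigma = [0.5, 1.0, 0.4, 0.8, 1.2]
pi = [0.1, 0.2, 0.3, 0.25, 0.15]

y = sample_mixture(1000, mu, sigma, pi, seed=42)
```

Calling `sample_mixture` again with the same seed returns the same data, which is what makes the work reproducible.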

Question 2, Part A

Specify the model.

Question 2, Part B

Choose an initial guess for $\vec{\theta}$ that is 'far' away from the true values.

Provide and explain the equation for membership probabilities for each of the $N$ data points to each of the $K$ mixtures, and calculate these membership probabilities conditioned on the initial estimate of the model parameters given above.

Provide and explain the equation for updating the model parameters, conditioned on some set of membership probabilities. Calculate new estimates of the model parameters $\vec{\theta} = \{\vec{\mu}, \vec{\sigma}, \vec{\pi}\}$ conditioned on the membership probabilities that you have just calculated.
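One possible way to organise a single Expectation step and Maximization step in Python is sketched below. It assumes the standard E-M updates for a one-dimensional Gaussian mixture; the function names `normal_pdf`, `e_step`, and `m_step` are hypothetical, and the sketch is not a substitute for providing and explaining the equations.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def e_step(data, mu, sigma, pi):
    """Return membership probabilities resp[n][k] and the log likelihood."""
    resp, log_like = [], 0.0
    for y in data:
        # Weighted density of each component at this data point.
        weighted = [p * normal_pdf(y, m, s) for m, s, p in zip(mu, sigma, pi)]
        # For guesses very far from the data, total can underflow to zero;
        # a log-space implementation would avoid that.
        total = sum(weighted)
        resp.append([w / total for w in weighted])
        log_like += math.log(total)
    return resp, log_like


def m_step(data, resp):
    """Update (mu, sigma, pi) given the membership probabilities."""
    n, n_components = len(data), len(resp[0])
    mu, sigma, pi = [], [], []
    for k in range(n_components):
        # Effective number of points assigned to component k.
        n_k = sum(r[k] for r in resp)
        mean_k = sum(r[k] * y for r, y in zip(resp, data)) / n_k
        var_k = sum(r[k] * (y - mean_k) ** 2 for r, y in zip(resp, data)) / n_k
        mu.append(mean_k)
        sigma.append(math.sqrt(var_k))
        pi.append(n_k / n)
    return mu, sigma, pi
```

Alternating `e_step` and `m_step` and recording the returned log likelihood each time gives a quantity that should never decrease between iterations, which is a useful check on an implementation.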

Question 2, Part C

Write code to calculate what you did in Question 2, Part B, and alternate between the Expectation and Maximization steps for 100 iterations. Store the log likelihood at every iteration. Make a plot showing the log likelihood as a function of E-M step.

Question 2, Part D

Make a figure showing the data points, and the probability density for each of the $K$ mixtures using the model parameters found after 100 E-M steps.

Question 3

(Total 30 marks available)

There is overwhelming evidence of anthropogenic climate change. This question relates to what can be inferred from only a subset of the evidence available for global warming: a fictitious set of global temperatures spanning 135 years.

In this file you will find 1,000 fictitious time series. Each series has length 135, assuming one measurement per year of temperature deviation from the mean, covering the time period from 1880 to 2014, inclusive. The data were first generated by drawing 1,000 random series (with some homoscedastic noise). Then, some of those series were randomly selected and had a trend added to them. The trends that were added were either +1°C / century or -1°C / century.

A bet has been offered to anyone who can correctly classify at least 900 of the series: which were generated without a trend, and which were generated with a trend. The prize is US$100,000.

Question 3, Part A

Make a figure showing all of the time series as a function of year.

Question 3, Part B

Specify a generative model for these data.

Question 3, Part C

Implement this model in a programming language of your choice, and sample the parameters. Plot the chains. Has your MCMC converged? Plot the posterior probability distributions, and calculate percentiles (16th, 50th, 84th) to quote a central value and uncertainty for each model parameter.

Question 3, Part D

For each time series, calculate posterior log probabilities of it being an unaffected time series, a time series with +1°C / century added, or a time series with -1°C / century added.
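When the three unnormalised log probabilities for a series are very negative, exponentiating them directly can underflow. A hedged sketch of the usual log-sum-exp normalisation is given below; the function name `normalise_log_probs` and the example numbers are assumptions for illustration only.

```python
import math


def normalise_log_probs(log_weights):
    """Normalise unnormalised log probabilities so their exponentials sum to 1."""
    # Subtract the largest value before exponentiating to avoid underflow.
    peak = max(log_weights)
    log_norm = peak + math.log(sum(math.exp(w - peak) for w in log_weights))
    return [w - log_norm for w in log_weights]


# Made-up log weights for: no trend, +1 degree/century, -1 degree/century.
log_post = normalise_log_probs([-1000.0, -1001.0, -1002.0])
```

Here `log_post` holds normalised log probabilities even though `math.exp(-1000.0)` on its own evaluates to zero in double precision.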

Question 3, Part E

For each time series, find the highest probability of membership from the available mixtures. Sum this value for all time series. This gives an expectation (well, an estimate of the expectation) for the number of series you may have estimated correctly: \[ E_\mathrm{correct} \approx \sum_{j=1}^{1000} \max\left(p_{j,k}\right) \quad . \] The uncertainty (likewise, an estimate) for the number of series you have estimated correctly can be calculated by \[ \sigma_\mathrm{correct} \approx \sqrt{\sum_{j=1}^{1000} \left[\max\left(p_{j,k}\right)\left(1-\max\left(p_{j,k}\right)\right)\right]} \quad . \]
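The two sums above can be sketched in Python as follows, assuming the membership probabilities are held as a list of rows with one row per series and one (linear, not log) probability per mixture. The name `expected_correct` and the toy numbers are illustrative assumptions, not results.

```python
import math


def expected_correct(memberships):
    """Return (E_correct, sigma_correct) from per-series membership probabilities."""
    # Highest membership probability for each series.
    best = [max(row) for row in memberships]
    expectation = sum(best)
    # Each classification is treated as an independent Bernoulli trial.
    sigma = math.sqrt(sum(b * (1.0 - b) for b in best))
    return expectation, sigma


# Toy input with two series and three mixtures (made-up numbers).
toy = [[0.8, 0.1, 0.1], [0.5, 0.3, 0.2]]
e_correct, sigma_correct = expected_correct(toy)
```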

How many series do you expect to have calculated correctly? What is the uncertainty on that expectation value?

Question 3, Part F

Assuming a normal distribution, what are the chances that you would correctly identify 900 or more time series as being: an unaffected time series, a time series with +1°C / century added, or a time series with -1°C / century added? If you had to pay $10 to submit an entry to this competition, is it a worthwhile competition to enter (a worthwhile bet)?
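Under the normal approximation, the chance of reaching a threshold is an upper-tail probability, which can be evaluated with the complementary error function. The sketch below uses `math.erfc` from the Python standard library; the function names are hypothetical, and the mean and standard deviation passed in should come from your own analysis.

```python
import math


def prob_at_least(threshold, mean, sigma):
    """P(X >= threshold) for X ~ Normal(mean, sigma^2)."""
    z = (threshold - mean) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def expected_profit(p_win, prize=100_000.0, fee=10.0):
    """Expected net return of one entry: winnings weighted by p_win, minus the fee."""
    return p_win * prize - fee
```

With these definitions the bet has a positive expected return whenever `p_win` exceeds `fee / prize`, which is 0.0001 for the default values.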