# Problem Set 4

Data analysis and machine learning

Due date: 2021-03-25 17:00 (Melbourne time) unless by prior arrangement

Your submission should be in the form of a PDF that includes relevant figures. The PDF can be compiled from $\LaTeX$ or output from a Jupyter notebook, or similar. You must also submit code scripts that reproduce your work in full.

Marks will depend on the results/figures that you produce, and the clarity and depth of your accompanying interpretation. Don't just submit figures and code! You must demonstrate understanding and justification of what you have submitted. Please ensure figures have appropriate axes, that you have adopted sensible mathematical nomenclature, et cetera.

$\newcommand{\transpose}{^{\scriptscriptstyle \top}} \newcommand{\vec}[1]{\mathbf{#1}}$

There are 2 questions in this problem set, with a total of 30 marks available.

### Question 1

(Total 10 marks available)

Define and explain the following terms in machine learning. You may provide equations or example figures to aid your explanation.

1. Precision
2. Recall
3. F1 score
4. Regularisation, including LASSO and Ridge regression
5. Confusion matrices

### Question 2

(Total 20 marks available)

(This question requires you to install the scikit-learn Python package.)

You are tasked with performing dimensionality reduction on a set of images of human faces. In this question you will use an existing package to do the PCA for you. Remember that if you were to do this yourself, your data must be mean-centred in every dimension before you do PCA, and it is common practice to also scale every dimension to unit variance!
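As a reminder of what that preprocessing looks like, here is a minimal sketch on synthetic data (the array `X` here is just a stand-in for a real design matrix, not the face images):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # stand-in data matrix

# Mean-centre each dimension (column) and scale it to unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 in every dimension
print(X_std.std(axis=0))   # 1 in every dimension
```

The same result can be obtained with `sklearn.preprocessing.StandardScaler`, and `sklearn.decomposition.PCA` performs the mean-centring step for you internally.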

```python
from sklearn.datasets import fetch_lfw_people

faces = fetch_lfw_people(min_faces_per_person=50)

# Who are these people?!
print(faces.target_names)
# ['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
#  'Gerhard Schroeder' 'Hugo Chavez' 'Jacques Chirac' 'Jean Chretien'
#  'John Ashcroft' 'Junichiro Koizumi' 'Serena Williams' 'Tony Blair']

# What do their faces look like?
print(faces.images.shape)
# (1560, 62, 47)

# The target name index for each image (0 = Ariel Sharon, etc.)
print(faces.target.shape)
# (1560,)
print(faces.target)
# [11  4  2 ...  3 11  5]
```

You can see that we have 1,560 images, where each image is 62 pixels by 47 pixels. Let's plot an image, say the first image of Serena Williams.

```python
import matplotlib.pyplot as plt
import numpy as np

# Find the first image whose target is Serena Williams.
index = np.where(faces.target == list(faces.target_names).index("Serena Williams"))[0][0]

fig, ax = plt.subplots(figsize=(4, 4.75))
ax.imshow(faces.images[index], cmap="binary_r")
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$y$")
fig.tight_layout()
```

Serena Williams is better at tennis than you and me.

Use PCA (`sklearn.decomposition.PCA`) to fit these images and find the first 150 principal components. You will find that the dimensionality of these data is large, so vanilla PCA will have a hard time. Use the randomized SVD solver (`svd_solver="randomized"`), which uses random projections to approximate the first $N$ components. Make a figure showing the first 50 components (eigenfaces) as images, with one image per panel. This code example might help you set up your figure:
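Something along the following lines will lay out a grid of panels, one component per panel. (This sketch uses random data as a stand-in for the face images so that it runs on its own; in your script, fit the PCA to the real design matrix of shape `(1560, 62 * 47)` instead.)

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random stand-in for the flattened face images; replace with your real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 62 * 47))

pca = PCA(n_components=150, svd_solver="randomized", random_state=0)
pca.fit(X)

# Plot the first 50 components as images, one per panel.
fig, axes = plt.subplots(5, 10, figsize=(10, 6))
for component, ax in zip(pca.components_, axes.flat):
    ax.imshow(component.reshape(62, 47), cmap="binary_r")
    ax.set_xticks([])
    ax.set_yticks([])
fig.tight_layout()
```

Each row of `pca.components_` is one principal component; reshaping it back to 62 x 47 pixels turns it into an eigenface image.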