Data analysis and machine learning

ASP5020/PHS5020 sub-unit for the ASP/PHS Masters courses

Data analysis and machine learning are different things (one could reasonably argue that machine learning is a subset of data analysis, but no-one could reasonably argue the opposite), and there is a huge amount of material that could be included from either field. However, time is finite.

Breadth and depth are both important in these fields.

Breadth provides you with justified intuition for what approach is suitable for a given problem, so that you're not always just choosing "the most recent thing you learned about in machine learning" when confronted with a new problem. Depth provides you with an expert understanding of a particular set of topics, allowing you to take more advanced steps in your field of interest (be it physics, chemistry, applied machine learning, etc.).

To balance the competing interests of breadth and depth, this sub-unit is designed such that the lectures will provide you breadth on a large number of topics in data analysis and machine learning, and the assignments will give you the opportunity to develop depth in particular areas.

Assumed knowledge

In this sub-unit I will assume that you have at least a graduate-level understanding of the following topics:

Tensors
Linear algebra
Conditional probability
Python programming

Classes

The 2022 Semester 1 course will be taught in person. The timetable for in-person lectures is:

Tuesdays 11:00-12:00 in Room 110
Thursdays 11:00-12:00 in Room 110
Thursdays 14:00-15:00 in Room 110

Week 1 starts 2022 February 21. The course will run for six weeks.

The course material is covered through detailed notes that you will find on this website, and through pre-recorded lectures that are hosted on Panopto and linked from the class Moodle page. I strongly encourage all students (in person or remote) to watch the pre-recorded lectures and read the course notes carefully. As you read the course notes, work through the derivations and type out the example code for yourself (do not copy and paste).

Any student who carefully watches the pre-recorded lectures and studies the course notes will have sufficient understanding to do well in this unit. Attendance at the in-person lectures is not necessary, but is recommended. Remote students will have weekly Zoom sessions to discuss the course material in a semi-ordered fashion (i.e., not scripted lectures, and not a free-form discussion, but somewhere in between). This weekly session is optional for remote students.

Class material

The material for each class includes detailed explanations on the topic, interleaved with mathematical expressions, interactive diagrams, and example code, to make it as easy as possible for you to intuit a concept. I encourage you to replicate and run the code examples given, and to play with the diagrams to understand these concepts.

(I had hoped to make all of the interactive diagrams myself, but it turns out this adds a significant additional overhead to producing good quality course material. For this reason I have included excellent visualisations that are available on the internets, with license and citation. References to visualisations from elsewhere are explicitly listed in the Contributions of each page, and often in the text.)

I also strongly encourage you never to copy-and-paste the example code that is provided. Instead, read the code in one window and type it out (character by character) for yourself in another window. This produces exactly the same code as copy-and-paste, but the activity of typing it yourself will help increase your understanding of how the underlying software works, and solidify it in your mind.

Week 1: Fitting a model to data

Week 2: Inference

Week 3: Advanced models

Week 4: Supervised and unsupervised learning

Week 5: Introduction to neural networks

Week 6: Deep neural networks and causality

Assessment

There is no exam for this unit. Instead, there are lots of assessments:

In the problem sets and assignments you will be required to solve problems using the data analysis and/or machine learning methods that you have learned in class. You will be graded on the results that you find, the accompanying code that you provide with your report (your code does not need to be beautiful; it just needs to reproduce your work when executed), and the accompanying text that explains your findings. Do not just submit figures and/or code: you must explain your logic and interpret the results appropriately.

I suggest that you use Google Colaboratory to work on your assignments and problem sets. Ideally you would submit your problem sets and assignments as a PDF (produced with LaTeX) that adequately answers each question and includes code to reproduce your analysis. However, if you prefer, you can submit a Google Colab notebook, a Jupyter notebook you have executed locally, or similar. Whichever format you choose, your code and results must be easily reproducible for every submission.
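If your analysis uses random numbers (simulated data, random initialisation, and so on), one common way to make the results repeat from run to run is to fix the seed of the random number generator. The following is only a minimal sketch using the Python standard library. The seed value, the function name `run_analysis`, and the toy "analysis" itself are hypothetical and are not taken from any problem set.

```python
import random
import statistics

SEED = 5020  # arbitrary illustrative value; any fixed integer would do


def run_analysis(seed: int = SEED) -> float:
    """Toy 'analysis': the mean of 1000 simulated noisy measurements."""
    # A local generator avoids relying on hidden global random state.
    rng = random.Random(seed)
    samples = [rng.gauss(10.0, 2.0) for _ in range(1000)]
    return statistics.fmean(samples)


first = run_analysis()
second = run_analysis()

# Same seed and same code give exactly the same number on every call.
assert first == second
print(f"mean = {first:.4f}")
```

Passing the seed in as an argument, rather than burying it inside the function, makes it easy to state in your report which seed produced which figure or number.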

The Problem Sets are weekly. Each Problem Set is based on one week of material and is released on the Friday of that week. For example, Problem Set 1 is based on material in Week 1: it will be released on the Friday of Week 1, and will be due at 17:00 (Melbourne time) on the Friday of Week 2.
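The date arithmetic behind this schedule can be sketched in a few lines of Python. This is an illustration only: it assumes that every Problem Set follows the same pattern as the Problem Set 1 example (released on the Friday of its week, due at 17:00 on the following Friday), it treats times as Melbourne local time without modelling the timezone, and the function name `problem_set_dates` is made up for this sketch. Always check Moodle for the actual dates.

```python
from datetime import date, datetime, time, timedelta

WEEK1_MONDAY = date(2022, 2, 21)  # "Week 1 starts 2022 February 21" (a Monday)
DUE_TIME = time(17, 0)            # 17:00 Melbourne local time (timezone not modelled)


def problem_set_dates(week: int) -> tuple[date, datetime]:
    """Return (release date, due datetime) for the Problem Set on `week`.

    Assumes the Problem Set 1 pattern holds for every week.
    """
    if week < 1:
        raise ValueError("week numbers start at 1")
    monday = WEEK1_MONDAY + timedelta(weeks=week - 1)
    release = monday + timedelta(days=4)  # Friday of the same week
    # Due on the Friday of the following week at 17:00.
    due = datetime.combine(release + timedelta(weeks=1), DUE_TIME)
    return release, due


release, due = problem_set_dates(1)
print(f"Problem Set 1: released {release:%A %d %B %Y}, due {due:%A %d %B %Y %H:%M}")
```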

If the due dates for assessments in this class conflict with major assessments for other honours classes, please let me know as soon as you become aware of the conflict. If the class tells me well in advance then I will consider giving a class-wide extension on the assessment.
