Data analysis and machine learning

\[ \newcommand\vectheta{\boldsymbol{\theta}} \newcommand\vecdata{\mathbf{y}} \newcommand\model{\boldsymbol{\mathcal{M}}} \newcommand\vecx{\boldsymbol{x}} \newcommand{\transpose}{^{\scriptscriptstyle \top}} \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\cov}{\mathbf{C}} \]

Clustering is the process of grouping data points together based on some metric: the proximity between data points, changes in density, or something else. There are many different clustering algorithms around, and knowing which one is suitable for your problem can be difficult. This is partly because when you have data that you want to cluster, you often don't have a generative model for what produced the data. If you knew all your data were drawn from a mixture of Gaussians, then a Gaussian mixture model would be the correct model to use for clustering. But if you have no generative description of your data, how are you to choose which clustering algorithm is appropriate? Today we will cover a few of the most common approaches.

You might be getting the feeling that you can make up any clustering algorithm that you want. You can! The general ingredients of a full-assignment clustering algorithm (see below for definitions of full assignment and partial assignment) are some distance-based (or density-based) measure for similarity, and some measure or heuristic for what makes things dissimilar. For partial-assignment clustering algorithms you will still need some proximity measure (in distance or density), which could vary per cluster, and you will need to assign partial responsibility to every data point.

Gaussian Mixture Models

Gaussian mixture models are an example of a clustering algorithm. You know this already from what we covered in Gaussian mixtures, generative models, and the Expectation-Maximization algorithm. Even when you don't have a good generative model for your data, Gaussian mixtures can be a defensible model of choice (i.e., a good one to try).
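To make the idea of partial assignment concrete, here is a minimal sketch using scikit-learn's `GaussianMixture` on invented 1-D toy data (the data and variable names are ours, not from the notes):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated 1-D Gaussian blobs (toy data for illustration).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# predict_proba returns the *partial* responsibility of each component
# for each data point -- each row sums to 1.
resp = gmm.predict_proba(X)
assert np.allclose(resp.sum(axis=1), 1.0)
```

Each data point is partially described by every component, which is exactly the partial-assignment behaviour discussed below.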

K-Means

K-Means is the simplest clustering algorithm that exists. You can think of K-Means as being like a Gaussian mixture model, except all of the covariance matrices of the mixtures (or variances, if you only have one dimension) are forced to be the same. It would be like performing Expectation-Maximization but forcing every Gaussian component to have the exact same variance at every step.

More specifically, K-Means assumes that every cluster is convex and isotropic. Convex in that any line drawn between two points in the cluster stays inside the cluster (a blob-like shape with no indentations), and isotropic in that the cluster is not elongated in any dimension. This is often not the case in reality. As a result, K-Means performs poorly on highly covariant (elongated) clusters, or on groups with irregular shapes.

K-Means is a full-assignment algorithm. Remember when we covered Gaussian mixture models, and we allowed for the possibility that one data point could be 90% described by one component density and 10% described by another component density? That is an example of partial assignment, where data points can be partially described by multiple components. K-Means is a full-assignment algorithm, which means that every data point must be fully described by one -- and only one -- cluster.

The K-Means algorithm tries to group the data \(\vec{x}\) into \(K\) different clusters, each of which has a mean \(\boldsymbol{\mu}_k\). Because K-Means is a full-assignment algorithm, there are no mixing proportions that we need to solve for like we did in Gaussian mixture models. And because all components share the same variance, we don't need to solve for those either! But we do need to solve for the cluster means (centroids). K-Means aims to find the centroids that minimise the so-called inertia, or within-cluster sum of squares: \[ \sum_{i=1}^{N}\min_{k \in \{1,\dots,K\}}\left(\left\|\vec{x}_i - \boldsymbol{\mu}_k\right\|^2\right) \]
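The inertia above can be computed directly: take each data point's squared distance to its nearest centroid, and sum over all points. A small NumPy sketch (the toy points and centroids are invented for illustration):

```python
import numpy as np

def inertia(X, centroids):
    # Squared distances from every point to every centroid: shape (N, K).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    # Each point contributes its squared distance to the *nearest* centroid.
    return d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
mu = np.array([[0.0, 0.5], [10.0, 10.0]])
# First two points are 0.25 from mu[0]; the last sits exactly on mu[1].
print(inertia(X, mu))  # 0.5
```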

How the algorithm takes steps to minimise this quantity is not relevant here. What is relevant is that K-Means will always converge to some solution, but that solution may be a local minimum. That means your choice of initialisation matters! People often run K-Means multiple times from different initialisation points and keep the best solution. And it is common to initialise the problem as well as you can, using something like the K-Means++ seeding scheme.
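In scikit-learn's `KMeans` both of these choices are exposed as keywords: `n_init` sets how many different initialisations are tried, and `init="k-means++"` selects the smart seeding. A sketch on invented toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two compact blobs centred near (0, 0) and (5, 5) (toy data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])

# n_init: number of independent initialisations; the best run (lowest
# inertia) is kept. init="k-means++" uses the K-Means++ seeding.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# The fitted centroids land near the true blob centres.
centers = sorted(km.cluster_centers_[:, 0])
```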

Every clustering algorithm has hyper-parameters, and the most important one for K-Means is the number of clusters. That's right: you have to specify the number of clusters before running K-Means. That is to say, you need both your data and a decision on how many clusters there are in your data before you can run the algorithm. When the number of clusters \(K\) is not known, people usually run K-Means with many different values of \(K\), and use some heuristic to decide which is most suitable.
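One common heuristic is the so-called 'elbow' method: run K-Means over a range of \(K\), record the inertia for each, and look for where the improvement flattens out. Inertia always decreases as \(K\) grows, so the elbow is a judgement call, not a rule. A sketch on invented toy data with three true clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three tight blobs along the diagonal (toy data for illustration).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 4, 8)])

# Inertia for K = 1..6; the drop from K=2 to K=3 is large, after
# which further increases in K give only marginal improvement.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
```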

DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a popular density-based clustering algorithm. This differs in category from K-Means, which is a centroid-based clustering algorithm. DBSCAN is a full-assignment algorithm, and tries to group data points together into clusters of high density that are separated by regions of low density. This is a generalised approach that does not make assumptions about the shape of the clusters (unlike K-Means). 'Density' here is based on some metric of proximity between data points (a distance measure), such that data points close to a dense core are assigned to a cluster. There are two relevant hyper-parameters in DBSCAN: the minimum number of samples (the number of neighbouring points needed for a point to count as a 'core' point of a cluster -- terminology sucks), and the neighbourhood radius \(\epsilon\) within which points count as neighbours. The minimum number of samples is intuitive, but a suitable \(\epsilon\) will depend on the values and dimensionality of the data set. So it can be different for every problem!

One feature of DBSCAN is that if there are data points that are not close enough to any cluster (by the distance measure), then those data points are not clustered at all: they are considered outliers. That is good in situations where you expect there to be clustering in your data, but you do not assume that every single data point was drawn from some mixture of clusters! The downside to this feature is that data points at the edge of a true cluster can be incorrectly marked as outliers instead of being assigned to the cluster. That is to say, you should use DBSCAN to identify the locations of your clusters, but do not take the cluster assignments as final truth: you should examine whether the density bounds for the clusters found are suitable for your problem at hand.

DBSCAN starts by picking a random point in the data set, examining the points closest to it in order, and assigning them to a cluster if they meet the minimum density requirements. Points that are clustered together but far away from the initial point will be added to a new cluster. Points that are far away from all clusters (in density space) will not be assigned to any cluster.
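To make the hyper-parameters and the outlier behaviour concrete, here is a minimal sketch using scikit-learn's `DBSCAN` on invented toy data (two dense blobs plus one far-away point):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
# Two dense blobs, plus one far-away point that should become an outlier.
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(5, 0.2, (50, 2)),
               [[20.0, 20.0]]])

# eps: the neighbourhood radius; min_samples: number of points needed
# within eps for a point to count as a 'core' point of a cluster.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Points that belong to no cluster are labelled -1 (noise/outliers).
print(db.labels_[-1])  # -1: the far-away point is marked as noise
```

Note that the number of clusters is *not* specified in advance; it falls out of the density structure of the data.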

Hierarchical Clustering

Hierarchical clustering is a generic description of algorithms that successively try to cluster data points together, and then sub-split those clusters into smaller families. If used successfully, this can provide a kind of hierarchy of clusters, which could have practical interpretations for your data set. One representation of the results from a hierarchical clustering algorithm would be something like a tree, or a dendrogram, representing the successive hierarchy.

That's a generic description of hierarchical clustering as a topic. To implement it in practice you could either go from top to bottom (divisive: start with one cluster containing everything and successively split it), or from bottom to top (agglomerative: start with every point as its own cluster and successively merge them).

There are algorithms for both scenarios. Here we will introduce an agglomerative clustering approach (bottom-to-top). Some common linkage criteria for deciding which clusters to merge are Ward, maximum linkage, average linkage, and single linkage. Ward is a bit like K-Means in that it tries to minimise the within-cluster variance, merging the pair of clusters that gives the smallest increase in the sum of squared differences. Maximum linkage merges the pair of clusters with the smallest maximum distance between their observations. Average linkage merges the pair with the smallest average distance between their observations. Single linkage merges the pair with the smallest distance between their closest observations. For your problem, none of these criteria might be ideal, but all of them might be sufficient.
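In scikit-learn these criteria are selected through the `linkage` keyword of `AgglomerativeClustering` (where `"complete"` is the maximum linkage described above). A sketch on invented, well-separated toy data, where all four criteria happen to agree:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated blobs (toy data for illustration).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(6, 0.3, (40, 2))])

# The linkage keyword picks the merge criterion: Ward, complete
# (maximum), average, or single linkage.
for linkage in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    # On such clean data, each blob ends up in exactly one cluster.
    assert len(set(labels[:40])) == 1 and len(set(labels[40:])) == 1
```

On messier data the four criteria can disagree substantially, which is the point of the comparison above.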

There are millions of other clustering algorithms

Here we have introduced you to the basic idea of clustering, including things like partial assignment, full assignment, how some different algorithms work, and general caveats about how no clustering algorithm is perfectly suited to your particular problem. You will have to make decisions where there is no right answer! I would encourage you (and require you, in the problem set) to explore different clustering algorithms and their performance on a variety of data sets. This will help give you intuition for the different methods available.

Summary comments

There are many clustering algorithms available, and hundreds more are not mentioned here. If you find yourself in a situation where one clustering algorithm is clearly the right one for your problem, then you are in a very lucky state: that does not often happen.

Sometimes you can't cluster in the data space because it is too large. Then you need to do dimensionality reduction first (e.g., PCA) and cluster in the space of the first \(L\) eigenvectors. Or you may want to perform some transformation of the data (e.g., a Fourier transform) and cluster in that space. This is commonly done for edge detection and other signal-processing tasks. So whether clustering is relevant, correct, or even suitable for your purposes will depend on what exactly you want to achieve.
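As a sketch of that workflow (all data invented for illustration): project high-dimensional data onto its first few principal components, then run K-Means in the reduced space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Two clusters that live in a 3-D subspace, embedded in 100 dimensions.
base = np.vstack([rng.normal(0, 0.5, (60, 3)), rng.normal(4, 0.5, (60, 3))])
X = base @ rng.normal(size=(3, 100))  # linear embedding into 100-D

# Reduce to the first L = 3 principal components, then cluster there.
Z = PCA(n_components=3).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```

Clustering in the reduced space recovers the two groups while avoiding the cost (and distance-concentration problems) of the full 100-dimensional space.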

Usually a practitioner will have to ask which algorithm, and which hyper-parameters, are appropriate for the problem at hand.

In the next class we will introduce you to neural networks, from scratch.

← Previous class
Dimensionality reduction
Next class →
Neural networks I


The comparison of hierarchical clustering distance measures came from the Scikit-learn documentation.