# Back to Basics with David Mackay #2: Fancy k-means

Following David Mackay’s book along with his videos online has been a real joy. In lecture 11, as an example of an inference problem, he goes over many variations of the k-means algorithm. Let’s check these out.

## The datasets

All the sample data used in his lectures and the book can be found on his website here. It also has octave code for the algorithms but we’ll implement them again in python.

## The simple k-means

This is now a classic data science interview question and you probably know how to do this in your sleep. Here’s my simple implementation of it. You can make this a lot more efficient by using numpy’s vectorized functions but I don’t bother with it here.

and here’s the code that runs the simulation which doesn’t change much between the different versions so we’ll skip it going forward.

We’re going to look at the 7 different datasets and see how this code does in picking up the clusters.

It does a fine job with the first ones but struggles with the rest. One problem is that we dichotomize the allocation for each point - it either belongs to cluster 1 or cluster 2. Some points may be ambiguous and we want to take that into account. Let’s fix that in the next section.

## Simple k-means with soft-thresholding

Now each cluster has a “responsibility” for each point. The total responsibility for a point adds up to 1. We do a softmax instead of the hard threshold in the assignment step. And when updating the cluster center, we weight the points based on this responsibility.

Note the new parameter beta that we now need to set. The higher the beta, the more hard the thresholding. In its limit, it approaches the basic k-means algorithm.

Here are the results with beta set to 2:

Do we do any better? Well, not really. Though we are able to identify the ones we are uncertain about - the circles in light-blue - we now have an additional parameter, beta, to set. The other limitation of the algorithm is that each cluster is assumed to be the same size. We see from the data that this is not always true. Let’s allow for variable sizes in the next section.

## K-means with variable cluster sizes

Now we are going to think of each cluster as a gaussian. They can be of varying sizes i.e. different variances. So the distance of each point to the gaussian center is the likelihood of observing that point given the gaussian parameters weighted by the cluster’s importance (pi). And what is importance? The total amount of relative responsibility of the cluster across all the points.

So we are still keeping the idea of responsibility around but look ma, no beta!. Our update and assignment steps now need to keep track of importance and also update the variance or size for each of the clusters.

The update step suddenly looks quite long. This is because, we need to handle the corner cases where a cluster has no importance or has just one point assigned to it, reducing the variance to almost zero. The algorithm falls over during those scenarios. We don’t change the cluster details if it has no importance and if sigma shrinks to zero, we set it to something small - 0.1 in this case. You may have more elegant ways of handling this.

So what did we get for all this hard work?

Good news is that all the ambiguity in the second and fourth dataset is now gone - we have a small cluster and a large cluster. Bad news is that there is a little more ambiguity about the some of the peripheral points in the first dataset since the blue cluster is now larger and more important. It still does a shitty job of the circles and the last dataset with skinny clusters.

We can’t do much about the circles unless we change coordinate systems / do some transformations and I’ll leave that out for now (another reason to use HDBSCAN or Spectral Clustering if this is a real life problem). But we can do something about the skinny clusters. So far we have kept the cluster round - i.e. the variance in the two dimensions are constrained to be equal. Let’s relax that in the next section.

I’m adding a try/except since the covariance matrix becomes invalid under some scenarios (the same ones as above). In these cases, we just pick a tiny round cluster.

The assignment step is exactly the same as before. So let’s look at the update step.

Here are the results:

Not too bad, I think. Datasets 2 and 4 look ok. Circle ones are still rubbish. The last one with long clusters is perfect. Dataset 5 is a bit weird, but sure! That is valid way of clustering.

## Conclusions

I cheated a little. EM algorithms tend to be sensitive to initial conditions. There is a lot of literature on how to handle that. If you are going to use this in real life for some insane reason, at least run it multiple times from different initial conditions to make sure it is stable. If not, take some sort of ensemble of all of them.

Here are all the results together.

As always you can find the notebook with the code online. You may want to try these again with three clusters.