UL — Clustering

Ganesh Walavalkar
Apr 23, 2020 · 3 min read


One of the classic unsupervised learning problems is clustering: given several points, along with relations (or distances) between them, create partitions such that points that are closer to each other, in terms of that relation or distance, reside in the same cluster.

Single Linkage Clustering (SLC)

SLC is a simple and intuitive clustering algorithm; it is a form of hierarchical agglomerative clustering (HAC). In its simplest form the algorithm is as follows (a minimal code sketch follows the list):

- consider each object a cluster of its own (so n objects give n clusters to start)
- define the inter-cluster distance (some form of relation) as the distance between the two closest points in the two clusters
- merge the two closest clusters
- repeat n - k times to end up with k clusters
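
A minimal sketch of these steps, assuming NumPy, Euclidean distances, and a from-scratch helper (the function name and the toy points are made up for illustration):

```python
import numpy as np

def single_linkage(points, k):
    """Naive single-linkage clustering: start with n singleton clusters and
    repeatedly merge the two clusters whose closest members are nearest,
    until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    # pairwise distances between all points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # inter-cluster distance: distance between the two closest members
                d = min(dists[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return clusters

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(single_linkage(points, k=2))  # -> [[0, 1], [2, 3, 4]]
```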

There are three variations of this algorithm, compared in the short SciPy example after this list:

  1. Single Link Clustering — the two clusters whose closest members have the smallest distance are merged at each iteration
  2. Complete Link Clustering — the two clusters that, when merged, give the cluster with the smallest diameter (the largest distance between any pair of its points) are merged
  3. Average Link Clustering — the two clusters with the smallest average distance between their points are merged
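
All three variants are also available in SciPy's hierarchical clustering module, so a quick way to compare them on the same toy points as above (the choice of 2 clusters is arbitrary) is:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])

for method in ("single", "complete", "average"):
    Z = linkage(points, method=method)               # build the merge hierarchy
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 flat clusters
    print(method, labels)
```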

K Means Clustering

The algorithm for K Means Clustering is as follows (a short code sketch follows the list):

- pick k centers (randomly)
- each center will claim its closest points
- recompute the centers by averaging the clustered points
- repeat until convergence
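
A bare-bones sketch of this loop, assuming NumPy and Euclidean distance (the function name, the stopping rule, and the toy data are my own choices for illustration):

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Plain k-means: pick k centers at random, let each center claim its
    closest points, recompute the centers as the mean of the claimed
    points, and repeat until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None

    for _ in range(max_iters):
        # each center claims its closest points
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # recompute the centers by averaging the clustered points
        for j in range(k):
            if np.any(labels == j):  # leave a center alone if it claimed nothing
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

centers, labels = k_means(np.random.default_rng(1).normal(size=(100, 2)), k=3)
print(centers)
```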

What is the difference between K Means and the Average Link Clustering we saw previously? K Means is a top-down approach, while Average Link Clustering is a bottom-up approach.

For K Means clustering, a fixed number of clusters must be specified, along with a distance metric. The K Means algorithm produces cluster centroids and an assignment of each point to a cluster.

Soft Clustering

As the name suggests, in soft clustering the assignment of a point to any given cluster is 'soft'. Consider a situation where points (observations) belong to one of two distinct classes. Clustering is done because we don't have labels assigned to the data set; however, if a few points (observations) plausibly belong to multiple clusters (labels), how do we decide which label is correct? In soft clustering we let such observations belong to both clusters, each with some probability. For example: is a virus a living entity or a dead one?

A detailed discussion is given here: https://www.jeremyjordan.me/gaussian-mixed-models/
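
As a concrete illustration of soft assignment, here is a small sketch using scikit-learn's GaussianMixture, the kind of Gaussian mixture model discussed in the link above (the one-dimensional toy data is made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# two overlapping 1-D groups
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 50),
                    rng.normal(3.0, 1.0, 50)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

# soft assignment: each point gets a probability of belonging to each cluster,
# rather than a single hard label
print(gmm.predict_proba([[1.5]]))  # a point between the groups: roughly [0.5, 0.5]
```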

Expectation Maximization

This algorithm is similar to K Means. Take randomly generated points across two dimensions. If two clusters are desired, the EM algorithm picks 2 of these points at random as initial centers. Then the algorithm runs iterations of Expectation and Maximization: first softly assign the observations to each of the selected centers, then recalculate the means to move those 2 centers toward the center of their newly assigned points. This process is iterated multiple times to get the desired clusters; a small sketch follows.
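
A minimal sketch of those two steps for a one-dimensional mixture of two Gaussians, fixing the variances to 1 and the mixture weights to be equal so that only the two means move (these simplifications, and the toy data, are mine):

```python
import numpy as np

def em_two_means(x, n_iters=50, seed=0):
    """EM for a mixture of two 1-D Gaussians with fixed unit variance and
    equal weights: pick two observations at random as initial centers,
    then alternate soft assignment (E-step) and mean updates (M-step)."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=2, replace=False)

    for _ in range(n_iters):
        # E-step: soft-assign each observation to each center
        dens = np.exp(-0.5 * (x[:, None] - means[None, :]) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: move each center to the responsibility-weighted average
        means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return means

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
print(em_two_means(x))  # the two means should end up near 0 and 4
```

Each E-step uses the current means to soft-assign the observations, and each M-step moves the means to the weighted average of their assignments, which is exactly the "assign, then recompute the means" loop described above.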

As multiple iterations of EM are run, the algorithm converges. K Means never gets worse on its error metric, so as long as there is a way to break ties, K Means will eventually stop. This may not happen in EM (it might get stuck, or only approach convergence asymptotically), because its configurations are probabilities. In practice this is rarely a problem, but when EM does get stuck, the solution is to randomly restart.

Clustering Properties

There are some desirable properties for clustering techniques/algorithms such as the ones we discussed above. They are as follows:

Richness — for any assignment of objects to clusters, there is some distance matrix D such that the clustering scheme P applied to D (written P_D) returns that clustering.

Scale invariance — Scaling the distances by a positive value does not change the clustering.

Consistency — Shrinking ‘intra-cluster’ distances and/or expanding ‘inter-cluster’ distances does not change the clustering.

Note — there is no clustering algorithm that can achieve all three of these properties at the same time.

Conclusion

We looked at various clustering algorithms (SLC, K Means, and EM) in some detail. We also looked at desirable clustering properties and the impossibility of achieving all of them at once.

References — Heavily borrowed from my notes on CS7641, so thanks to Prof Charles Isbell and Prof Michael Littman. Errors are all mine.
