All you need to know about Clustering

Why have I selected clustering?
Clustering is my favourite method: supervised learning offers many algorithms, but unsupervised learning has relatively few. Clustering also has a wider variety of use cases (grouping, recommendation, image compression and segmentation, medical imaging, anomaly detection, etc.) than most other ML algorithms, and that is what I like most about it.
Example of image compression using a coloured image of my own as input: I created 30 clusters from the input image and then used the cluster centroids to compress it.
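For illustration, here is a minimal sketch of that kind of K-means colour compression with scikit-learn and Pillow; the file names are placeholders, and 30 clusters matches the example above.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load the image and flatten it to a (n_pixels, 3) array of RGB values.
img = np.asarray(Image.open("input.png").convert("RGB"))
pixels = img.reshape(-1, 3)

# Cluster the colours into 30 groups.
kmeans = KMeans(n_clusters=30, n_init=10, random_state=42).fit(pixels)

# Replace every pixel with its cluster centroid to compress the palette.
compressed = kmeans.cluster_centers_[kmeans.labels_].astype(np.uint8)
Image.fromarray(compressed.reshape(img.shape)).save("compressed.png")
```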
What is clustering and its variants?
Clustering is an unsupervised technique used to draw inferences from unlabeled data. It finds data elements that are similar to each other and identifies the regions of the space where the data is concentrated. There is no error or reward signal with which to evaluate a solution; this property distinguishes unsupervised learning from supervised and reinforcement learning. Clustering algorithms use distance-based measures (Euclidean, Manhattan, etc.) to form the clusters, so we need to standardise the data before applying them.
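A minimal sketch of that preprocessing step; the toy features (age and income, on very different scales) are made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (age vs. income).
X = [[25, 40_000], [32, 95_000], [47, 52_000], [51, 120_000]]

# Standardise so that no single feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)
```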
Profiling clusters based on business knowledge is a crucial step. For example, suppose you have a list of customers that you need to divide into 3 segments to define marketing strategies, with cluster names such as high-, medium-, and low-band customers. You would need business knowledge to profile each cluster and decide on a suitable business name for it.
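A hedged sketch of such profiling with pandas; the column names, values, and the spend-based naming rule are assumptions for illustration:

```python
import pandas as pd

# Hypothetical customer data with a cluster label already assigned.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "annual_spend": [120, 150, 800, 950, 2400, 2100],
})

# Profile each cluster, then map it to a business-friendly name
# based on its average spend (the rule here is illustrative).
profile = df.groupby("cluster")["annual_spend"].mean().sort_values()
names = ["low band", "medium band", "high band"]
cluster_to_name = {c: n for c, n in zip(profile.index, names)}
df["segment"] = df["cluster"].map(cluster_to_name)
print(df)
```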
There are four types of clustering approaches.
1. Connectivity methods
All data points start grouped in a single cluster and are then partitioned as the distance between them grows. Although these models are highly interpretable, they do not scale well to massive datasets. One example is hierarchical clustering.
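A minimal hierarchical (agglomerative) clustering sketch with scikit-learn, using toy data:

```python
from sklearn.cluster import AgglomerativeClustering

X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# Merge points bottom-up; here the tree is cut at 2 clusters.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print(labels)  # e.g., [1 1 1 0 0 0]
```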
2. Centroid methods
These are iterative clustering methods in which similarity is determined by a data point's proximity to the cluster's centroid. K-means and its variants, such as K-medians and K-medoids, are examples.
3. Distribution Methods
This strategy is based on how likely it is that all data points in a cluster belong to the same distribution (for example, a Gaussian). An example is the expectation-maximization (EM) algorithm.
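For instance, scikit-learn's GaussianMixture fits such a model with the EM algorithm; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian blobs.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# EM alternates between soft assignments and distribution updates.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)          # estimated cluster centres
print(gmm.predict(X[:5]))  # hard cluster assignments
```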
4. Density Methods
These models isolate regions of high density and then group the data samples within each region into the same cluster. Density-based models such as DBSCAN and OPTICS are widely used.
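A minimal DBSCAN sketch; eps and min_samples are illustrative values that would normally be tuned:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# Points in dense regions get a cluster id; sparse points get label -1 (noise).
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)  # [ 0  0  0  1  1 -1]
```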
Cluster Evaluation
The tricky part of clustering is finding the optimal number of clusters and assessing their quality, because the input data is unlabeled. Below is a list of methods that can help us validate the clusters.
1. Elbow method
The elbow approach estimates a good number of clusters using the sum of squared errors (SSE): the sum of squared distances between data points and the centroids of their assigned clusters. We choose k at the point where the SSE curve begins to flatten and forms an elbow.
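A sketch of the elbow method using KMeans's inertia_ attribute (which is exactly this SSE) over a range of k values, on synthetic data:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Compute SSE (inertia) for each candidate k and look for the elbow.
ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.show()
```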
2. Silhouette analysis
The degree of separation between clusters can be measured via silhouette analysis. The silhouette coefficient takes values in [-1, 1]:
• A value near 0 indicates the sample lies on or very close to the boundary between neighbouring clusters.
• A value near 1 indicates the sample is far from the neighbouring clusters, i.e., well clustered.
• A value near -1 indicates the sample has likely been assigned to the wrong cluster.
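A minimal sketch of scoring several candidate k values with the average silhouette coefficient, on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# A higher average silhouette suggests better-separated clusters.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```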
3. Business-level validation
After forming the clusters, we work with stakeholders to validate them, either by sharing random samples from each cluster for review or by profiling each cluster and sharing the statistics with the business stakeholders.
When to use what
Use K-means when the number of clusters is known from the business problem definition or can be estimated. It is suitable for large datasets and works well when clusters are globular or spherical in shape.
Use hierarchical clustering when the data is hierarchical in nature or nested clusters are the target, or when the number of clusters is not known in advance. However, it does not scale to big data due to its computational complexity.
Use DBSCAN when the shape of the clusters is unknown. For example, if you want to cluster logs, DBSCAN performs well compared to K-means. It is also suitable for data with noise and outliers, as it can easily identify such points.
OPTICS combines ideas from hierarchical and DBSCAN clustering, and it is useful for hierarchical density-based clustering of large datasets. I have used OPTICS to group nearby geolocations using geographic distance, and it performed well compared to K-means and DBSCAN.
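A hedged sketch of that geolocation use case with scikit-learn's OPTICS and the haversine metric (coordinates must be converted to radians; the points and the 50 km radius are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import OPTICS

# Illustrative (latitude, longitude) pairs in degrees.
coords_deg = np.array([
    [40.7128, -74.0060], [40.7150, -74.0010],  # around New York
    [51.5074, -0.1278],  [51.5100, -0.1300],   # around London
])

# The haversine metric expects radians; max_eps of ~50 km expressed
# as an angle on a sphere of Earth's radius.
coords_rad = np.radians(coords_deg)
earth_radius_km = 6371.0
optics = OPTICS(min_samples=2, metric="haversine",
                max_eps=50 / earth_radius_km)
print(optics.fit_predict(coords_rad))  # e.g., [0 0 1 1]
```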
Future with deep learning and GenAI
Over the past 3 to 4 years, there has been strong research on clustering with deep learning. Below is a list of methods.
1. Autoencoders for Clustering
Autoencoders are neural networks trained to reproduce their input. By learning a compressed representation of the data, they can discover underlying patterns that are useful for unsupervised clustering.
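A compact PyTorch sketch of the idea: train an autoencoder on reconstruction loss, then run K-means on the learned codes. The architecture, layer sizes, and random placeholder data are illustrative assumptions:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

torch.manual_seed(0)
X = torch.randn(500, 20)  # placeholder data with 20 input features

# Encoder compresses to a 2-d code; decoder reconstructs the input.
encoder = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 2))
decoder = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 20))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

# Train the network to reproduce its input.
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(decoder(encoder(X)), X)
    loss.backward()
    opt.step()

# Cluster in the learned latent space instead of the raw feature space.
with torch.no_grad():
    codes = encoder(X).numpy()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(codes)
```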
2. Deep Embedded Clustering (DEC)
This method combines deep learning and classical clustering by jointly optimizing the neural network's weights and the cluster assignments to increase clustering accuracy.
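The core of DEC is a soft assignment between embedded points and cluster centres using a Student's t kernel, sharpened into a target distribution that the network is trained to match via KL divergence. A NumPy sketch of those two steps, following the original DEC formulation with alpha = 1 (the random data is a placeholder):

```python
import numpy as np

def soft_assignments(z, mu, alpha=1.0):
    """DEC soft assignment q_ij between embeddings z and centres mu."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened targets p_ij; training minimises KL(P || Q)."""
    w = q ** 2 / q.sum(axis=0)  # emphasise confident assignments
    return w / w.sum(axis=1, keepdims=True)

z = np.random.rand(100, 2)   # embedded points (e.g., autoencoder codes)
mu = np.random.rand(3, 2)    # cluster centres
q = soft_assignments(z, mu)
p = target_distribution(q)
```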
3. Variational Autoencoders (VAEs)
VAEs are a type of generative model that learns latent representations of the data. Their learned latent spaces have been investigated for clustering.
4. Graph Neural Networks (GNNs) for Graph Clustering
GNNs learn representations of graph-structured data, which can then be applied to clustering problems on graph-based datasets.
5. Reinforcement Learning-based Clustering
Some research has investigated using reinforcement learning to improve clustering algorithms, for example by optimizing clustering objectives directly with reinforcement learning strategies.
Clustering performance has also been improved using GenAI and LLM models. Recently, three methods were compared (K-means, K-prototypes, and LLM embeddings + K-means), and LLM embeddings + K-means performed the best. In a nutshell, the idea is to convert structured data into text and then convert that text into embeddings. GenAI can thus help us capture complex features of the data, which can improve model accuracy.
According to the results, the K-means + embeddings model is also more efficient, since it needs fewer variables to produce accurate predictions.
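A hedged sketch of that LLM + K-means idea using the sentence-transformers library; the model name, the toy records, and the row-to-text template are all assumptions for illustration:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical structured records serialised into natural-language text.
rows = [
    {"age": 25, "city": "Pune", "spend": 120},
    {"age": 47, "city": "Delhi", "spend": 2400},
]
texts = [f"age {r['age']}, city {r['city']}, spend {r['spend']}" for r in rows]

# Embed the text with a pretrained encoder, then cluster the embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
embeddings = model.encode(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```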