The clustering problem
Given
Unlabeled data points; want to group similar points together into clusters.
Use
Summarizing data: important for generalization, for understanding future data points.
Supplying labels to unlabeled data; thence classifying new data-points. Eg: Face recognition using Eigenfaces. Also used to estimate the support of a distribution, and thence in novelty detection.
Criteria: Continuity vs compactness
Consider a starfish: the continuity criterion will identify the 5 arms as 5 clusters, but the compactness criterion will fail. Consider points produced from two Gaussians with distinct centers: the compactness criterion appears better.
Extensions
Coclustering data-points along with features.
Find Non-redundant clusterings
Aka disparate clusters. Want clusterings based on unrelated criteria. Eg: can classify people based on sex, race etc.
With background clutter
Consider clustering stars in the night sky to find galaxies. Important for clustering algorithm to ignore clutter.
Evaluation of clustering
In case the true labels are known, can compare the clustering against them; otherwise use internal criteria such as within-cluster scatter.
Challenges
Curse of dimensionality
As dimensions/ features grow, it becomes more difficult to group similar things together; there could be irrelevant or noisy features.
Use dimensionality reduction techniques.
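For instance, project to a low-dimensional subspace before clustering; a minimal sketch assuming scikit-learn (the toy data and the choice of 2 components are illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                 # high-dimensional points (toy data)
X_low = PCA(n_components=2).fit_transform(X)   # reduce to 2 dimensions
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)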
Number of clusters k
The actual number depends on the data. Choice of $k$ is often unclear a priori.
Some methods take $k$ as an input; others try to estimate it from the data.
Identify important clusters
Clusters which matter remain distinct at different levels of coarseness.
Cluster visualization
Look at clusters in 2D or 3D at varying levels of coarseness to identify important clusters.
Approaches
Views of the data
Visualize data points as points in feature space.
Or as ‘data points (X) vs features (Y)’ matrix A. In case features are binary (eg: word present or absent), get contingency table P(X, Y), an estimate of the joint probability matrix.
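A small sketch of this matrix view (hypothetical word-presence data; the point is the normalization into an estimate of the joint probability matrix):

import numpy as np

A = np.array([[1, 0, 1],         # rows: data points X; columns: binary features Y
              [0, 1, 1],
              [1, 1, 0]])
P = A / A.sum()                  # contingency table normalized to estimate P(X, Y)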
Density estimation
Parametric
Try to find distribution of data in the input space. Usually fit some multimodal distribution: the different modes define different cluster representatives.
Eg: Fit a mixture of Gaussians.
Advantage of having a parametric model: can extrapolate what the data will look like with more sampling.
Non-Parametric
Can do non-parametric density inference, then do density shaving: ignore points with low density, identify the modes.
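A minimal sketch, assuming a Gaussian kernel density estimate from scikit-learn and an arbitrary 10th-percentile shaving threshold:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])  # two modes
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_dens = kde.score_samples(X)                 # log density at each data point
keep = log_dens > np.quantile(log_dens, 0.10)   # shave off the lowest-density 10% of points
X_dense = X[keep]                               # remaining points concentrate around the modes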
Relative performance
Non-parametric methods usually perform better, given enough data points.
Centroid based vs agglomerative clustering
Described fully in another section.
Centroid based clustering methods are usually fast: the costliest step is often the assignment of points to clusters: $O(kn)$. Agglomerative methods usually involve comparison between all cluster-pairs for the purpose of agglomeration, so they are slower: at least $O(n^2)$.
Centroid based methods require
Agglomerative clustering
Bottom up approach. Start off with each data point in its own cluster; repeatedly merge the closest pair of clusters.
Intercluster metrics
Distance between cluster means (centroid linkage). Or between the closest pair of points (single linkage): tends to produce elongated clusters: clustering by the continuity criterion. Or between the farthest pair of points (complete linkage): clustering by the compactness criterion.
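A minimal sketch of these linkage choices with SciPy (the toy data and the cut into 2 clusters are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
Z_single = linkage(X, method="single")       # closest-pair distance: continuity criterion
Z_complete = linkage(X, method="complete")   # farthest-pair distance: compactness criterion
labels = fcluster(Z_complete, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters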
Centroid based clustering
Use centroids or cluster representatives to cluster.
Mean: Best cluster representative wrt Bregman divergence
Given a Bregman divergence $d_{\phi}$, the mean $\mu = \frac{1}{|C|}\sum_{x \in C} x$ minimizes the total divergence $\sum_{x \in C} d_{\phi}(x, c)$ over all representatives $c$: so the mean is the best cluster representative. Squared Euclidean distance is the Bregman divergence used by k-means.
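A quick check for the squared-Euclidean case (a worked step added here, not from the source): setting the gradient of the within-cluster cost to zero recovers the mean.
\[
\nabla_c \sum_{x \in C} \|x - c\|^2 \;=\; -2 \sum_{x \in C} (x - c) \;=\; 0
\quad\Longrightarrow\quad
c \;=\; \frac{1}{|C|} \sum_{x \in C} x .
\]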
k means clustering
Objective
As minimizing within cluster scatter
Find cluster assignments, equivalently centroids $\mu_1, \dots, \mu_k$, minimizing the within-cluster scatter $\sum_{j=1}^{k} \sum_{x \in X_j} \|x - \mu_j\|^2$.
As maximizing inter-cluster scatter
Use scatter matrices/ scalars: the total scatter $S_T = S_W + S_B$ is fixed by the data, so minimizing the within-cluster scatter $S_W$ is equivalent to maximizing the between-cluster scatter $S_B$.
Algorithm
Start off with $k$ initial centroids, eg: randomly chosen data points. Repeat: assign each point to its nearest centroid; recompute each centroid as the mean of the points assigned to it.
If the assignments no longer change, stop: the objective can decrease no further.
Time: $O(kndt)$ for $t$ iterations over $n$ points in $d$ dimensions: very fast.
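A minimal numpy sketch of this iteration (random initialization, no handling of empty clusters; illustrative, not from the source):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centroids: random points
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # squared distances, O(knd)
        labels = d2.argmin(axis=1)                                   # assignment step
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # mean step
        if np.allclose(new_centroids, centroids):                    # assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids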
As low rank factorization with alternating minimization
Let $X$ be the $d \times n$ matrix whose columns are the data points.
So, k-means is equivalent to solving $\min_{M,Z} \|X - MZ\|_F^2$, where $M$ is the $d \times k$ matrix of cluster centroids and $Z$ is a $k \times n$ assignment matrix whose columns are indicator (one-hot) vectors. Alternating minimization over $Z$ (the assignment step) and $M$ (the mean step) is exactly the k-means iteration.
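A quick numerical check of this equivalence (a toy illustration, not from the source): build the one-hot $Z$ for an arbitrary assignment and compare $\|X - MZ\|_F^2$ with the within-cluster scatter.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))                        # d x n data matrix, columns are points
labels = rng.integers(0, 3, size=100)                # an arbitrary assignment into k = 3 clusters
Z = np.eye(3)[labels].T                              # k x n one-hot assignment matrix
M = np.array([X[:, labels == j].mean(axis=1) for j in range(3)]).T  # d x k centroid matrix
frob = np.linalg.norm(X - M @ Z) ** 2
scatter = sum(((X[:, labels == j] - M[:, [j]]) ** 2).sum() for j in range(3))
assert np.isclose(frob, scatter)                     # the two objectives coincide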
Drawbacks and extensions
This is a greedy algorithm: it does local minimization of the objective function. Highly sensitive to initial conditions; can end up with empty clusters under bad initialization. So there are varied initialization strategies.
Fails to cluster data points arranged in concentric rings. So use the kernel trick here: compute the kernel matrix $K$ and run k-means in the implicit feature space (kernel k-means); squared distances from points to cluster means can be computed from entries of $K$ alone.
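A minimal kernel k-means sketch (my own illustration; the RBF kernel and its bandwidth are assumed choices, and whether the rings are recovered depends on the bandwidth):

import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    # Kernel k-means on a precomputed n x n kernel matrix K; no handling of empty clusters.
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)              # random initial assignment
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                continue
            # ||phi(x_i) - mean_c||^2 = K_ii - 2/|C| sum_j K_ij + 1/|C|^2 sum_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].sum(axis=1) / len(idx)
                          + K[np.ix_(idx, idx)].sum() / len(idx) ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two concentric rings that plain k-means cannot separate:
theta = np.linspace(0, 2 * np.pi, 200)
X = np.concatenate([np.c_[np.cos(theta), np.sin(theta)],
                    3 * np.c_[np.cos(theta), np.sin(theta)]])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 0.5)                                # RBF kernel; bandwidth is illustrative
labels = kernel_kmeans(K, k=2)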
With GMM
You model the observed data with a mixture of $k$ Gaussians having unknown means, covariances and mixing weights.
Start with some guessed parameters. Repeat the following steps iteratively: a] Assign each point to the component Gaussians in proportion to the posterior probability (responsibility) of each component having generated it; b] re-estimate each Gaussian's mean, covariance and mixing weight from the points softly assigned to it. This is the EM algorithm.
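A minimal sketch using scikit-learn's GaussianMixture (the toy data and the choice of a full covariance are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])  # two blobs
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
soft = gmm.predict_proba(X)      # per-component responsibilities for each point
hard_labels = gmm.predict(X)     # cluster = component with the highest responsibility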
Generalizing k-means
k-means corresponds to GMM clustering with each Normal distribution in the model constrained to being spherical with identical variances (and equal mixing weights), and with hard assignments: at each step you assign the point to the cluster with the nearest mean, i.e. the most probable component.
With non-negative matrix factorization
Let $X$ be the non-negative data matrix whose columns are the data points. Factor $X \approx WH$ with $W, H \geq 0$: the columns of $W$ act as cluster representatives, and each point can be assigned to the cluster with the largest coefficient in the corresponding column of $H$.
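A minimal sketch with scikit-learn's NMF (note scikit-learn's convention is rows = points, i.e. it factors the transpose of the matrix above; the data and $k = 3$ are illustrative):

import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(0).normal(size=(100, 20)))   # non-negative data, rows = points
nmf = NMF(n_components=3, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)       # (100, 3): per-point cluster coefficients
H = nmf.components_            # (3, 20): cluster representatives in feature space
labels = W.argmax(axis=1)      # assign each point to its dominant component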
Finding the initialization points
Bad initialization points can lead to bad clusters, good ones lead to good clusters. Density estimation useful here. \tbc
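One standard strategy is k-means++-style seeding, which spreads the initial centroids out; a rough sketch (a common heuristic, not necessarily the density-based initialization alluded to above):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    # Pick each new centroid with probability proportional to its squared
    # distance from the nearest centroid chosen so far.
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)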
Co-clustering
Cluster both the rows and columns of P simultaneously. Thus dealing with duplicate/ synonymous/ irrelevant features simultaneously while clustering.
Objective: Information loss minimizing
Find maps $C_X: X \to \hat{X}$ and $C_Y: Y \to \hat{Y}$ (row and column clusterings) minimizing the loss in mutual information $I(X;Y) - I(\hat{X};\hat{Y})$.
The monotonic optimizer
So, the loss in mutual information equals $D_{KL}(p(X,Y) \,\|\, q(X,Y))$, where $q(x,y) = p(\hat{x},\hat{y})\, p(x|\hat{x})\, p(y|\hat{y})$ is the approximation induced by the co-clustering.
Thence the information theoretic co-clustering algorithm: Start with an initial row and column clustering; alternately re-assign each row to the row cluster that minimizes its contribution to the KL divergence, then do the same for columns. Each step decreases the objective, so the algorithm converges (to a local optimum).
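A different co-clustering method (spectral co-clustering, not the information theoretic algorithm above) is available in scikit-learn; a minimal sketch on a stand-in matrix:

import numpy as np
from sklearn.cluster import SpectralCoclustering

P = np.random.default_rng(0).random((30, 20))   # stand-in for the data-point x feature matrix
model = SpectralCoclustering(n_clusters=3, random_state=0).fit(P)
row_clusters = model.row_labels_                # cluster label for each row (data point)
col_clusters = model.column_labels_             # cluster label for each column (feature)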
Using Graph clustering
For graph clustering methods, see graph theory ref.