Recent Advances in Clustering with Applications
This talk presents recent advances in clustering from two directions. First, a new clustering approach based on mode identification and kernel density estimation is introduced. A recently developed optimization algorithm, Modal EM (MEM), finds an ascending path from an arbitrary point to a local maximum (mode) of a density that takes the form of a mixture distribution. A cluster is formed by the sample points that ascend to the same mode of the density function. The method is then extended to hierarchical clustering by recursively locating the modes of kernel density estimators with increasing bandwidths. In mode-based clustering, the role of mixture modeling is confined to density estimation (rather than also capturing clusters at the same time), and hence the result is more robust when clusters deviate substantially from Gaussian distributions. The study of the geometric characteristics of mixture distributions is further deepened by an algorithm called Ridgeline EM (REM), which efficiently solves for the ridgeline between the density bumps of two clusters. Theoretical properties of the ridgeline make it powerful and convenient for diagnosing clustering results and quantifying the separability between clusters.

In the second part of the talk, we consider clustering objects represented by sets of weighted vectors rather than by single vectors. Weighted vector sets are formulated as discrete distributions with finite but arbitrary support. A new clustering algorithm, D2-clustering (D2 stands for discrete distribution), is developed; it uses linear programming to minimize the sum of Mallows distances between sample points and their corresponding cluster centroids. Combined with a generalized mixture modeling method based on the concept of hypothetical local mapping, D2-clustering is applied to real-time image annotation and is the core of ALIPR, an online automatic image tagging system.
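
To make the mode-ascent idea concrete, the following is a minimal sketch in Python, assuming a Gaussian kernel density estimate with a single spherical bandwidth. Under that assumption each MEM iteration reduces to a posterior-weighted average of the kernel centers. The function names `modal_em` and `mode_clustering`, the tolerances, and the toy data are illustrative choices and are not taken from the original work.

```python
import numpy as np

def modal_em(x0, centers, bandwidth, tol=1e-8, max_iter=500):
    """Ascend from x0 to a local mode of a Gaussian kernel density estimate.

    Assumption: equal-weight kernels sharing one spherical bandwidth.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # E-step: posterior probability of each kernel at the current point.
        sq_dist = np.sum((centers - x) ** 2, axis=1)
        log_w = -0.5 * sq_dist / bandwidth ** 2
        w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
        w /= w.sum()
        # M-step: with equal-covariance Gaussian kernels the maximizer
        # is the posterior-weighted mean of the kernel centers.
        x_new = w @ centers
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def mode_clustering(data, bandwidth, merge_tol=1e-3):
    """Group points by the mode they ascend to (hypothetical helper)."""
    modes = []
    labels = np.empty(len(data), dtype=int)
    for i, point in enumerate(data):
        m = modal_em(point, data, bandwidth)
        for j, known in enumerate(modes):
            if np.linalg.norm(m - known) < merge_tol:
                labels[i] = j
                break
        else:
            # No existing mode is close enough, so record a new one.
            modes.append(m)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)

# Toy data: two well-separated groups in the plane.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(-3.0, 0.5, size=(40, 2)),
    rng.normal(3.0, 0.5, size=(40, 2)),
])
labels, modes = mode_clustering(data, bandwidth=1.0)
```

Rerunning `mode_clustering` with a sequence of increasing bandwidths, starting each level from the modes found at the previous one, would give a rough analogue of the hierarchical extension mentioned above, since larger bandwidths smooth the density and merge nearby modes.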
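
The Mallows distance between two discrete distributions can itself be computed with a small linear program, which is the building block that D2-clustering optimizes over. The sketch below assumes a squared Euclidean ground cost (so the result coincides with the 2-Wasserstein distance) and uses SciPy's `linprog`. The function name `mallows_distance` and the example distributions are hypothetical, and the centroid update step of D2-clustering is not shown.

```python
import numpy as np
from scipy.optimize import linprog

def mallows_distance(xa, pa, xb, pb):
    """Mallows distance between two weighted vector sets.

    xa: (m, d) support vectors with weights pa (summing to 1).
    xb: (n, d) support vectors with weights pb (summing to 1).
    Assumption: squared Euclidean ground cost.
    """
    xa, xb = np.asarray(xa, dtype=float), np.asarray(xb, dtype=float)
    pa, pb = np.asarray(pa, dtype=float), np.asarray(pb, dtype=float)
    m, n = len(pa), len(pb)

    # Pairwise squared distances between support vectors, shape (m, n).
    cost = ((xa[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)

    # The matching weights w[i, j] are flattened row by row.
    a_eq = np.zeros((m + n, m * n))
    for i in range(m):
        a_eq[i, i * n:(i + 1) * n] = 1.0   # row i sums to pa[i]
    for j in range(n):
        a_eq[m + j, j::n] = 1.0            # column j sums to pb[j]
    b_eq = np.concatenate([pa, pb])

    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return np.sqrt(res.fun), res.x.reshape(m, n)

# Two distributions, each with two equally weighted support points.
xa, pa = [[0.0, 0.0], [1.0, 0.0]], [0.5, 0.5]
xb, pb = [[0.0, 1.0], [1.0, 1.0]], [0.5, 0.5]
dist, plan = mallows_distance(xa, pa, xb, pb)
```

Here the cheapest matching pairs each point with the one directly above it, so the optimal plan moves mass 0.5 along each vertical pair at squared cost 1.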