Fall Seminar Series  November 29, 2007
University of Minnesota
School of Statistics
College of Liberal Arts

Recent Advances in Clustering with Applications

Jia Li
Department of Statistics
The Pennsylvania State University

Thursday, November 29, 2007
3:30 PM, 115 Ford Hall
Minneapolis, East Bank Campus
Social at 3:00 PM, 300 Ford Hall

 

Abstract

Recent advances in clustering from two directions will be presented.  First, a new clustering approach based on mode identification and
kernel density estimation will be introduced.  A recently developed optimization algorithm, Modal EM (MEM), finds an
ascending path from an arbitrary point to a local maximum (mode) of a density in the form of mixture distributions.  A cluster is 
formed by those sample points that ascend to the same mode of the density function.  This method is then extended for hierarchical 
clustering by recursively locating modes of kernel density estimators with increasing bandwidths.  In mode-based clustering, the 
role of mixture modeling is confined to density estimation (rather than also capturing clusters at the same time), and hence the result
is more robust when clusters deviate substantially from Gaussian distributions.  The study of the geometric characteristics of mixture
distributions is further deepened by an algorithm called Ridgeline EM (REM), which efficiently solves for the ridgeline between the
density bumps of two clusters.  Theoretical properties of the ridgeline make it powerful and convenient for diagnosing clustering 
results and quantifying the separability between clusters.
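
The mode-ascent idea can be illustrated with a minimal sketch. It assumes a Gaussian kernel density estimate with equal weights and a single spherical bandwidth, in which case each MEM iteration reduces to a posterior-weighted average of the sample points. The function names (mem_ascend, modal_clustering), the tolerances, and the toy data are hypothetical choices made for illustration and are not taken from the talk.

```python
import numpy as np

def mem_ascend(x, data, bandwidth, tol=1e-8, max_iter=500):
    """Ascend from x to a local mode of a Gaussian kernel density estimate.

    Assumption: equal-weight spherical Gaussian kernels centered at the rows
    of `data`, so the M-step has a closed form (a weighted mean).
    """
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        # E-step: posterior probability of each kernel component at x
        sq_dist = np.sum((data - x) ** 2, axis=1)
        log_w = -0.5 * sq_dist / bandwidth ** 2
        log_w -= log_w.max()          # stabilize the exponentials
        p = np.exp(log_w)
        p /= p.sum()
        # M-step: maximize sum_i p_i * log f_i(x), i.e. take the weighted mean
        x_new = p @ data
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def modal_clustering(data, bandwidth, merge_tol=1e-3):
    """Group sample points by the mode that each one ascends to."""
    data = np.asarray(data, dtype=float)
    modes = []
    labels = np.empty(len(data), dtype=int)
    for i, point in enumerate(data):
        mode = mem_ascend(point, data, bandwidth)
        for k, known in enumerate(modes):
            if np.linalg.norm(mode - known) < merge_tol:
                labels[i] = k         # same mode as an earlier point
                break
        else:
            modes.append(mode)        # a mode not seen before
            labels[i] = len(modes) - 1
    return labels, np.array(modes)

# Toy data (hypothetical): two tight, well-separated groups in the plane
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])
data = np.vstack([offsets, offsets + 5.0])
labels, modes = modal_clustering(data, bandwidth=0.5)
```

Rerunning modal_clustering on the modes found at one bandwidth, using a larger bandwidth, gives one simple way to imitate the hierarchical extension described above; that recipe is likewise an assumption of this sketch.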
 
In the second part of the talk, we consider clustering objects represented by sets of weighted vectors rather than by single vectors.  Weighted
vector sets are formulated as discrete distributions with finite but arbitrary support.  A new clustering algorithm, namely D2-clustering 
(D2 stands for discrete distribution), is developed using linear programming to minimize the sum of Mallows distances between sample 
points and their corresponding cluster centroids.  Combined with a generalized mixture modeling method based on the concept of 
hypothetical local mapping, D2-clustering is applied to real-time image annotation and is the core of ALIPR, an online automatic 
image tagging system.
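
The distance underlying D2-clustering can be sketched as a small linear program. The example below is an illustration only: it assumes a squared Euclidean ground cost, it uses SciPy's linprog as the solver, and the function name mallows_distance is hypothetical. It computes the distance between two discrete distributions; the centroid update that D2-clustering also requires is not shown.

```python
import numpy as np
from scipy.optimize import linprog

def mallows_distance(xs, p, ys, q):
    """Mallows distance between two discrete distributions.

    xs (m x d) and ys (n x d) hold the support vectors; p and q hold their
    weights, each summing to 1. Assumption: squared Euclidean ground cost.
    Returns the distance and the optimal m x n matching weights.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m, n = len(p), len(q)

    # Cost of moving one unit of mass from xs[i] to ys[j], flattened row-wise
    cost = ((xs[:, None, :] - ys[None, :, :]) ** 2).sum(axis=2).ravel()

    # Marginal constraints: row sums of w equal p, column sums equal q
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([p, q])

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return float(np.sqrt(max(res.fun, 0.0))), res.x.reshape(m, n)

# Usage: two equally weighted points against a single point between them
dist, plan = mallows_distance([[0.0], [2.0]], [0.5, 0.5], [[1.0]], [1.0])
```

Here each half of the mass moves a distance of 1, so the optimal squared cost is 0.5 * 1 + 0.5 * 1 = 1.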