Fall Seminar Series  October 19, 2006
University of Minnesota
School of Statistics
College of Liberal Arts

A Probabilistic Analysis of Frequent Itemset Mining with Noise

Andrew Nobel
Department of Statistics and Operations Research
University of North Carolina

Thursday, October 19, 2006
3:30 PM, 115 Ford Hall
Minneapolis, East Bank Campus
Social at 3:00 PM, 300 Ford Hall


Abstract

The increasing availability (and variability) of large scientific and commercial data sets has fueled research by computer scientists and statisticians in the field of data mining.  A number of common and important data mining problems can be summarized in the following way: given an m x n data matrix X, identify all the submatrices of X satisfying some criterion.  In the frequent itemset problem, X is binary and we are interested in finding large submatrices of 1s.

In this talk we will address several statistical questions related to frequent itemset mining and related algorithms.  The first concerns the significance of submatrices found by frequent itemset mining.  The second concerns the noise sensitivity of frequent itemset mining, a problem that has not received much attention in the computer science literature.  Finally, we will discuss a new, noise tolerant criterion for finding frequent itemsets, and will examine its use in recovering small submatrices of 1s in the presence of noise.

The talk is intended for a general audience and does not assume any prior knowledge of data mining.

This is joint work with Xing Sun.