A Probabilistic Analysis of
Frequent
Itemset Mining with Noise
The increasing availability (and variability) of
large
scientific and commercial data sets has fueled research by computer
scientists
and statisticians in the field of data mining. A number of common
and
important data mining problems can be summarized in the following way:
given an
m x n data matrix X, identify all the submatrices of X satisfying some
criterion. In the frequent itemset problem, X is binary and we
are
interested in finding large submatrices of 1s.
In this talk we will address several statistical questions related to
frequent
itemset mining and related algorithms. The first concerns the
significance of submatrices found by frequent itemset mining. The
second
concerns the noise sensitivity of frequent itemset mining, a problem
that has
not received much attention in the computer science literature.
Finally,
we will discuss a new, noise tolerant criterion for finding frequent
itemsets,
and will examine its use in recovering small submatrices of 1s in the
presence
of noise.
The talk is intended for a general audience and does not assume any
prior
knowledge of data mining.
This is joint work with Xing Sun.