Penalized Clustering of Large
Scale Functional Data with Multiple Covariates
Large scale longitudinal data with repeated measurements over a number
of time
points arise from many scientific investigations. A typical example is
that of
temporal gene expression studies, in which a series of microarray
experiments
are conducted sequentially during a biological process. At each time
point,
mRNA expression levels of thousands of genes are measured
simultaneously.
Collected over time, a gene's "temporal expression profile" gives the
scientist
some clues about the role this gene might play. Genes with
similar
profiles are often "co-regulated" or participate in a common and
important biological function. Many clustering techniques have thus
been
applied to reveal the cluster information, which is a crucial first
step toward
deciphering the underlying mechanism.
In addition to the time factor, such longitudinal data often contain
many covariates,
e.g., replicates at each time point, species in comparative genomics
studies,
treatment groups in case-control studies, and the factors
in
factorially designed experiments.
However, very few currently available clustering methods take
all these
factors into account. Moreover, the computational cost of these methods is very
high
for large scale data.
To overcome these obstacles, we propose a penalized clustering method
for large
scale data with multiple covariates using a functional data
approach.
Simulation studies and real-data examples are presented to investigate
the
empirical performance of the proposed method.