
Abstract

Jun Liu, Computational Approaches to Gene Regulation

  1. Understanding how genes are regulated in various circumstances (e.g., heat shock, starvation) is a central problem in molecular biology. The adoption of large-scale biological data generation techniques such as mRNA microarrays has enabled researchers to tackle the gene regulation problem in a global way. I will survey some computational and statistical strategies developed by our group for using gene upstream sequence information in conjunction with mRNA expression microarray data to dissect the gene regulatory network. I will describe in detail a study of RacA binding activities in Bacillus subtilis, explaining how statistical approaches helped the biologists discover RacA's binding sites. I will also describe a new dimension reduction technique that has been applied successfully to our gene regulation studies and show a cute theorem supporting the technique.

Cheng Li, Allelic imbalance analysis of tumor samples using SNP microarrays

  1. Loss of heterozygosity (LOH) and copy number changes of chromosomal regions bearing tumor suppressor genes or oncogenes are key events in the evolution of epithelial and mesenchymal tumors. Identification of LOH regions usually relies on genotyping tumor and counterpart normal DNA and noting regions where heterozygous alleles in the normal DNA become homozygous in the tumor. However, paired normal samples for tumors and cell lines are often not available. With the advent of oligonucleotide arrays that simultaneously assay thousands of single-nucleotide polymorphism (SNP) markers, genotyping can now be done at high enough resolution to allow identification of LOH events by the absence of heterozygous loci, without comparison to normal controls. Here we describe a hidden Markov model-based method to identify LOH from unpaired tumor samples, taking into account SNP intermarker distances, SNP-specific heterozygosity rates, and the haplotype structure of the human genome. In addition, copy number analysis incorporating LOH will be discussed.
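As a rough illustration of the idea (not the authors' model), the sketch below decodes a two-state hidden Markov chain with the Viterbi algorithm. The state names, the prior `p_loh`, the genotyping error rate `err`, and the distance scale `scale` are all assumed values chosen for illustration, and the haplotype structure mentioned in the abstract is left out entirely. Heterozygosity rates are assumed to lie strictly between 0 and 1.

```python
import math

def viterbi_loh(calls, het_rates, positions, p_loh=0.5, err=0.01, scale=1e7):
    """Return the most likely state per SNP: 'L' (LOH) or 'R' (retention).

    calls     -- 'het' or 'hom' for each SNP in the tumor sample
    het_rates -- SNP-specific population heterozygosity rates (0 < rate < 1)
    positions -- increasing base-pair coordinates of the SNPs
    p_loh, err, scale are illustrative assumptions, not published values.
    """
    states = ("R", "L")
    prior = {"R": 1.0 - p_loh, "L": p_loh}

    def emit(state, call, het):
        # Under LOH a heterozygous call can only arise from genotyping error.
        p_het = err if state == "L" else het
        return p_het if call == "het" else 1.0 - p_het

    def trans(src, dst, dist):
        # The chance of restarting from the prior grows with marker distance.
        theta = 1.0 - math.exp(-dist / scale)
        stay = (1.0 - theta) if src == dst else 0.0
        return stay + theta * prior[dst]

    score = {s: math.log(prior[s]) + math.log(emit(s, calls[0], het_rates[0]))
             for s in states}
    back = []
    for i in range(1, len(calls)):
        dist = positions[i] - positions[i - 1]
        new_score, pointers = {}, {}
        for dst in states:
            best = max(states,
                       key=lambda src: score[src] + math.log(trans(src, dst, dist)))
            pointers[dst] = best
            new_score[dst] = (score[best] + math.log(trans(best, dst, dist))
                              + math.log(emit(dst, calls[i], het_rates[i])))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

A long run of homozygous calls at SNPs that are often heterozygous in the population pushes the decoded path into the LOH state, while short runs are absorbed by the retention state because switching states is penalized.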

Jun Yu, The Genetic Code: an evolutionary creation between the informational content and the functional content

  1. Abstract

Ming Yuan, Hidden Markov models for microarray time-course data

  1. Most statistical methods to analyze microarray time course data attempt to group genes sharing similar temporal profiles within a single biological condition. With time-course data in multiple conditions, a main goal is to identify differential expression patterns over time. A simple approach would be to consider each time point in isolation and combine results from repeated marginal analyses. However, doing so does not utilize the dependence structure. This can be a serious drawback, particularly for microarray studies where low sensitivity is observed for many methods. We propose a Hidden Markov modeling approach developed to efficiently identify differentially expressed genes and classify genes based on their temporal expression patterns. Simulation studies demonstrate a substantial increase in sensitivity, with little increase in the false discovery rate, when compared to a marginal analysis at each time point. This increase also is observed in data from a case study of the effects of aging on stress response in heart tissue, where a significantly larger number of genes are identified using the proposed approach.

Nick Johnson, An alternative to the moderated t statistic for ranking differentially expressed genes

  1. The moderated t statistic is widely used in two-group microarray experiments through software packages such as SAM. We propose an alternative statistic to the moderated t with the goal of producing more interesting top-end lists of differentially expressed genes. We try to simultaneously take into account the difference and ratio of group means as well as the statistical significance of the observed differences. Sorting is purposefully not done by p-value, although statistical significance is taken into account and false discovery rates at various list sizes can be estimated. We demonstrate improved ranking performance in simulations and assess the robustness of the statistic in real microarray data.

    This work is joint with Wing H. Wong.

David Ballard, Data imputation and power evaluation in genetical genomics

  1. Utilizing a genetic cross and expression profile data to identify genes that cause a disease has gained recent attention with genetical genomics. The first step to isolate candidate causal genes from the tens of thousands of possible genes is to separate out genes that attain a certain threshold of similarity to the disease trait. For example, an initial study uses the Pearson correlation coefficient of a gene expression trait and a quantitative disease trait. Unfortunately, missing data is the rule rather than the exception. We investigate the potential of using gene expression data to predict missing disease trait values, and using the imputed data to identify candidate genes.

    This work is joint with Hongyu Zhao.
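The screening step described above can be sketched as follows. This is a complete-case baseline only: individuals with a missing trait value (`None`) are dropped before computing the Pearson correlation, which is the situation the imputation approach is meant to improve on. The function names and the threshold of 0.8 are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_genes(expression, trait, threshold=0.8):
    """Return {gene: r} for genes with |r| >= threshold.

    expression -- dict mapping gene name to per-individual expression values
    trait      -- per-individual disease trait values, None where missing
    Only individuals with an observed trait value are used.
    """
    keep = [i for i, t in enumerate(trait) if t is not None]
    observed = [trait[i] for i in keep]
    hits = {}
    for gene, values in expression.items():
        r = pearson([values[i] for i in keep], observed)
        if abs(r) >= threshold:
            hits[gene] = r
    return hits
```

Every missing trait value shrinks the sample used for each correlation, which is why predicting the missing values from expression data may recover power.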

Ker-Chau Li, Trait-trait dynamic interaction: 2D-trait eQTL mapping for genetic variation study

  1. Many studies have shown that gene expression variation is inheritable. Analogous to the traditional genetic study, most researchers treat the variation in expression of a gene as a quantitative trait and map it to expression quantitative trait loci (eQTL). This common approach can be described as "one-dimensional-trait (1D-trait) mapping" because each trait is mapped separately. 1D-trait mapping ignores trait-trait interaction completely, which is a major shortcoming. To overcome this limitation, we introduce a novel concept of 2D-trait mapping. We report several applications that combine 1D-trait mapping with 2D-trait mapping, including the contribution of genetic variations to the perturbations in the regulatory mechanisms of yeast metabolic pathways.

    This talk is based on joint work with Wei Sun and Shinsheng Yuan.

Xuegong Zhang, Understanding lymph node metastasis in breast cancers: a case study of microarray data analysis

  1. Current molecular biology and systems biology are characterized by the rapid accumulation of high-throughput genomics data such as those from DNA microarrays. Typical investigations include the use of microarray data for the molecular classification of cancers, and for discovering the genes underlying the classification. As an example, microarray data have shown success in many investigations on breast cancers, such as the classification of BRCA1, BRCA2 mutation and sporadic breast cancers, estrogen receptor negative (ER-) vs. positive (ER+), good prognosis vs. poor prognosis of breast cancer patients, primary breast carcinoma vs. corresponding metastases tissue, prediction of distant metastases, etc.
    Breast cancer can metastasize to the regional lymph nodes, to distant organs such as lung, liver, and brain, or to both regional and distant sites. Molecular events that underlie breast cancer progression are incompletely understood. Regional spread to axillary lymph nodes is a powerful prognostic factor and one component in the spectrum of breast cancer metastasis. The most reliable predictors of regional nodal metastasis currently available are pathologic features of the primary tumor, including tumor size and the presence of lymphovascular invasion. In a sample of 129 breast cancer patients, we investigated whether molecular features of the primary cancer measured by Affymetrix U133plus2 microarrays predict lymph node status. A variety of statistical and machine learning approaches were taken in analyzing the data, and a number of features important in breast cancer were studied, including biomarkers (ER, Her2), histologic features (grade and lymphatic-vascular invasion or LVI), and stage parameters (tumor size and lymph node metastasis). The new understanding of the regional metastasis of breast cancers gained from this study will be presented in this talk, along with some suggestions drawn from this study on how bioinformatics methods could be better used in investigating complicated biological questions.

    This work was a collaboration with Dana-Farber Cancer Institute, Brigham & Women's Hospital, Harvard Medical School, and Harvard School of Public Health.

Katherine Pollard, Detecting Lineage-Specific Evolution

  1. Genomic regions that vary in their patterns of sequence conservation across a phylogeny are interesting candidates for the study of evolutionary shifts in function. We have developed two comparative genomic methods for detecting lineage-specific evolution on a genome-wide scale. The first approach, called DLESS, is based on a phylogenetic hidden Markov model (phylo-HMM), which does not require the lineage of interest or the element boundaries to be determined a priori. Applying DLESS to the ENCODE regions of the human genome, we detected differences in patterns of loss and gain of conserved elements between coding and non-coding regions and between vertebrate clades. DLESS has very little power, however, to identify changes in substitution rate on a single lineage. To address this question, we developed a second method that begins with a set of ancestrally conserved elements and applies a likelihood ratio test to screen these for the subset whose substitution rate is significantly higher in a lineage of interest. With this approach we identified 202 Human Accelerated Regions (HARs), which are highly conserved among mammals but show a significant increase in the rate of substitutions in the human genome since divergence from the chimp-human ancestor. Bioinformatic characteristics of the HARs suggest that many are involved in the regulation of gene expression. The most dramatically accelerated region, HAR1, is part of a novel RNA gene (HAR1F) that is expressed during human cortical development.

Lexin Li, Sufficient dimension reduction with application to microarray data analysis

  1. Sufficient dimension reduction can aid the analysis of high-dimensional microarray data by transforming the problems to low-dimensional projections. The curse of dimensionality is often alleviated, and informative data visualization may be enabled. In this talk, we start with an application of a hybrid of two long-standing dimension reduction methods, principal components analysis and sliced inverse regression, to a microarray survival data analysis. We also demonstrate how additional clinical information can be incorporated during the phase of dimension reduction. Such analyses have introduced new challenges to the methodology of sufficient dimension reduction as well, including the presence of highly correlated predictors, the small-n-large-p problem, and variable selection in the framework of dimension reduction. We then discuss some recently proposed regularized dimension reduction methods that address the above challenges. Some theoretical properties of the proposed methods will be explored, and the analysis of a microarray survival data set will underlie this line of methodology development.

Shinsheng Yuan, Context-dependent clustering for dynamic cellular state modeling of microarray gene expression

  1. Motivation: High-throughput expression profiling allows researchers to study gene activities globally. Genes with similar expression profiles are likely to encode proteins that may participate in a common structural complex, metabolic pathway, or biological process. Many clustering, classification and dimension reduction approaches, powerful in elucidating the expression data, are based on this rationale. However, the converse of this common perception can be misleading. In fact, many biologically related genes turn out uncorrelated in expression.

    Results: In this paper, we present a novel method for investigating gene co-expression patterns. We assume the correlation between functionally related genes can be strengthened or weakened according to changes in some relevant, yet unknown, cellular states. We develop a context-dependent clustering (CDC) method to model the cellular state variable. We apply it to the transcription regulatory study for Saccharomyces cerevisiae, using the Stanford cell-cycle gene expression data. We investigate the co-expression patterns between transcription factors (TFs) and their target genes (TGs) predicted by the genome-wide location analysis of Harbison et al. (2004). Since a TF regulates the expression of its TGs, correlation between the TF's and the TGs' expression profiles can be expected. But as many authors have observed, the expression of transcription factors does not correlate well with the expression of their target genes. Instead of attributing the main reason to the lack of correlation between the transcript abundance and TF activity, we search for cellular conditions that would facilitate the TF-TG correlation. The results for sulfur amino acid pathway regulation by MET4, respiratory gene regulation by HAP4, and branched-chain amino acid biosynthesis regulation by LEU3 are discussed in detail. Our method suggests a new way to understand the complex biological system from microarray data.

Wei Pan, A nonparametric empirical Bayes approach to joint modeling of DNA-protein binding data and gene expression data

  1. With the rapid accumulation of various high-throughput genomic and proteomic data, it has become compelling to develop new statistical methods that can take advantage of existing multiple sources of data. In our motivating example, a chromatin-immunoprecipitation (ChIP) microarray experiment was conducted to detect binding target genes of a broad transcription regulator, leucine responsive regulatory protein (Lrp) in E. coli. In addition, a cDNA microarray dataset is available to compare gene expression of the wild type with that of a mutant with the Lrp gene deleted in E. coli. It is biologically reasonable to assume that the genes with altered expression are more likely to be regulated by Lrp than those with no expression change. Hence we aim to borrow information in the gene expression data to increase statistical power to detect the binding targets of Lrp. We propose a novel joint model for such two sources of data, protein-DNA binding data and gene expression data; under mild modeling assumptions, it is shown that the method is optimal, equivalent to a joint likelihood ratio test. We compare the joint modeling with two existing methods of combining separate analyses. We adapt a nonparametric empirical Bayes (EB) method to draw statistical inference in the joint model; in particular, we propose a new method, maximum likelihood conditional on the binding data, to estimate two prior probabilities for the expression data, which are non-identifiable based on the expression data alone. We use simulated data to demonstrate the improved performance of the joint modeling over other approaches. Application to the Lrp data also shows better performance of the joint modeling than that of analyzing the binding data alone.

    This is joint work with Kyeong S. Jeong, Yang Xie and Arkady Khodursky.

Hongzhe Li, Statistical Methods for Network-Based Analysis of Genomic Data

  1. A central problem in genomic research is the identification of genes and pathways involved in diseases and other biological processes. Many methods have been developed for identifying genes in a regression framework. The genes identified are often linked to known biological pathways through gene set enrichment analysis in order to identify the pathways involved. However, most of the procedures for identifying the biologically relevant genes do not utilize the known pathway information. In this talk, I present a hidden Markov random field (HMRF)-based method for identifying genes and subnetworks that are related to diseases. Simulation studies indicated that the method is quite effective in identifying genes and subnetworks that are related to disease and has higher sensitivity and lower false discovery rates than the commonly used procedures that do not use the pathway structure information. Applications to two breast cancer microarray gene expression datasets identified several subnetworks on several of the KEGG transcriptional pathways that are related to breast cancer recurrence or survival due to breast cancer. Extension to analysis of time course gene expression data will also be discussed.

Hongyu Zhao, Data integration methods in reconstructing transcriptional regulatory networks

  1. Abstract

Yves Chretien, Meta-analysis for genome-wide association studies

  1. Given multiple genome-wide association studies of a single disease, how can the information provided by each about some marker be systematically pooled? Since different studies use different marker sets, combining information across studies is not a straightforward task. The HapMap provides a key piece of the puzzle by yielding estimates of the correlation structure between all markers genotyped in any of these studies. We propose a statistical framework for meta-analysis of several genome-wide association studies, using the HapMap, that allows inference about any given marker, regardless of whether or not that marker was genotyped directly in all of the studies.

Fengzhu Sun, The application of mixture models in the study of molecular networks

  1. Mixture models are widely used in many different fields. In this talk I will give two examples of using mixture models to study the statistical properties of molecular networks. The first is the estimation of the reliability of putative observed protein interaction data sets, and the second is the identification of network motifs in stochastic molecular networks.

    Rui Jiang, Zhidong Tu, Ting Chen, Fengzhu Sun (2006) Network motif identification in stochastic networks. Proc Natl Acad Sci USA 103:9404-9409.
    Deng MH, Sun FZ, Chen T (2003) Assessment of the reliability of protein-protein interactions and protein function prediction. Pacific Symposium on Biocomputing, PSB2003, 8:140-15.

Henry Horng-Shing Lu, On statistical investigation for large biological networks

  1. Is it possible to develop simplified models to gain insights for large and complex biologic networks? This talk will discuss our attempts to develop statistical methods for this purpose that include network reconstruction by Boolean networks, studies of yeast transcription factors and evolution of the yeast protein interaction network. Future developments regarding this direction will be discussed as well.

Minghua Deng, Prediction of kinase functional sites using hierarchical language model

  1. Predicting functional sites in kinases is an essential problem in biology. Both the functional sites and the relationship among the amino acids within the sites need to be understood. We develop an algorithm for kinase functional site prediction from sequence data, based on hierarchical stochastic language (HSL) modeling. The HSL model integrates the advantages of the word counting approach for motif finding together with the syntax. The HSL model first finds the keywords by the consensus of k-mers that characterize the functional sites, and then finds a stochastic grammar to constitute different types of sentences for each kinase functional family. By iteratively training on the data, our algorithm can detect kinase sub-families automatically and build an HSL model for each sub-family.

    We validate our approach in three ways. First, we compare the functional sites predicted by the HSL model with the patterns in PROSITE and the contacting sites in PDB. The overall average sensitivity/specificity of the HSL model are 83.5%/23.0% and 66.1%/79.9%, respectively. Second, we used 10-fold cross-validation to evaluate our functional prediction for kinases based on the predicted functional sites. Our method achieves both higher sensitivity (94.7%) and higher specificity (94.0%) in 10-fold cross-validation, compared to 94.5% and 85.8% for MEME. Finally, the HSL model automatically detects kinase sub-families. These sub-families fit well to the phylogenetic trees, indicating that our method is also applicable to kinase sequences with heterogeneous subsets sharing the same catalysis function.

Wentian Li, Causal Inference in Genetics

  1. Statistical correlation does not necessarily imply a causal relationship, and even a causal relationship can be uncertain in that it may point in either of two directions. Traditionally, no method was proposed for inferring causal relationships from data without temporal information; moreover, in genetics we know that the gene/genotype is the cause and the disease/phenotype is the consequence. Here I argue that for studying intermediate phenotypes and biomarkers, causal inference can be useful. The key component in a causal inference is the presence of a third variable and conditional correlations. Also, it is important that this third variable cannot be an effect. We have successfully applied Cooper's "local causality discovery" rule to a patient dataset collected in the North American Rheumatoid Arthritis Consortium (NARAC) with these three variables: two biomarkers -- rheumatoid factor (RF) and anti-cyclic citrullinated peptide (anti-CCP), and one genotype at the DRB1 locus on chromosome 6. Any two variables among these three variables are correlated with each other. The genotype is the third variable, which helps to determine that the anti-CCP biomarker is a "cause" and RF is an "effect". (Li et al., Bioinformatics, 22:1503-1507 (2006))

    This work is joint with Mingyi Wang, Patricia Irigoyen, and Peter Gregersen.
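The reasoning behind the rule can be sketched with partial correlations. The code below is an illustration only and not the analysis from the cited paper: it assumes linear relationships, takes the three pairwise correlations as given, and replaces formal significance tests with two hypothetical cut-offs (`marginal` and `conditional`). G stands for the variable that cannot be an effect (the genotype), and Y and Z for the two biomarkers.

```python
import math

def partial_corr(r_xz, r_xy, r_yz):
    """Correlation between X and Z after adjusting both for Y."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))

def lcd_direction(r_gy, r_gz, r_yz, marginal=0.1, conditional=0.05):
    """Return 'Y->Z', 'Z->Y', or None when the rule does not apply.

    The cut-offs are illustrative stand-ins for significance tests.
    """
    # The rule requires all three pairs to be correlated.
    if min(abs(r_gy), abs(r_gz), abs(r_yz)) < marginal:
        return None
    g_z_given_y = partial_corr(r_gz, r_gy, r_yz)
    g_y_given_z = partial_corr(r_gy, r_gz, r_yz)
    # If conditioning on Y removes the G-Z correlation, Y sits between G and Z.
    if abs(g_z_given_y) < conditional <= abs(g_y_given_z):
        return "Y->Z"
    if abs(g_y_given_z) < conditional <= abs(g_z_given_y):
        return "Z->Y"
    return None
```

For a chain G to Y to Z, the G-Z correlation equals the product of the other two, so `lcd_direction(0.5, 0.3, 0.6)` finds that conditioning on Y removes the G-Z association and reports Y as the cause of Z.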

Mark Levenstien, Are molecular haplotypes worth the time and expense? A cost-effective method for applying molecular haplotypes

  1. Because current molecular haplotyping methods are expensive and not amenable to automation, many researchers rely on statistical methods to infer haplotype pairs from multilocus genotypes, and subsequently treat these inferred haplotype pairs as observations. These procedures are prone to haplotype misclassification. We examine the effect of these misclassification errors on the false-positive rate and power for two association tests. These tests include the standard likelihood ratio test (LRTstd) and a likelihood ratio test that employs a double-sampling approach to allow for the misclassification inherent in the haplotype inference procedure (LRTae). We aim to determine the cost-benefit relationship of increasing the proportion of individuals with molecular haplotype measurements in addition to genotypes to raise the power gain of the LRTae over the LRTstd. This analysis should provide a guideline for determining the minimum number of molecular haplotypes required for desired power. Our simulations under the null hypothesis of equal haplotype frequencies in cases and controls indicate that (1) for each statistic, permutation methods maintain the correct type I error; and (2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are consistently misclassified throughout each entire dataset. Our simulations under the alternative hypothesis showed a significant power gain for the LRTae over the LRTstd for a subset of the parameter settings. Permutation methods should be used exclusively to determine significance for each statistic. For fixed cost, the power gain of the LRTae over the LRTstd varied depending on the relative costs of genotyping, molecular haplotyping, and phenotyping. The LRTae showed the greatest benefit over the LRTstd when the cost of phenotyping was very high relative to the cost of genotyping. This situation is likely to occur in a replication study as opposed to a whole-genome association study.

Jurg Ott, Gene-gene interactions in case-control studies

  1. In case-control association studies, after genotyping individuals for 100K to 500K SNPs, the initial analysis undertaken by most researchers is to test each SNP for association, that is, to test whether genotype frequencies and/or allele frequencies are different in case and control individuals. While these tests may be useful for single disease genes of strong effect, they disregard the multi-genic nature of complex traits. What would be important is to look at the association of multiple SNPs with disease.

    Several methods exist for multi-locus association analysis. One approach takes the sum over single-locus test statistics as an approximation to a multi-locus (multivariate) test statistic. This is an approach that is well-known in other fields of science (Manly 2006). For human case-control association studies such an approach has been implemented by Hoh et al. (2001) and shown to be more powerful than SNP-by-SNP analysis.

    For a comprehensive multi-locus association analysis, one may try to find disease-associated sets of genotypes at different loci (genotype pattern) but the sheer number of possibilities makes this an essentially impossible task. To reduce the complexity of such a multi-locus approach, one may restrict attention to two SNPs at a time so that it might be possible to examine all m(m - 1)/2 pairs of SNPs, where m is the total number of SNPs. Methods for the analysis of gene-gene interaction will be presented and discussed. The main problem here is to find interactions that are not contaminated by main effects, that is, by genotype or allele effects at one or both of the SNPs involved.
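To give a sense of scale for the pairwise search, the short sketch below evaluates m(m - 1)/2 for the two panel sizes mentioned at the start of this abstract. The function name `n_pairs` is just a label chosen here.

```python
import math

def n_pairs(m):
    """Number of unordered SNP pairs among m SNPs, i.e. m(m - 1)/2."""
    return math.comb(m, 2)

for m in (100_000, 500_000):
    print(f"{m:>7} SNPs -> {n_pairs(m):,} pairs")
# 100000 SNPs -> 4,999,950,000 pairs
# 500000 SNPs -> 124,999,750,000 pairs
```

Even restricted to pairs, a 500K panel requires on the order of 10^11 tests, so the cost of each pairwise test matters a great deal.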

    Hoh J, Wille A, Ott J (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11, 2115-2119

    Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology, 3rd Edition. Chapman & Hall/CRC
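The sum-of-statistics idea described in this abstract can be sketched as follows. This is a simplified illustration, not the published procedure: it sums an allele-based chi-square statistic over all SNPs without any trimming or weighting, and it assesses significance by permuting case/control labels. The function names, the genotype coding (minor-allele counts 0, 1, 2), and the label coding (1 for case, 0 for control) are assumptions.

```python
import random

def allele_chi2(genotypes, labels):
    """2x2 allele-count chi-square statistic for one SNP."""
    # counts[label] = [minor alleles, major alleles]
    counts = {0: [0, 0], 1: [0, 0]}
    for g, lab in zip(genotypes, labels):
        counts[lab][0] += g
        counts[lab][1] += 2 - g
    a, b = counts[1]
    c, d = counts[0]
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A monomorphic SNP carries no information; score it as zero.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def sum_statistic(snps, labels):
    """Sum of single-locus statistics over a list of SNP genotype vectors."""
    return sum(allele_chi2(g, labels) for g in snps)

def permutation_p_value(snps, labels, n_perm=1000, seed=0):
    """Permutation p-value for the sum statistic, shuffling labels."""
    rng = random.Random(seed)
    observed = sum_statistic(snps, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if sum_statistic(snps, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (exceed + 1) / (n_perm + 1)
```

Permuting the labels keeps the correlation between SNPs intact, so the null distribution of the sum does not depend on the SNPs being independent.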

Tian Zheng, Studying co-regulation and inter-regulation of genes via eQTL mapping

  1. eQTL mapping aims to find loci in the human genome that show linkage to or association with the expression of a gene in microarray hybridization experiments. Such loci may contain important information on the regulatory factors of the gene under study. In this talk, I will discuss co-regulation and inter-regulation patterns identified via similar strategies.

Wenjiang Fu, Maternal-fetal genotype incompatibility and health outcome

  1. Single nucleotide polymorphisms (SNPs) have been shown to play a major role in complex genetic disorders. SNP data have received increasing attention for genetic association studies, especially with the development of the HapMap project and genome-wide association studies. In contrast, only a few studies have been conducted to address maternal-fetal genotype (MFG) incompatibility. It has been reported that MFG incompatibility is a major risk factor for a number of genetic diseases, including rheumatoid arthritis, schizophrenia, etc. It is interesting to note that while an MFG "mismatch" at a certain locus (genomic position) may lead to higher risk for schizophrenia, an MFG "match" at a different locus may also lead to higher risk for the same disease. While the most frequently used statistical model for MFG incompatibility requires the availability of the parents-offspring triad (the genotypes of both parents and the offspring), the frequently missing father genotype data make this method difficult to use in many case-control studies in perinatal research. I will review MFG incompatibility and related health outcomes of interest. I will also demonstrate with a real example in perinatal research how to address MFG incompatibility when the father's genotype is not available.

Jianhua Guo, A two-step method to study haplotype analysis incorporating informatively missing genotype data

  1. Missing genotypes are a feature of most real data sets and cause considerable trouble for researchers in haplotype analysis. Some methods must eliminate individuals with missing data before the analysis. Other methods can deal with missing data, but their validity depends on the assumption that genotypes are missing at random, whereas the underlying missing data mechanism may be more complicated. Recently, Liu et al. [2006] showed that violating this assumption can seriously affect haplotype analysis, and they proposed a general missing data model to characterize missing data patterns and to perform haplotype analysis simultaneously. However, their method cannot easily be extended to other haplotype analysis studies or to the case of more markers. In this article, we propose a simpler method to perform haplotype analysis incorporating informatively missing data. Our method can be easily extended not only to the case of more markers but also to almost all existing haplotype analysis methods, with only minor modifications. Simulation studies show that our method works well in the presence of informatively missing genotype data and can reduce biases induced by missing genotypes or by incorrectly assuming that genotypes are missing at random.

    This work is joint with Wen-Sheng Zhu.

Xueya Zhou, Fine-scale analysis of recombination rate variation in the human genome

  1. Linkage disequilibrium (LD), the non-random association among alleles in a population, has been the focus of intense study in the hope of facilitating gene mapping. Earlier large-scale empirical surveys of LD among SNPs in the human genome revealed some salient features, including a discrete block-like pattern, which have since triggered efforts to understand the underlying process that gave rise to this pattern. Special attention has been paid to the variation in recombination rates.

    In the past several years, advances in statistical methods along with the increasing amount of genetic polymorphism data have made it possible to characterize the fine-scale recombination landscape across the human genome. Interesting findings include the identification of a large number of recombination hotspots, where recombination events preferentially aggregate compared with flanking regions. Complementary to pedigree analysis and sperm-typing experiments, population genetics methods are providing new insights into the nature of recombination in humans. In this talk, computational methods for inferring recombination rates and detecting hotspots will be reviewed, including two methods developed in our lab, and some major open questions will be discussed. Knowledge about the distribution of recombination events can provide opportunities to study the molecular mechanism and evolutionary impact of recombination.

    This work is joint with Jun Li, Zheng Ye, Xuegong Zhang.

Xiaowo Wang, The active evolution of microRNAs

  1. MicroRNAs (miRNAs) are a class of ~22nt long endogenous non-coding RNAs that play important regulatory roles in diverse organisms. Up to now, knowledge of the evolutionary properties of these crucial regulators has been limited. Most miRNAs were thought to be phylogenetically conserved, but recently a number of poorly conserved miRNAs have been reported and miRNA innovation has been shown to be an ongoing process. Through the characterization of a vertebrate-specific miRNA super family, we studied the evolutionary patterns of miRNAs in vertebrates. Relatively young miRNAs seem to evolve rapidly during a certain period following their emergence. Multiple lineage-specific expansions are observed. We also observed that the mature miRNAs may convert between the opposite stem arms following tandem duplications, which may make an important contribution to miRNA innovation. Our observations of miRNAs' complicated evolutionary patterns support the notion that these key regulatory molecules may play very active roles in evolution.

    This work is joint with Yanda Li.

Jasmine Zhou, Functions, Networks, and Phenotypes by Integrative Genomics Analysis

  1. The rapid accumulation of genomics data provides unprecedented opportunities to systematically infer gene functions, regulatory networks, and phenotype associations. In this talk, we present several graph-based data mining algorithms that integrate diverse genomics data, especially the vast amount of microarray data in the public repositories. A series of microarray data sets is modeled as a series of co-expression networks, in which we search for frequently occurring network patterns. Our integrative approach to functional annotation provides three major advantages over the commonly used microarray analysis methods: (1) it enhances signal-to-noise separation, (2) it identifies functionally related genes without co-expression, and (3) it provides a way to predict gene functions in a context-specific way. Furthermore, we show that frequently occurring co-expression clusters are more likely to represent transcriptional modules than clusters derived from a single microarray dataset. In addition, we propose the concept of "second-order correlation", which enables us to trace the upstream events of transcription cascades. Finally, we develop methods to systematically identify phenotype-specific network patterns and regulatory modules.
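
    As a rough illustration of the idea of modeling each data set as a co-expression network and then looking for recurrence, the sketch below keeps only gene pairs that are co-expressed in several data sets. It is a deliberately simplified stand-in, not the algorithms of the talk: it looks at single edges rather than larger network patterns, and the Pearson cutoff, the `min_support` parameter, and all function names are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(dataset, cutoff):
    """Edges of one co-expression network.

    dataset maps gene name -> expression profile; an edge joins two genes
    whose absolute correlation reaches the (assumed) cutoff.
    """
    return {
        frozenset((g, h))
        for g, h in combinations(sorted(dataset), 2)
        if abs(pearson(dataset[g], dataset[h])) >= cutoff
    }


def recurrent_edges(datasets, cutoff, min_support):
    """Edges that appear in at least min_support of the networks."""
    counts = Counter()
    for dataset in datasets:
        counts.update(coexpression_edges(dataset, cutoff))
    return {edge for edge, c in counts.items() if c >= min_support}
```

    An edge seen in only one data set is dropped, which is the intuition behind the claim that recurring patterns separate signal from noise better than any single data set.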

Zewei Luo, A robust statistical approach for identifying and genotyping single feature polymorphisms in gene expression profiled from Affymetrix arrays

  1. The recent development of Affymetrix chips designed from assembled EST sequences in a wide range of species has spawned considerable interest in identifying single feature polymorphisms (SFPs) from transcriptome data. SFPs are valuable genetic markers which potentially offer a physical link to the structural genes themselves. However, most current SFP prediction methodologies were developed for sequenced species, although SFPs are particularly valuable for species with complex and unsequenced genomes. Here we report a statistical approach for identifying and genotyping SFPs in gene expression profiled from Affymetrix arrays. The efficiency of the method is demonstrated by analyzing datasets from yeast and barley microarray experiments and by comparison with methods that have recently appeared in the literature.

Hsin-Chou Yang, Pooled DNA analysis using oligonucleotide arrays

  1. Microarray-based pooled DNA experiments that combine the merits of DNA pooling and gene chip technology constitute a pivotal advance in biotechnology. This new technique uses pooled DNA, thereby reducing costs associated with the typing of DNA from numerous individuals. Moreover, use of an oligonucleotide gene chip reduces costs related to processing DNA (e.g., primers, reagents). Thus, the technique provides an overall cost-effective solution for large-scale genomic/genetic research. Few publicly shared tools are available to systematically analyze the rapidly accumulating volume of whole-genome pooled DNA data. Here, we propose a generalized concept of pooled DNA and present an innovative user-friendly tool named Microarray Pooled DNA Analyzer (MPDA) that we developed to analyze hybridization intensity data from microarray-based pooled DNA experiments. MPDA enables whole-genome DNA preferential amplification/hybridization analysis, allele frequency estimation, association mapping, and allelic imbalance detection, and it permits integration with shared data resources online. Results of four whole-genome data analyses illustrate these major functionalities. Graphic and numerical outputs from MPDA support global and detailed inspection of large amounts of genomic data. These merits make MPDA useful for identifying disease susceptibility genes and detecting chromosomal aberrations across the human genome.
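
    To make the link between preferential amplification/hybridization and allele frequency estimation concrete, here is a minimal sketch of a correction that is widely used in the DNA pooling literature. It is not claimed to be what MPDA computes internally. The assumption is that the unequal response of the two alleles is summarized by a coefficient k, estimated as the mean A/B intensity ratio in individuals known to be heterozygous; the function names are hypothetical.

```python
def amplification_coefficient(het_intensities):
    """Estimate k as the mean A/B intensity ratio over known heterozygotes.

    het_intensities is a list of (intensity_A, intensity_B) pairs. With no
    preferential amplification or hybridization, k would be close to 1.
    """
    ratios = [a / b for a, b in het_intensities]
    return sum(ratios) / len(ratios)


def pooled_allele_frequency(h_a, h_b, k):
    """Frequency of allele A in a pool, corrected for unequal allele response.

    h_a and h_b are the pooled intensities of alleles A and B; the B signal
    is rescaled by k before the proportion is taken.
    """
    return h_a / (h_a + k * h_b)
```

    With pooled intensities of 3.0 and 2.0, the uncorrected estimate (k = 1) is 0.6, while k = 1.5 brings it down to 0.5, which shows how much an uncorrected allele bias can distort a pooled estimate.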

Mengling Liu, Interval mapping of quantitative trait loci using mixture cure model

  1. When censored time-to-event data are used to map quantitative trait loci (QTL), the existence of nonsusceptible subjects entails extra challenges. If the heterogeneous susceptibility is ignored or inappropriately handled, we may either fail to detect the responsible genetic factors or find spuriously significant locations. In this article, an interval mapping method based on parametric mixture cure models is proposed, which takes nonsusceptible subjects into consideration. The proposed model can be used to detect the QTL that are responsible for differential susceptibility and/or time-to-event trait distribution. In particular, we propose a likelihood-based testing procedure with genome-wide significance levels calculated using a resampling method. The performance of the proposed method and the importance of considering the heterogeneous susceptibility are demonstrated by simulation studies and an application to survival data from an experiment on mice infected with Listeria monocytogenes.
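
    The likelihood behind such a model can be sketched as follows. In a mixture cure model a subject is susceptible with probability p, so an observed event contributes p f(t) and a censored time contributes (1 - p) + p S(t). The abstract only says the model is parametric; the Weibull form, the genotype-specific parameter tuples, and all function names below are assumptions made for illustration. At a putative QTL position the genotype is unobserved, so each subject's likelihood is averaged over genotypes, weighted by their probabilities given the flanking markers.

```python
import math


def weibull_sf(t, shape, scale):
    """Survival function S(t) of a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))


def weibull_pdf(t, shape, scale):
    """Density f(t) of a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_sf(t, shape, scale)


def cure_likelihood(t, event, p, shape, scale):
    """Likelihood of one observation under a mixture cure model.

    p is the probability of being susceptible. An event must come from a
    susceptible subject; a censored subject is either nonsusceptible or
    susceptible but still event-free at time t.
    """
    if event:
        return p * weibull_pdf(t, shape, scale)
    return (1.0 - p) + p * weibull_sf(t, shape, scale)


def interval_loglik(times, events, genotype_probs, params):
    """Log-likelihood at one putative QTL position.

    genotype_probs[i] maps genotype -> P(genotype | flanking markers) for
    subject i; params maps genotype -> (p, shape, scale).
    """
    total = 0.0
    for t, d, probs in zip(times, events, genotype_probs):
        lik = sum(w * cure_likelihood(t, d, *params[g]) for g, w in probs.items())
        total += math.log(lik)
    return total
```

    A likelihood-based test would compare the maximized value of this function with its value when all genotypes share one parameter tuple; the resampling step for genome-wide thresholds is not shown.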

Juan Liu, Biclustering of gene expression data based on bucketing technique

  1. With the rapid advancement of genome sequencing projects, microarrays and related high-throughput technologies have become key factors in the study of global aspects of biological systems. Generally speaking, gene expression data are stored in a matrix, in which each row corresponds to a gene, each column corresponds to a condition, and each entry denotes the expression level of the gene under the specific condition. When gene expression data are analyzed, commonly pursued objectives are to cluster genes over all conditions or to cluster conditions over all genes. The premise of these objectives is that similar genes exhibit similar behaviors over all conditions, or vice versa. The assumption is reasonable for the analysis of small data sets, but it limits the utility of these methods for the analysis of large data sets, for two reasons. First, a subset of genes can be coregulated and coexpressed only under certain experimental conditions, but behave almost independently under other conditions. Second, genes may participate in more than one function, resulting in one regulation pattern in one context and a different pattern in another. Thus, a subset of genes should be grouped into a cluster not over all conditions but over a subset of conditions. Similarly, in a large gene expression matrix, biologically similar conditions may be grouped more readily by focusing on specific genes. To address this problem, biclustering has been proposed in recent years. Biclustering is the process of grouping a subset of objects over a subset of their attributes into a class, in which each object is similar and each attribute is related to the classification. Put simply, biclustering clusters the vectors and their attributes simultaneously.

    We present a novel algorithm, the bucketing and extending algorithm (BEA), to bicluster gene expression data. The algorithm consists of three steps to find the optimal bicluster B(I,J) in the gene expression matrix X(G,C): bucketing, finding the core submatrix, and greedy extension. The purpose of the bucketing process is to select from the given matrix X(G,C) a submatrix W(U,V), called the raw submatrix, which overlaps the bicluster B(I,J) with high probability. The purpose of the core-finding process is to select from the raw submatrix W(U,V) a submatrix O(M,N) of the bicluster B(I,J), called the core submatrix. Since the core submatrix O(M,N) is only a part of the bicluster B(I,J), we use a greedy strategy to extend O(M,N) into the bicluster B(I,J) in the greedy extension process. For different types of biclusters, we adopt different methods to realize the three processes. Simulation tests show that our algorithm is robust and can properly find the implanted biclusters, including overlapping biclusters. Tests on real data sets show that our algorithm outperforms most other methods. Applications of biclustering will also be discussed.
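
    The greedy extension step can be illustrated with a small sketch. The abstract does not specify the extension criterion, so the one used here is an assumption chosen for a constant-value bicluster: a row (or column) is added when all of its entries on the current columns (or rows) lie within a tolerance `tol` of the current bicluster mean. The function name, the tolerance, and the criterion are hypothetical and are not the BEA implementation; bucketing and core finding are not shown.

```python
def greedy_extend(X, core_rows, core_cols, tol):
    """Grow a core submatrix O(M,N) of X into a larger bicluster B(I,J).

    X is a list of rows (genes) over columns (conditions). In each pass the
    mean of the current bicluster is computed once, then every outside row
    and column is tested against it; passes repeat until nothing is added.
    """
    rows, cols = list(core_rows), list(core_cols)
    changed = True
    while changed:
        changed = False
        values = [X[i][j] for i in rows for j in cols]
        mean = sum(values) / len(values)
        # Try to add genes that match the bicluster on its current conditions.
        for i in range(len(X)):
            if i not in rows and all(abs(X[i][j] - mean) <= tol for j in cols):
                rows.append(i)
                changed = True
        # Try to add conditions that match on the (possibly enlarged) gene set.
        for j in range(len(X[0])):
            if j not in cols and all(abs(X[i][j] - mean) <= tol for i in rows):
                cols.append(j)
                changed = True
    return sorted(rows), sorted(cols)
```

    Starting from a 2 x 2 core inside an implanted 3 x 3 block of equal values, the loop first picks up the third gene and then the third condition, and stops once no remaining row or column fits.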