The research in our laboratories is focused on the following three areas:
1. Bioinformatics: The development of high throughput genomic technologies has created many exciting opportunities as well as analysis challenges. Our group has developed some of the most widely used and cited bioinformatics methods to analyze high throughput data. Our transcription factor motif finding tools have been cited over 1500 times and our ChIP-chip/seq peak callers have over 6,000 registered users. We will continue to develop novel computational algorithms to analyze new high throughput data, such as ChIP-seq (MACS, CEAS), RIP-seq, DNase-seq, MNase-seq (NPS), DNA-seq, and RNA-seq (Gfold). We will also build integrative analysis pipelines (Cistrome) to better help experimental biologists, and conduct efficient data integration to better mine the hidden biological insights from publicly available high throughput data and refine hypotheses. Finally, we will integrate good genomics experimental design and bioinformatics analyses to best utilize the newest technologies in gene regulation studies.
2. Epigenetics: Epigenetics plays an important role in gene regulation and includes diverse topics such as DNA methylation, nucleosome positioning, histone marks, epigenetic enzymes, and higher order chromatin interactions. We and colleagues generated the first high throughput nucleosome map in the human genome, identified monovalent genes in early embryonic development, and found the relationship between H3K36me3 exon enrichment and co-transcriptional splicing. We will focus on two major areas of epigenetic research. The first is to use the dynamics of histone mark ChIP-seq and DNase-seq to infer in vivo transcription factor binding and understand transcription regulatory networks. The second is to use genome-wide approaches to understand the specificity and mechanism of epigenetic enzymes and lncRNAs (with epigenetic function). Despite intensive research efforts, our knowledge of these areas is still limited, so there will be exciting opportunities in the future.
3. Cancer: As one in three people in developed countries will get cancer, research on the mechanisms and treatments of cancer will become increasingly important. We and colleagues identified the function of estrogen receptor, androgen receptor, and FoxA1 in breast and prostate cancers, TET1 in leukemia, and the DREAM complex in cell cycle control, and found metabolic and autoimmune genes as signatures associated with cancer initiation. Cancer is a genetic disease amenable to research using genomic approaches. First, we will integrate publicly available high throughput data to better understand cancer pathways. Recently, many cancer studies have found mutations or misregulation in epigenetic enzymes. Many pharmaceutical and biotech companies as well as academic scientists are actively developing cancer drugs targeting epigenetic enzymes. We will study the genome-wide function and response of cancer cells to epigenetic drugs, and identify cancer patients who might respond better to certain cancer drugs based on the genetic, epigenetic, and gene expression status of their tumor.
"Transcription factor binding events often leave a trace pattern of nucleosome occupancy changes in which nucleosomes flanking the binding site increase in occupancy while those in the vicinity of the binding site itself are displaced. Genome wide information on enhancer proximal nucleosome occupancy can be readily acquired using ChIP-seq targeting enhancer related histone modifications such as H3K4me2. Here we present a software package, BINOCh, that allows biologists to use such data to infer the identity of key transcription factors that regulate the response of a cell to a stimulus or determine a program of differentiation."
"The development of high throughput genome sequencing and gene expression techniques gives rise to the demand for data-mining tools. BioProspector, a C program using a Gibbs sampling strategy, examines the upstream region of genes in the same gene expression pattern group and looks for regulatory sequence motifs. BioProspector uses Markov background to model the base dependencies of non-motif bases, which greatly improved the specificity of the reported motifs. The parameters of the Markov background model are either estimated from user-specified sequences or pre-computed from the whole genome sequences. A new motif scoring function is adopted to allow each input sequences to contain zero to multiple copies of the motif. In addition, BioProspector can model gapped motifs and motifs with palindromic patterns, which are prevalent motif patterns in prokaryotes. All these modifications greatly improve the performance of the program. Besides showing preliminary success in finding the binding motifs for S. cerevisiae RAP1, B. subtilis RNA polymerase, and E. coli CRP, we have used BioProspector to find s54 motif from M. xanthus genome, many B. subtilis motifs from DBTBS collection of promoters, and motifs from yeast expression data. "
"We present a tool designed to characterize genome-wide protein-DNA interaction patterns from ChIP-chip and ChIP-Seq of both sharp and broad binding factors. As a stand-alone extension of our web application CEAS (Cis-regulatory Element Annotation System), it provides statistics on ChIP enrichment at important genome features such as specific chromosome, promoters, gene bodies, or exons, and infers genes most likely to be regulated by a binding factor. CEAS also enables biologists to visualize the average ChIP enrichment signals over specific genomic features, allowing continuous and broad ChIP enrichment to be perceived which might be too subtle to detect from ChIP peaks alone."
"GFOLD is especially useful when no replicate is available. GFOLD generalizes the fold change by considering the posterior distribution of log fold change, such that each gene is assigned a reliable fold change. It overcomes the shortcoming of p-value that measures the significance of whether a gene is differentially expressed under different conditions instead of measuring relative expression changes, which are more interesting in many studies. It also overcomes the shortcoming of fold change that suffers from the fact that the fold change of genes with low read count are not so reliable as that of genes with high read count, even these two genes show the same fold change."
"HMMTiling is a comprehensive software package for tiling array data analysis. It includes command line python applications for filtering, mapping, quantile-normalizing and enriched-region identification from ChIP-chip experiments on tiling arrays. HMMTiling models the behavior of each individual probe as a baseline for each ChIP or control experiment to be compared with. It then uses a Hidden Markov Model to identify the enrichment probability at each probe location, thus can determine the exact enriched regions."
"While chromatin immunoprecipitation followed by cDNA microarray (ChIP-on-chip) has become a popular procedure for studying genome-wide protein-DNA interactions and transcription regulation, it can only map the probable protein-DNA interaction loci within 1-2kb resolution. To pinpoint the interaction sites down to the base pair level, we introduce a novel computational method, Motif Discovery scan (MDscan), that examines the ChIP-array selected sequences and searches for DNA sequence motifs representing the protein-DNA interaction sites. MDscan combines the advantages of two widely adopted motif search strategies, word enumeration and position-specific weight matrix updating, and incorporates the ChIP enrichment information to accelerate the search and enhance its success rate. The intuition is to first search for similar words appearing in the sequences more likely to contain the motif (highly ChIP-enriched sequences) because these sequences have higher signal to noise ratio. Words in each similarity group can initialize a position specific motif matrix and the motif can be updated and refined with the whole input sequences (all ChIP-selected targets). The method showed both speed and accuracy advantages compared to several established motif-finding algorithms in both simulation and published yeast ChIP-on-chip experiments. MDscan can be used not only with the ChIP experiments, but also to find DNA motifs in other experiments in which a subgroup of the sequences can be inferred to contain relatively more abundant motif sites."
"We introduce a new software tool, the Microarray Blob Remover (MBR), which allows rapid visualization, detection, and removal of blob defects of a variety of sizes and shapes from different types of microarrays using their .CEL files. Removal of the affected probes in the blob-defects using MBR was shown to significantly improve sensitivity and FDR compared to leaving the affected probes in the analysis."
"The proposed method, together with robust estimates of the model parameters, is shown to perform superbly on published data sets. Accompanying the normalization method, a robust algorithm for detecting peak regions is formulated and also shown to perform well compared to other approaches. The tools presented herein have been implemented for NimbleGen tiling arrays as a stand-alone Java program, which can also display various plots of statistical analysis for quality control of experiments." "A model-based algorithm for analyzing 2-color microarrays."
"Next generation parallel sequencing technologies made chromatin immunoprecipitation followed by sequencing (ChIP-Seq) a popular strategy to study genome-wide protein-DNA interactions, while creating challenges for analysis algorithms. We present Model-based Analysis of ChIP-Seq (MACS) on short reads sequencers such as Genome Analyzer (Illumina / Solexa). MACS empirically models the length of the sequenced ChIP fragments, which tends to be shorter than sonication or library construction size estimates, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome sequence, allowing for more sensitive and robust prediction. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, is publicly available open source, and can be used for ChIP-Seq with or without control samples."
"A model-based algorithm for finding enriched regions in ChIP-Chip experiments" "We propose a novel analysis algorithm MAT to reliably detect regions enriched by transcription factor Chromatin ImmunoPrecipitation (ChIP) on Affymetrix tiling arrays (chip). MAT models the baseline probe behavior by considering probe sequence and copy number on each array. The correlation between the baseline probe model estimates and the observed measurements can be as high as 0.72. MAT standardizes the probe value via the probe model, eliminating the need for sample normalization. A novel scoring function is applied to the standardized data to identify the ChIP-enriched regions, which allows robust p-value and false discovery rate calculations. MAT can detect ChIP-regions from a single ChIP sample, multiple ChIP samples, or multiple ChIP samples with controls with increasing accuracy. Based on the mock ChIP samples provided by the ENCODE consortium, MAT achieved 100% accuracy (0 false positive and 0 false negative) for the target detection of those spike-in plasmids, which are 2,4,8,-256 fold enriched compared with the genomic background. Quantitatively, MAT yielded a 0.95 correlation coefficient between the spike-in DNA concentration and the predicted score. Upon further analysis, MAT identified more than 70% of the true targets at 5% FDR cutoff from a single ChIP sample. This is a valuable feature for quickly testing the protocols and antibodies for ChIP-chip, and easily identifying ChIP-chip samples with questionable quality."
"A signal processing-based algorithm for identifying positioned nucleosomes fromsequencing experiments at the nucleosome level" "NPS is a python software package that can identify nucleosome positions given histone-modification ChIP-seq or nucleosome sequencing at the nucleosome level. NPS obtains continuous wave-form that represents the enrichment of histone modifications (or nucleosomes) by extending each tag (25nt, Solexa) to 150nt in the 3’ direction and taking the middle 75, and detects the positions of nucleosomes based on Laplacian of Gaussian (LOG) edge detection. The p value of each detection was estimated using Poisson approximation and the user can decide a cut-off for the final selection of nucleosome positions. In case of histone modification, the sequence tags are regrouped by different types of histone modification after nucleosome positioning and then the p-value of a particular histone modification at a positioned nucleosome was calculated based on the tag count of that histone modification in the nucleosome region using Poisson distribution, similar to the method mentioned above. The user also can select a cut-off of p value in histone modification assignment."
"Gene expression analysis pipelines for SAGE-Seq including tag mapping, novel normalization method using empirical Bayes and differential gene analysis" "SAGE-Express is a package of pipelines to process SAGE-Seq high throughput gene expression data set. The pipelines are maily composed of 3 parts:
1. Mapping of sense and antisense strands of mitochondrial and RefSeq genes;
2. Library normalization using empirical Bayes method;
3. Identification of differentially expressed genes. "