Research
My research area is computational
genomics. The broad goal of my research is to develop efficient and accurate
methodologies for the analysis of genomic data. In recent years, genomics has rapidly become
one of the main ways in which to study biology. Key milestones along this
development were the draft release of the human genome in 2000, the subsequent
sequencing of tens of vertebrate genomes including chimpanzee, mouse, rat, dog,
and chicken, with which the human genome can be compared, sequencing of tens of
fruitfly, yeast, plant, and hundreds of bacterial species, and the availability
of global datasets on the expression and interactions of genes in human and key
laboratory organisms. To enable the analysis of such genomic data, drawing from
string algorithms, combinatorial optimization, machine learning, and efficient
software development, we build methods and systems for a range of bioinformatics
problems: sequence assembly, whole-genome multiple alignment, protein sequence
alignment, RNA structure prediction, gene finding, motif finding, protein
association network alignment, and population ancestry inference on genotype
data. All our tools are
publicly available as source code and executables. Genomic Sequence Assembly Whole-Genome Assembly of Pyrosequencing Reads. Today there is a significant push by both academia and industry to develop new technologies for genomic sequencing. Such technologies ideally should be cheaper than Sanger sequencing by orders of magnitude. That would enable determination of the DNA of each individual for medical applications, or to sequence a very large number of animals for tracing the evolutionary path of every functional element in our genomes, or to comprehensively sequence environmental samples so as to see how microorganisms in environments such as soil or within our bodies, respond to changes in conditions and evolve over time. Our group collaborates with Mostafa Ronaghi at the Stanford Genome Technology Center; Ronaghi's group develops a pyrosequencing-based technology for genomic sequencing that promises orders-of-magnitude cheaper sequencing, and we develop sequencing protocols and algorithms for de novo fragment assembly of data that come from pyrosequencing.
We developed SHRAP [2], a method for assembling complex genomes using short
DNA fragments, and demonstrated with simulations that assembling a whole
mammalian genome with such fragments is feasible. Gene Finding Once a genome is available, one of the
first analysis tasks is to annotate protein-coding genes. We developed
CONTRAST, the first system
for vertebrate gene prediction that successfully uses multiple alignments to
achieve greater accuracy than methods based on two-species alignments [16].
CONTRAST predicts 65% more
exact genes than the previous state-of-the-art method, misses 46% fewer
exons, and displays comparable gains in specificity. Methodologically
CONTRAST employs SVMs for
scoring various coding region boundaries, a conditional random field (CRF)
for integrating all features of gene structure, and a novel objective
function for training, based on maximizing the accuracy of detected coding
region boundaries. We applied
CONTRAST in the context of the Drosophila Comparative Sequencing and
Analysis Consortium to contribute whole-genome gene predictions for 12
Drosophila genomes [17]. In related methodological work we developed the
maximum parse accuracy procedure for training CRFs [18] as an
alternative to max-margin methods and are planning to employ it for gene
finding. Comparative Genomics Genomic Alignment Methods. We have been developing algorithms and systems for aligning DNA sequences. LAGAN is a global pairwise aligner; MLAGAN is a multiple aligner
[3], and SLAGAN [13] is a tool that simultaneously aligns and finds microrearrangements. Our systems are available at http://lagan.stanford.edu. See also TreeRefiner, a tool for refining multiple sequence alignments. We have applied our tools to test the power of comparative
sequence analyses [4, 5] prior to the sequencing of mammalian genomes that is
now undergoing, in order to help establish a set of most informative mammals to
sequence. We aligned the whole human, mouse, and rat genomes as part of the
International Rat Sequencing Consortium [6, 7], three fungal genomes [8] as part
of the Aspergillus consortium, and 1% of the human genome with 23 mammals as
part of the ENCODE consortium [9, 10, 11]. Whole Genome Alignments. In collaboration with Inna Dubchak's group, and with Arend Sidow's group, we aligned the entire human, mouse, and rat genomes, and are studying their evolution. These alignments are accessible through the VISTA Browser gateway. Inna Dubchak also provides visualizations of our ENCODE alignments.
Currently we are developing a new tool, Revolver, a new tool for automated
whole-genome multiple mammalian alignments, and Evolver, a simulator of
mammalian genome evolution. Multiple Local Alignment. More and more, genomic databases will consist of multiple alignments of related species (mammals, flies, yeasts) rather than individual genomes. Therefore, one may wish to search a query sequence against a database of alignments. As it turns out, BLAST-style local alignment can be done with better speed and sensitivity in this setting. Typhon
[14] is a new tool that achieves that. Protein Alignments. Protein multiple alignment is challenging, especially at the "twilight" zone of low sequence identity.
We developed ProbCons [19, 20], a tool that exhibits high accuracy and practical running times in benchmark datasets. ProbCons is based on a hidden Markov model, and on a probabilistic method we call probabilistic consistency that leverages comparisons to all sequences of the set, during progressive alignment of two (multi-)sequences. ProbCons also uses maximum expected accuracy, rather than Viterbi, when computing alignments. ProbCons is available to run or download at http://probcons.stanford.edu. The work was presented as an abstract to ISMB/ECCB 2004, and received the Best Paper Award together with Muscle, which is a really scalable and accurate protein aligner. Subsequently, we developed an improved protein aligner,
CONTRAlign [21], which
combined the ProbCons algorithm with the first discriminative training framework
for protein alignment, to provide further accuracy gains without loss of
efficiency. CONTRAlign can use a diverse set of features such as
hydrophobicity, solvent accessibility, and secondary structure information;
users can specify feature sets and train their own alignment models easily and
rigorously using CONTRAlign. Protein Alignment
with Repeats and Rearrangements. We developed
ProDA [22], a multiple alignment program for
proteins that have different domain architectures and therefore their homology
may exhibit repeats and rearrangements. RNA Structure Prediction Together with proteins, RNA structures are
the key biomolecules that perform regulatory, catalytic, structural and
other biological functions in a cell. RNA structure is determined primarily
by the base pairing of nucleotides in the RNA sequence, and computational
methods have been among the main ways to predict RNA structure. For several
decades, free-energy minimization methods have been the dominant
computational strategy. More recently, stochastic context-free grammars (SCFGs)
have emerged as an alternative probabilistic method, enabling fully
automated statistical learning algorithms to derive model parameters.
However, energy minimization methods such as Mfold have clearly dominated
probabilistic methods in predicting structures accurately. We developed
CONTRAfold [23], a
secondary structure prediction method based on conditional log-linear models
(CLLMs), which is the first method enabling automated parameter learning to
outperform energy minimization methods, and in particular Mfold which had
been the best method for almost two decades. Our result demonstrates that
statistical learning procedures provide an effective alternative to
laborious experimental measurements of thermodynamic parameters for RNA
secondary structure prediction. In earlier work, we defined a multiple RNA
structure alignment model of theoretical interest, proved that the computational
problem of structural alignment is APX-hard, and provided an O(log2N)
approximation algorithm [24]. Returning to multiple RNA structural alignment, we
recently developed CONTRAfold-multi, which is a practical method of the highest
accuracy to-date for aligning multiple RNA structures. This method is based on
our CLLM algorithm for RNA folding [23] in combination with our probabilistic
consistency algorithm [21] for sequence alignment. Motif Finding Regulatory motifs are short DNA sequences
that reside in the vicinity of genes, bind transcription factors, and
control the binding of RNA polymerase and consequent transcription of a
gene; therefore motifs control the amount and timing of gene expression. The
computational problems of modeling motifs, finding over-represented motifs
in genomic regions, and scanning a genome for specific motifs, have been
studied extensively. The most popular motif models are based on the
position-specific scoring matrix (PSSM) model and its variations: a
motif is a probabilistic word m = m1…mk, where each
letter mi is a probability distribution over {A, C, G, T}.
Pairwise letter dependencies and more complex models have also been tried.
Learning richer models, however, requires large numbers of motif instances.
The most widely used motif finding methods search for a PSSM with either
Expectation Maximization or Gibbs Sampling. We developed
CompareProspector [25, 26],
the first Gibbs sampler for motif finding that successfully leveraged
cross-species comparisons during the search procedure. Next, we departed from
the PSSM framework and modeled a motif as simply a bag, or multiset, consisting
of all known occurrences of the motif. This motif model grows with additional
known instances, and sidesteps the difficult task of parameterizing complex
motif substructure. We incorporated this model in
MotifScan [27], a method for
searching a genome for a known motif, and showed that our model outperforms
PSSMs. Motivated by this new model, we developed MotifCut [28], a method that
translates motif finding into finding maximum-density subgraphs in a graph that
represents the input sequences. This is a convex optimization problem; with
heuristic optimizations we solve it in practice in sub-quadratic time.
MotifCut is practical and
significantly outperforms previous state-of-the-art methods. Moreover, because
our methodology is fundamentally different than Gibbs samplers and EM-based
methods, MotifCut often finds motifs where all other methods fail, and
complements the existing motif finding toolbox. Protein Association Networks Today we have the complete genomes of
almost 500 microbes and tens of mammals and other organisms. The first steps
in learning biology from these genomes are finding genes and regulatory
elements, and comparing genomes and proteins across species. An important
next step is to discover which groups of proteins participate in the same
functional processes. In protein association networks, proteins are
nodes and edges connect pairs of proteins that are involved in the same
functional process, such as a metabolic pathway or multiprotein complex. A
new and rapidly growing area of biocomputation involves integrating such
networks from diverse systems-biology and sequence data (network
integration), and comparing networks across species to reveal common
functional processes (network alignment). We developed a novel
network integration algorithm, SRINI [29], which we recently applied to all
sequenced microbes to obtain networks available online at
http://networks.stanford.edu
[30]. To compare networks across multiple species, we developed a network
alignment algorithm, Graemlin
[31], which is the first algorithm capable of aligning more than 3 networks,
and provides orders-of-magnitude speedup and significantly better accuracy
than previous approaches for network alignment. Population Genetics Today we are experiencing a revolution in
our ability to inexpensively obtain personalized genomic information, in the
form of genotyping chips or whole-genome resequencing. Challenging
computational problems are emerging, with enormous potential impact in
medicine and in understanding our genetic heritage. We recently addressed
the problem of ancestry reconstruction. The genome of an individual is a
mosaic of chromosomal blocks, each following the statistical properties of
variations seen in the populations to which the individual’s ancestors
belong. We developed HAPAA [32], a
novel method for inferring the population ancestry of each chromosomal
block, given the genotype of an individual.
HAPAA employs a non-parametric
representation of allelic variation in the source populations, which in
contrast to previous methods, captures the signal of linkage disequilibrium.
As a result, HAPAA is substantially
more accurate, and finds the source populations of chromosomal blocks
contributed to an individual’s genome by ancestors that existed 20 or more
generations ago. Microarrays We explored the application of Independent Component Analysis (ICA) to microarrays. ICA provides a model of underlying biological processes within a cell. Based on the ICA model, genes can be clustered according to their loads in the derived components. This clustering technique demonstrated remarkable performance in a variety of datasets. Our software and results can be found at http://www.stanford.edu/~silee/ICA/. We are also working on methods that combine comparative genomics and microarrays to computationally find motifs that are putative sites of gene cis-regulation.
References
1.
Batzoglou S, Jaffe D, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B,
Mesirov JP, Lander ES. ARACHNE: A whole genome shotgun assembler. Genome
Research 12:177-189, 2002.
2.
Sundquist A, Ronaghi M, Tang H, Pevzner P, Batzoglou S. Whole-genome sequencing
and assembly with high-throughput short-read technologies. PLOS One
2(5): e484, 2007.
3.
Brudno M, et al. LAGAN and Multi-LAGAN: Efficient tools for large-scale
multiple alignment of genomic DNA. Genome Research 13: 721-731,
2003.
http://lagan.stanford.edu.
4.
Cooper GM, et al. Quantitative estimates of sequence divergence for comparative
analyses of mammalian genomes. Genome Research 13:813-820, 2003.
5.
Cooper GM, Brudno M, Stone ES, Dubchak I, Batzoglou S, Sidow A.
Characterization of evolutionary rates and constraints in three mammalian
genomes. Genome Research 14:539–548, 2004.
6.
Rat Genome Sequencing Project Consortium (RGSPC). Genome sequence of the Brown
Norway Rat yields insights into mammalian evolution. Nature 428:493–521,
2004.
7.
Brudno M, et al. Automated whole-genome multiple alignment of Rat, Mouse, and
Human. Genome Research 14:685–692, 2004.
8.
Galagan JE, et al. Sequencing of Aspergillus nidulans and comparative
analysis with A. fumigatus and A. oryzae. Nature, 438:
1105–1115, 2005.
9.
The ENCODE Project Consortium. The ENCODE Project. Science
306:636-640, 2004.
10.
The ENCODE Project Consortium. Identification and analysis of functional
elements in 1% of the human genome by the ENCODE pilot project. Nature
447: 799-816, 2007.
11.
Margulies et al. Analyses of deep mammalian sequence alignments and
constraint predictions for 1% of the human genome. Genome Research,
17(6): 760-774, 2007.
12.
Brudno M, Malde S, Poliakov A, Do C, Couronne O, Dubchak I, Batzoglou S. Glocal
alignment: finding rearrangements during alignment. Proceedings of the ISMB
2003, Bioinformatics 19: 54i-62i, 2003.
13.
Sundararajan M, Brudno M, Small K, Sidow A, Batzoglou S. Chaining algorithms
for alignment of draft sequence. Proceedings of the 4th Workshop
on Algorithms in Bioinformatics (WABI2004).
14.
Flannick J, Batzoglou S. Using multiple alignments to improve seeded local
alignment algorithms. Nucleic Acids Research, 33(14): 4563–4577,
2005.
http://typhon.stanford.edu.
15.
Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES.
Human and mouse gene structure: comparative analysis and application to exon
prediction. Genome Research 10:950-958, 2000.
16.
Gross SS, Do CB, Sirota M, Batzoglou S. CONTRAST: A discriminative,
phylogeny-free approach to multiple informant de novo gene prediction. Genome
Biology, in press.
http://contra.stanford.edu/contrast.
17.
Drosophila Comparative Genome
Sequencing and Analysis Consortium. Evolution of genes and genomes in the
context of the Drosophila phylogeny. Nature, in press.
18.
Gross SS, Russakovsky O, Do CB, Batzoglou S. Training conditional random fields
for maximum parse accuracy. NIPS 2006.
19.
Do CB, Brudno M, Batzoglou S. ProbCons: Probabilistic consistency-based
multiple alignment of amino acid sequences. Best Paper Award, ISMB/ECCB 2004.
http://probcons.stanford.edu.
20.
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: probabilistic
consistency-based multiple sequence alignment. Genome Research 15:330–340,
2005.
21.
Do CB, Gross SS, Batzoglou S. CONTRAlign: Discriminative Training for Protein
Sequence Alignment. Proceedings of the Tenth Annual International Conference
on Computational Molecular Biology, (RECOMB 2006), pp. 160–164.
http://contra.stanford.edu/contralign.
22.
Phuong TM, Do CB, Edgar RC, Batzoglou S. Multiple alignment of protein
sequences with repeats and rearrangements. Nucleic Acids Research,
34(20): 5932-5942, 2006.
23.
Do CB, Woods DA, Batzoglou S. CONTRAfold: RNA Secondary Structure Prediction
without Physics-Based Models. Best Paper Award, ISMB 2006.
http://contra.stanford.edu/contrafold.
24.
Davydov E, Batzoglou S. A computational model for RNA multiple structural
alignment. Combinatorial Pattern Matching 2004.
25.
Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S. Eukaryotic regulatory element
conservation and identification using comparative genomics. Genome Research
14:451-458, 2004.
http://compareprospector.stanford.edu.
26.
Liu Y, Wei L, Batzoglou S, Brutlag DL, Liu JS, Liu XS. A suite of web-based
programs to search for transcriptional regulatory motifs. Nucleic Acids Research
32:W204 - W207, 2004.
27.
Naughton B, Fratkin E, Batzoglou S, Brutlag DL. A graph-based motif detection
algorithm models complex nucleotide dependencies in transcription factor binding
sites. Nucleic Acids Reesearch, 34(20): 5730-5739, 2006.
http://motifscan.stanford.edu.
28.
Fratkin E, Naughton B, Brutlag DL, Batzoglou S. MotifCut: Finding Regulatory
Motifs with Maximum Density Subgraphs. Proceedings ISMB2006,
Bioinformatics 22: e150-e157, 2006.
http://motifcut.stanford.edu.
29.
Srinivasan B, Novak A, Flannick J, Batzoglou S, McAdams H. Integrated Protein
Interaction Networks for 11 Microbes. Proceedings of RECOMB 2006, pp. 1–14.
30.
http://networks.stanford.edu.
31.
Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S. Graemlin: general
and robust alignment of multiple large interaction networks. Genome
Research, 16:1169-1181, 2006.
http://graemlin.stanford.edu.
32.
Sundquist A, Fratkin E, Do CB, Batzoglou S. HAPAA: HMM-based analysis of
polymorphisms in admixed ancestries. Submitted.
http://hapaa.stanford.edu.
Posters
Short Read Assembly. A poster by Andreas Sundquist describing
his sequencing protocol and assembly algorithm for reconstructing a whole
mammalian genome from short unpaired reads, such as those that
Pyrosequencing or some other high-throughput sequencing technology may be
able to produce in the near future.
Typhon: a local multiple aligner. A poster by Jason Flannick
describing his method for indexing a multiple alignment for seeded homology
search, and his Typhon tool that implements the method. This was presented
in the Genome Informatics meeting of CSHL in October 2005.
An extended-model DNA alignment system. A poster by
Chuong Do describing an extension that he implemented of the LAGAN pairwise aligner, incorporating a more elaborate alignment model that includes: (1) a double-affine gap model for alignment; (2) a non-conserved state to absorb unalignable sequence; (3) a coding model for translated alignment of protein-coding regions.
Draft alignment. Some of our unpublished work on aligning two genomes that are given in draft assembly form. Eugene Davydov recently proved that this problem is NP-complete.
Application of ICA to microarrays.
A poster by Suin Lee describing our preliminary results of applying independent component analysis to microarrays. It was presented in the May 2003 Genome Informatics CSHL meeting.
LAGAN GUI. Marina Sirota started developing a gui for LAGAN and MLAGAN during Summer 2003. This is her CURIS student poster.