MDscan

A Fast and Accurate Motif Finding Algorithm With Applications To Chromatin Immunoprecipitation Microarray Experiments

Xiaole Shirley Liu, Douglas L. Brutlag, Jun S. Liu
Stanford Medical Informatics, Stanford University

Chromatin immunoprecipitation followed by cDNA microarray analysis (ChIP-on-chip) has become a popular procedure for studying genome-wide protein-DNA interactions and transcription regulation, but it can map probable protein-DNA interaction loci only to within 1-2 kb. To pinpoint the interaction sites down to the base-pair level, we introduce a novel computational method, Motif Discovery scan (MDscan), that examines the ChIP-array-selected sequences and searches for DNA sequence motifs representing the protein-DNA interaction sites. MDscan combines the advantages of two widely adopted motif search strategies, word enumeration and position-specific weight matrix updating, and incorporates the ChIP enrichment information to accelerate the search and enhance its success rate. The intuition is to search first for similar words in the sequences most likely to contain the motif (the highly ChIP-enriched sequences), because these sequences have a higher signal-to-noise ratio. The words in each similarity group initialize a position-specific motif matrix, which is then updated and refined using the full set of input sequences (all ChIP-selected targets). The method showed both speed and accuracy advantages over several established motif-finding algorithms in both simulations and published yeast ChIP-on-chip experiments. MDscan can be used not only with ChIP experiments but also to find DNA motifs in other experiments in which a subgroup of the sequences can be inferred to contain relatively more abundant motif sites.
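
The word-enumeration seeding step described above can be sketched roughly as follows. This is a minimal illustration, not the MDscan implementation: the function names, the m-of-w match threshold, and the use of similarity-group size as the seed score are all assumptions made for this example (the abstract does not specify MDscan's actual scoring function), and the refinement pass over all input sequences is omitted.

```python
from collections import Counter


def wmers(seq, w):
    """Yield every length-w word in seq, left to right."""
    for i in range(len(seq) - w + 1):
        yield seq[i:i + w]


def matches(a, b):
    """Number of positions at which two equal-length words agree."""
    return sum(x == y for x, y in zip(a, b))


def seed_matrix(top_seqs, seed, m):
    """Build a per-position base-count matrix from the similarity group of seed.

    The group is every word in top_seqs that agrees with seed at >= m positions
    (the threshold m is an assumption of this sketch).
    Returns (counts, group_size), where counts[i] is a Counter of bases at column i.
    """
    w = len(seed)
    counts = [Counter() for _ in range(w)]
    group_size = 0
    for seq in top_seqs:
        for word in wmers(seq, w):
            if matches(seed, word) >= m:
                group_size += 1
                for pos, base in enumerate(word):
                    counts[pos][base] += 1
    return counts, group_size


def best_seed(top_seqs, w, m):
    """Try every word in the top sequences as a seed; keep the largest group.

    Group size is a stand-in score used only for illustration.
    Ties keep the seed encountered first.
    """
    best = None
    for seq in top_seqs:
        for seed in wmers(seq, w):
            counts, size = seed_matrix(top_seqs, seed, m)
            if best is None or size > best[1]:
                best = (seed, size, counts)
    return best


# Hypothetical "highly enriched" sequences, invented for this example.
top_seqs = ["AATGACTCATT", "CCTGACTCAGG", "GGTGAGTCATC"]
seed, size, counts = best_seed(top_seqs, w=7, m=6)
print(seed, size)
```

In a fuller version, the count matrix from the best seeds would be converted to a position-specific weight matrix and then refined by rescanning all ChIP-selected sequences, as the abstract describes.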

We recently developed a new program, Motif Regressor, that makes better use of mRNA expression levels or ChIP-on-chip enrichment information to improve the performance of MDscan. Motif Regressor first identifies a set of non-redundant candidate motifs using MDscan, then scans the promoter region of every gene in the genome with each candidate motif to measure how well the promoter matches the motif (in terms of both the number of sites and the strength of matching). It then uses linear regression analysis to select motifs whose promoter matching scores are significantly correlated with ChIP-on-chip enrichment or downstream gene expression values. By ranking motifs by linear regression p-value, Motif Regressor automatically picks the best motif and the optimal motif width. Because it is computationally intensive, Motif Regressor is not currently available as a web server. However, interested users can download the program and run it locally: http://www.math.umass.edu/~conlon.mr.html
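
The scan-then-regress idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not Motif Regressor's code: the promoter score used here (log2 of the summed motif-to-background likelihood ratios over all windows) is one plausible way to reflect both the number of sites and the strength of matching, the uniform background of 0.25 is assumed, and all function names are hypothetical. Motifs are ranked by the t-statistic of the regression slope; converting that to a p-value requires a t distribution and is left out to keep the example dependency-free.

```python
import math


def promoter_score(seq, pwm, background=0.25):
    """Score how well a promoter matches a motif.

    pwm is a list of dicts mapping base -> probability (assumed nonzero,
    e.g. after adding pseudocounts). The score is log2 of the sum, over all
    windows, of the motif-to-background likelihood ratio (an assumed form).
    """
    w = len(pwm)
    if len(seq) < w:
        raise ValueError("sequence is shorter than the motif")
    total = 0.0
    for i in range(len(seq) - w + 1):
        ratio = 1.0
        for pos, base in enumerate(seq[i:i + w]):
            ratio *= pwm[pos][base] / background
        total += ratio
    return math.log2(total)


def slope_t_statistic(x, y):
    """Simple least-squares fit of y on x; returns (slope, t-statistic of slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:  # all scores identical: the motif carries no information
        return 0.0, 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    if rss == 0:  # perfect fit
        return slope, math.inf
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err


def rank_motifs(motifs, promoters, enrichment):
    """Rank candidate motifs by |t| of score-versus-enrichment regression."""
    ranked = []
    for name, pwm in motifs.items():
        scores = [promoter_score(seq, pwm) for seq in promoters]
        slope, t = slope_t_statistic(scores, enrichment)
        ranked.append((name, slope, t))
    return sorted(ranked, key=lambda r: abs(r[2]), reverse=True)
```

Because candidate motifs of different widths are scored against the same enrichment values, ranking them together in this way is what lets a single regression criterion choose both the motif and its width.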

Obtaining a local copy of MDscan:
MDscan is free of charge for academic use. For details, please see:
Brutlag Bioinformatics Group Software Download and
Academic License Instructions.

Reference:
Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat Biotechnol. 2002;20(8):835-9.