Fast Search in DNA Sequence Databases using Punctuation and Indexing

Y. Lu, S. Lu, and J.L. Ram (USA)

Keywords

Algorithm, DNA Sequence, Search, Punctuator, Indexing

Abstract

Exact pattern searching in DNA sequence databases has applications in identifying highly conserved regulatory sequences, designing hybridization probes, and improving the performance of approximate homology search tools such as BLAST and BLAT. We propose a new pattern searching algorithm, Compressed Punctuated-Boyer-Moore (cp-BM), to speed up exact pattern match searches of DNA sequences. cp-BM uses two bits to encode each of the characters A, T, C, and G, so that four characters fit in one 8-bit byte (4C8B compression), and adds punctuator characters that unambiguously indicate the encoding frame of the compressed target sequence. The punctuators solve the misalignment problem that arises when searching for patterns under ordinary 4C8B compression. cp-BM searches DNA patterns at least 6 times faster than AGREP for pattern lengths ≥ 128 and 2 to 5 times faster than d-BM for all pattern lengths. Punctuator indexing and multiple punctuators improve cp-BM's performance further, especially for short sequences, yielding speedups of more than 10-fold over d-BM and AGREP. In addition, cp-BM outperformed BLAT for sequences of 64 or more bases and was more than three times faster for 256-base sequences.
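The misalignment problem mentioned in the abstract can be illustrated with a minimal sketch of plain 4C8B packing. This is not the cp-BM algorithm itself: the bit assignment in `CODE`, the padding rule, the helper name `pack_4c8b`, and the example sequences are all assumptions made for illustration, and no punctuators are modeled.

```python
# Illustrative sketch only: plain 4C8B packing without punctuators.
# The 2-bit code below is an assumed assignment, not taken from the paper.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_4c8b(seq: str) -> bytes:
    """Pack four bases into each byte, first base in the high-order bits.

    A trailing partial group is padded with 'A' (an assumption made here
    so that every byte is fully defined).
    """
    padded = seq + "A" * (-len(seq) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for ch in padded[i:i + 4]:
            byte = (byte << 2) | CODE[ch]
        out.append(byte)
    return bytes(out)


target = "ACGTTACAGATTCCGG"   # hypothetical target sequence
pattern = "TACAGATT"          # occurs in target at offset 4 (a byte boundary)

# Pattern starts on a byte boundary: the packed bytes match directly.
aligned_hit = pack_4c8b(pattern) in pack_4c8b(target)

# Prepending one base shifts the encoding frame by two bits, so the same
# occurrence no longer lines up with the byte boundaries of the packed pattern.
shifted_target = "G" + target
shifted_hit = pack_4c8b(pattern) in pack_4c8b(shifted_target)

print(aligned_hit, shifted_hit)
```

In the shifted case the pattern is still present in the uncompressed text, but a byte-wise comparison of the packed forms misses it. According to the abstract, the punctuator characters are what let cp-BM determine the encoding frame of the compressed target unambiguously.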
