A NOVEL APPROACH FOR COMPRESSING DNA SEQUENCES USING SEMI-STATISTICAL COMPRESSOR

doi:10.2316/Journal.202.2011.3.202-3114

Ashutosh Gupta and Suneeta Agarwal


From Journal (202) International Journal of Computers and Applications - 2011
