A NOVEL APPROACH FOR COMPRESSING DNA SEQUENCES USING SEMI-STATISTICAL COMPRESSOR

Ashutosh Gupta and Suneeta Agarwal

References

  1. [1] S. Grumbach and F. Tahi, Compression of DNA sequences, DCC, 1993, 340–350.
  2. [2] A. Gupta, V. Rishiwal, and S. Agarwal, Efficient storage of massive biological sequences in compact form, Proc. of 3rd Intl. Conf. Contemporary Computing-Part II, 2010 Communications in Computer and Information Science Series, Springer, Noida, India, 95, Aug 9–11, 2010.
  3. [3] N. Kamel, Panel: Data and knowledge bases for genome mapping: What lies ahead?, Proc. Intl. Very Large Databases, 1991.
  4. [4] L. Stern, L. Allison, R.L. Coppel, and T.I. Dix, Discovering patterns in Plasmodium falciparum genomic DNA, Molecular & Biochemical Parasitology, 118, 2001, 175–186.
  5. [5] D.R. Powell, L. Allison, and T.I. Dix, Modelling-alignment for non-random sequences, Advances in Artificial Intelligence, 2004, 203–214.
  6. [6] X. Chen, S. Kwong, and M. Li, A compression algorithm for DNA sequences and its applications in genome comparison, RECOMB, 2000, 107.
  7. [7] S. Grumbach and F. Tahi, A new challenge for compression algorithms: genetic sequences, Information Processing & Management, 30(6), 1994, 875–886.
  8. [8] T. Matsumoto, K. Sadakane, and H. Imai, Biological sequence compression algorithms, Genome Informatics, 11, 2000, 43–52.
  9. [9] D. Adjeroh and F. Nan, On compressibility of protein sequences, DCC, 2006, 422–434.
  10. [10] D.M. Boulton and C.S. Wallace, The information content of a multistate distribution, Theoretical Biology, 23(2), 1969, 269–278.
  11. [11] J.G. Cleary and I.H. Witten, Data compression using adaptive coding and partial string matching, IEEE Transactions on Communications, COM-32(4), 1984, 396–402.
  12. [12] T.I. Dix, D.R. Powell, L. Allison, S. Jaeger, J. Bernal, and L. Stern, Exploring long DNA sequences by information content, Probabilistic Modeling and Machine Learning in Structural and Systems Biology, Workshop Proc, 2006, 97–102.
  13. [13] T.I. Dix, D.R. Powell, L. Allison, S. Jaeger, J. Bernal, and L. Stern, Comparative analysis of long DNA sequences by per element information content using different contexts, BMC Bioinformatics, 2007.
  14. [14] A. Hategan and I. Tabus, Protein is compressible, NORSIG, 2004, 192–195.
  15. [15] C.G. Nevill-Manning and I.H. Witten, Protein is incompressible, DCC, 1999, 257–266.
  16. [16] E. Rivals, J.-P. Delahaye, M. Dauchet, and O. Delgrange, A guaranteed compression scheme for repetitive DNA sequences, DCC, 1996, 453.
  17. [17] A. Gupta and S. Agarwal, A scheme that facilitates searching and partial decompression of textual documents, International Journal of Advanced Computer Engineering, 1(2), 2008, 99–109.
  18. [18] M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and its Applications (Springer Verlag, New York, 1993).
  19. [19] T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression (Prentice Hall, Englewood Cliffs, NJ, 1990).
  20. [20] I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images (Morgan Kaufmann, San Francisco, 1999).
  21. [21] A. Gupta and S. Agarwal, Transforming the natural language text for improving compression performance, Lecture Notes in Electrical Engineering, Trends in Intelligent Systems and Computer Engineering (ISCE), Springer, 6, 2008, 637–644.
  22. [22] A. Gupta and S. Agarwal, A novel approach of data compression for dynamic data, Proc. of IEEE 3rd Int. Conf. System of Systems Engineering, California, USA, 2–4 June 2008.
  23. [23] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23(3), 1977, 337–342.
  24. [24] J. Ziv and A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Transactions on Information Theory, 24(5), 1978, 530–536.
  25. [25] F. Rubin, Experiments in text file compression, Communications of the ACM, 19(11), 1976, 617–623.
  26. [26] J.G. Wolff, Recoding of natural language for economy of transmission or storage, The Computer Journal, 21(1), 1978, 42–44.
  27. [27] J.A. Storer and T.G. Szymanski, Data compression via textual substitution, Journal of the Association for Computing Machinery, 29(4), 1982, 928–951.
  28. [28] J.G. Cleary and W.J. Teahan, Unbounded length contexts for PPM, The Computer Journal, 40(2/3), 1997, 67–75.
  29. [29] M. Burrows and D.J. Wheeler, A block sorting lossless data compression algorithm, Technical Report, Digital Equipment Corporation, Palo Alto, CA, 1994.
  30. [30] P. Fenwick, The Burrows-Wheeler Transform for block sorting text compression, The Computer Journal, 39(9), 1996, 731–740.
  31. [31] A. Apostolico and S. Lonardi, Compression of biological sequences by greedy off-line textual substitution, DCC, 2000, 143–152.
  32. [32] X. Chen, M. Li, B. Ma, and J. Tromp, DNACompress: fast and effective DNA sequence compression, Bioinformatics, 18(12), 2002, 1696–1698.
  33. [33] B. Behzadi and F.L. Fessant, DNA compression challenge revisited: a dynamic programming approach, CPM, 2005, 190–200.
  34. [34] F.M.J. Willems, Y.M. Shtarkov, and T.J. Tjalkens, The context-tree weighting method: Basic properties, IEEE Transactions on Information Theory, 41(3), 1995, 653–664.
  35. [35] I. Tabus, G. Korodi, and J. Rissanen, DNA sequence compression using the normalized maximum likelihood model for discrete regression, DCC, 2003, 253.
  36. [36] G. Korodi and I. Tabus, An efficient normalized maximum likelihood algorithm for DNA sequence compression, ACM Transactions on Information Systems, 23(1), 2005, 3–34.
  37. [37] D. Loewenstern and P.N. Yianilos, Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6(1), 1999, 125–142.
  38. [38] L. Allison, T. Edgoose, and T.I. Dix, Compression of strings with approximate repeats, ISMB, 1998, 8–16.
  39. [39] I.H. Witten, R.M. Neal, and J.G. Cleary, Arithmetic coding for data compression, Communications of the ACM, 30(6), 1987, 520–540.
  40. [40] T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression (Prentice Hall, Englewood Cliffs, NJ, 1990).
  41. [41] E.S. de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates, Fast and flexible word searching on compressed text, ACM Transactions on Information Systems, 18(2), 2000, 113–139.
  42. [42] A. Moffat, Word based text compression, Software: Practice and Experience, 19(2), 1989, 185–198.
  43. [43] O. Bat, M. Kimmel, and D.E. Axelrod, Computer simulation of expansions of DNA triplet repeats in the Fragile-X Syndrome and Huntington’s disease, Journal of Theoretical Biology, 188, 1997, 53–67.
  44. [44] M.D. Cao, T.I. Dix, L. Allison, and C. Mears, A simple statistical algorithm for biological sequence compression, Data Compression Conference, 2007, 43–52.
  45. [45] D. Loewenstern and P.N. Yianilos, Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6(1), 1999, 125–142.
  46. [46] G. Manzini and M. Rastero, A simple and fast DNA compressor, Software: Practice and Experience, 34(14), 2004, 1397–1411.