Compression of Pangenomic DNA Sequence Data

DNA is being sequenced at an unprecedented pace and scale. Consequently, two problems afflict numerous fields across science and industry: a lack of methods for efficiently storing and transmitting large collections of DNA sequence data, and a lack of representations that facilitate their analysis at scale. The former increasingly burdens researchers with monetary and time costs, while the latter stymies innovation and insight in fields ranging from agriculture to medicine. This project aims to address both problems simultaneously by leveraging the proliferation of pangenomes, collections of DNA sequence data drawn from a single species, to develop new compression methods that enable analysis directly on the compressed collections. The primary goal of this project is to provide a software toolkit that researchers can use to compress and analyze large collections of DNA sequence data, and to update the compressed collections over time as new DNA sequence data are generated and existing data are revised.



