Pangenomes: Developing algorithms and linking to phenotypes

The current trajectory of next-generation sequencing improvements, including falling costs and increased read lengths and throughput, ensures that sequencing multiple genomes per species will become routine within the next decade. This project initiates work on a new generation of bioinformatics software that can exploit the increased information content available from multiple accessions and intelligently use the data for unbiased, species-wide analyses.

We develop pangenomic algorithms and software tools that can scale to complex eukaryotic organisms. These tools allow researchers to study large numbers of genome sequences from a single species to understand the genomic regions responsible for phenotypic adaptations, such as the ability to thrive in different environments. Each individual's genomic sequence corresponds to a path in a graph data structure called a de Bruijn graph. These graphs are large and tangled and can have millions of nodes and edges. Our tool finds hotspots, or frequented regions (FRs), in de Bruijn graphs. FRs represent regions shared across individuals. The tool also finds regions that are not frequented (unique to an individual).
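To make the idea of paths through a de Bruijn graph concrete, here is a minimal Python sketch. It is not the FR-finding algorithm used by our tool. It only counts, node by node, how many individuals' paths visit each k-mer, then separates nodes visited by every individual from nodes visited by exactly one. The function names, the toy sequences, the value of k, and the `min_fraction` threshold are all illustrative assumptions, and the sketch ignores reverse complements and graph compaction.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield the k-mers of seq in left-to-right order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_graph(genomes, k):
    """Build a node-centric de Bruijn graph from named sequences.

    Nodes are k-mers; an edge links each k-mer to the one that follows it.
    Returns (edges, support), where support[node] is the set of genome
    names whose path visits that node.
    """
    edges = defaultdict(set)
    support = defaultdict(set)
    for name, seq in genomes.items():
        path = list(kmers(seq, k))  # this individual's path through the graph
        for node in path:
            support[node].add(name)
        for a, b in zip(path, path[1:]):
            edges[a].add(b)
    return edges, support


def split_by_support(support, n_genomes, min_fraction):
    """Split nodes into widely shared nodes and nodes seen in one genome only."""
    shared = {node for node, names in support.items()
              if len(names) / n_genomes >= min_fraction}
    unique = {node for node, names in support.items() if len(names) == 1}
    return shared, unique


# Toy input (hypothetical sequences, not real data).
genomes = {
    "ind1": "ACGTACGGA",
    "ind2": "ACGTACGTT",
    "ind3": "ACGTACCCA",
}
edges, support = build_graph(genomes, k=4)
shared, unique = split_by_support(support, len(genomes), min_fraction=1.0)
print("shared by all:", sorted(shared))
print("unique to one:", sorted(unique))
```

In this toy input the three sequences share a common prefix and then diverge, so the shared nodes come from the prefix and the unique nodes from each individual's tail. A real frequented region is a connected set of nodes that many paths traverse together, which needs more than per-node counts.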

We are developing software and machine learning techniques that can automatically filter shared and unique regions in a pangenome to identify the most interesting candidate regions. These tools will help researchers to discover regions that are conserved across evolutionary space, regions that are novel, regions that have diverged due to positive selection, and regions coding for phenotypic differences across the population.

Algorithms and software tools are available at


About NCGR

The National Center for Genome Resources is a not-for-profit research institute that innovates, collaborates, and educates in the field of genomic data science. As leaders in DNA sequence analysis, we partner with government, industry, and academia to drive biological discovery in all kingdoms of life. We deliver value through expertise in experimental design, software, computation, data integration, and training a skilled workforce.



© 2021 National Center for Genome Resources.