Pan-genomic analysis of complex human diseases

The overall goal of this project is to develop a pan-genomic analysis tool that uses Frequented Regions (FRs) and machine learning to classify disease morbidity in human genomes. To date, the primary tool for the genomic study of diseases has been the genome-wide association study (GWAS), in which the segregation of specific alleles, usually single-nucleotide polymorphisms (SNPs), between affected and unaffected individuals is tested for association with the disease of interest. This method works well for identifying isolated variants associated with a condition, but it does not link those variants into combinations that may be even more strongly associated with the condition. In addition, GWAS tends to focus on SNPs and therefore gives less attention to structural variants. GWAS is also usually performed on variants called against the human reference genome, and is therefore biased toward that reference.
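To make the single-variant nature of GWAS concrete, the sketch below scores one variant at a time with a Pearson chi-square statistic on a 2x2 table of allele counts. It is only an illustration: the function name and the counts are hypothetical, and real GWAS pipelines use more elaborate models and corrections.

```python
def chi_square_2x2(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Rows are cases and controls; columns are alternate and reference alleles.
    Each variant is scored on its own, so combinations of variants are never
    considered, which is the limitation described in the text.
    """
    n = case_alt + case_ref + control_alt + control_ref
    row_cases = case_alt + case_ref
    row_controls = control_alt + control_ref
    col_alt = case_alt + control_alt
    col_ref = case_ref + control_ref
    # An empty row or column carries no information about association.
    if 0 in (row_cases, row_controls, col_alt, col_ref):
        return 0.0
    diff = case_alt * control_ref - case_ref * control_alt
    return n * diff ** 2 / (row_cases * row_controls * col_alt * col_ref)


# Hypothetical allele counts for a single variant.
statistic = chi_square_2x2(case_alt=30, case_ref=70, control_alt=10, control_ref=90)
print(statistic)  # 12.5
```

A larger statistic indicates a stronger difference in allele frequency between cases and controls for that one variant.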

Given the prevalence of complex heritable diseases and the need to better understand their genomic origin in order to improve treatments, investigation of new analysis techniques is well justified. The tool proposed here combines two new analysis concepts: pan-genomic graphs, which represent individuals’ genomes as paths through a graph of DNA sequence nodes; and Frequented Regions, a novel way of describing genomic variation within a pan-genomic graph. We combine these two concepts with the growing field of machine learning to produce a supervised classification algorithm for human diseases.
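The following sketch illustrates how these pieces could fit together. It is not the project's implementation: the support rule (a path "supports" a region if it visits at least a fraction `alpha` of the region's nodes) is a simplified stand-in for the FR definition, and the node IDs, sample names, and function names are all hypothetical. Each genome is a path (a list of node IDs), each candidate region is a set of nodes, and the output is one 0/1 feature per region that a supervised classifier could be trained on.

```python
from typing import Dict, List, Set


def supports(path: List[int], region: Set[int], alpha: float) -> bool:
    """Return True if the path visits at least a fraction alpha of the region's nodes.

    Simplifying assumption: only node membership is checked, not the order
    or spacing of the visits along the path.
    """
    if not region:
        return False
    visited = region.intersection(path)
    return len(visited) / len(region) >= alpha


def region_features(
    paths: Dict[str, List[int]], regions: List[Set[int]], alpha: float = 0.8
) -> Dict[str, List[int]]:
    """Build one 0/1 feature per region for every sample path."""
    return {
        sample: [int(supports(path, region, alpha)) for region in regions]
        for sample, path in paths.items()
    }


# Hypothetical toy pan-genome: four genomes as paths through node IDs.
paths = {
    "case1": [1, 2, 4, 5],
    "case2": [1, 2, 4, 6],
    "ctrl1": [1, 3, 4, 5],
    "ctrl2": [1, 3, 4, 6],
}
# Hypothetical candidate regions (sets of graph nodes).
regions = [{1, 2, 4}, {3, 4}]

features = region_features(paths, regions, alpha=0.8)
for sample, vector in features.items():
    print(sample, vector)
```

In this toy example both case paths support the first region and both control paths support the second, so the feature vectors separate the two groups. With real data, the labeled vectors would be passed to a classifier of choice.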

Our approach offers several important improvements to the genomic analysis of disease: (1) a pan-genomic approach is not biased toward the human reference if the graph is constructed strictly from individuals’ DNA; (2) FRs are well-suited to the study of complex diseases, since they represent arbitrary genomic structures in the graph; (3) FRs are sensitive to any type of variation, since they are arbitrary clusters of DNA sequence; and (4) our approach is sensitive to the entire genome if the pan-genome is built from whole genome sequencing (WGS) reads.

To build this tool, we will employ a highly parallel GPU-based computational strategy to handle the vast amount of data in a pan-genome representing the DNA of hundreds or thousands of individuals.

Although it carries a substantial risk of failure, our project, if successful, has the potential to greatly enhance human disease studies by providing a distinct and complementary method.

© 2019 National Center for Genome Resources.