A reference genome library

By Shaun Bushman, USDA-ARS

Much like a library holds many kinds of books and a vast amount of information, a ‘reference genome’ is a library of all the sequences in a genome. Plant genomes contain tens of thousands of expressed genes, sequence motifs that mark telomeres and centromeres, large swaths of repeat regions, and other pieces of DNA that do not code for proteins but tend to affect gene function. Recent improvements in DNA sequencing have allowed us to sequence the genome of hard fescue (Festuca brevipila). Hard fescue is a hexaploid with three diploid subgenomes, each of which carries a full complement of the genes necessary for plant function. This three-fold redundancy likely contributes to hard fescue’s growth and adaptation. By sequencing its full genome, we now have a reference library of the redundant genes and other DNA regions that contribute to that growth and function.

Grass plants evenly spaced in research plots
Hard fescue research plots at the University of Minnesota. Photo by Gary Deters.

Sequencing a genome is a complicated matter in grasses. Their genome size, the redundancy mentioned above, and their heterozygosity combine to make it a challenge to piece individual sequence reads together into chromosome-sized pieces. However, new algorithms and sequencing methods from PacBio and others have helped tremendously in this process. We sequence millions of individual pieces of DNA, each 15,000 to 20,000 nucleotides in length, then stitch them together into large scaffolds that approximate chromosomes. By all comparisons done thus far, over 95% of all the genes are accounted for in our hard fescue reference library.
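To give a feel for what "stitching" reads together means, here is a toy sketch in Python. It is not the pipeline used for hard fescue: real long-read assemblers tolerate sequencing errors and handle repeats and heterozygosity, while this example simply merges error-free reads that share an exact overlap. The function names (`overlap`, `greedy_stitch`) and the short reads are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Return the length of the longest suffix of `a` that exactly
    matches a prefix of `b`, or 0 if it is shorter than `min_len`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_stitch(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap.

    Toy illustration only: assumes error-free reads from a single strand.
    Returns the list of merged pieces left when no overlaps remain.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up reads that overlap end to end.
example_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_stitch(example_reads))  # ['ATGGCGTACCTGA']
```

In a real genome, repeated and near-identical regions mean that many reads overlap equally well in more than one place, which is one reason the redundancy described above makes assembly difficult.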

With a reference library in hand, what can be done better? First of all, we can sort out the three versions of any given gene, which come from the three diploid subgenomes, and that is critical. Imagine finding a gene that helps to confer resistance to a disease, only to realize that there are three versions and you do not know which one may be affecting your trait. By knowing their respective locations, and the small sequence differences between them, we can tease out that knowledge. That was nearly impossible before the advent of a reference genome. The location of genes, the differences between genes, and the regulation of those genes are all part of the reference. The reference is the first step in finding associations between gene mutations and the traits that we care about.
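As a small illustration of how "small sequence differences" can tell the three versions of a gene apart, the Python sketch below lists the positions where three aligned copies differ. The sequences and the names `subgenome_1` to `subgenome_3` are invented for the example and are not hard fescue data; the sketch also assumes the copies have already been aligned to the same length, which in practice is done with alignment software.

```python
def variant_sites(copies):
    """List positions (1-based) where aligned gene copies differ.

    `copies` maps a hypothetical copy name to its aligned sequence.
    All sequences must have the same length.
    """
    lengths = {len(seq) for seq in copies.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")

    sites = []
    for pos in range(lengths.pop()):
        bases = {name: seq[pos] for name, seq in copies.items()}
        if len(set(bases.values())) > 1:  # at least one copy differs here
            sites.append((pos + 1, bases))
    return sites


# Made-up aligned fragments of one gene, one per subgenome.
gene_copies = {
    "subgenome_1": "ATGGCTTACGA",
    "subgenome_2": "ATGGCATACGA",
    "subgenome_3": "ATGGCTTACGG",
}

for position, bases in variant_sites(gene_copies):
    print(position, bases)
```

Here position 6 is unique to the second copy and position 11 to the third, so a sequence read carrying either difference can be assigned to its subgenome of origin.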