Large-scale gene tree and species tree estimation using Blue Waters

Tandy Warnow, University of Illinois at Urbana-Champaign

Erin Molloy, Tandy Warnow, Pranjal Vachaspati, Thien Le, Vladimir Smirnov

Phylogenies, also called “evolutionary trees,” are graphical models of how a set of species or genes evolved from a common ancestor. Inferring these phylogenies from genomic data is a basic step in answering many biological questions, for example, how species adapt to their environments, how genes co-evolve, and how humans migrated across the globe. Furthermore, phylogenies are critical for applied research, including protein function prediction, drug design, vaccine design, and taxonomic identification of bacteria and viruses in the human gut (with applications in human health), the soil (with applications in agriculture), and the air (with applications in biodefense).

Although large phylogenies are an important tool, building them from genomic data is challenging, as the best methods attempt to solve NP-hard problems and cannot scale to large datasets. With previous Blue Waters allocations, we have developed techniques that enable highly accurate estimation of multiple sequence alignments, gene trees, and genome-scale species trees, as well as two divide-and-conquer techniques that enable highly accurate methods to scale to large datasets.

Using this allocation, we propose three activities: (1) further improve the divide-and-conquer techniques that allow phylogeny estimation methods to scale to large datasets, (2) implement the best of these methods for distributed-memory parallel computing, and (3) use the methods to construct large phylogenetic trees for different biological datasets. More specifically, we propose to build a large bacterial phylogeny and a large insect phylogeny, potentially yielding new evolutionary trees that could form the basis of new biological discoveries.

All datasets and constructed phylogenies will be made publicly available to the research community. In addition, the open-source software developed to build these large phylogenies will be made freely available so that researchers can use it to analyze new phylogenomic datasets, enabling breakthroughs in biological understanding.