Optimizing a distributed-memory parallel code for constructing ultra-large phylogenetic trees on Blue Waters
Bill Gropp, University of Illinois at Urbana-Champaign
Usage Details
Bill Gropp, Erin Molloy
The organization of molecular sequences into evolutionary trees, or phylogenies, enables scientists to classify fragmentary molecular sequences from environmental samples and to identify previously unknown microbes. The leading approaches to phylogenetic inference require a multiple sequence alignment to be estimated on the full dataset, and so do not scale to datasets with very large numbers of sequences.
This past year, we developed a new divide-and-conquer approach, called TERADACTAL, which bypasses the creation of a multiple sequence alignment on the full dataset and enables the construction of ultra-large phylogenetic trees on supercomputers. Currently, there is no practical de novo method for building evolutionary trees on millions of sequences, so an optimized version of TERADACTAL will be a major step toward constructing the Tree of Life, a grand challenge in evolutionary biology. This allocation will be used to optimize a phase of TERADACTAL that currently requires all-to-all communication.
In particular, we are interested in testing approaches to avoid this all-to-all communication phase without reducing the accuracy of TERADACTAL. We are also interested in studying how subset size (varying from 200 to 5,000 sequences) impacts communication as well as method accuracy.