Statistical Phylogeny Estimation on Large Heterogeneous Datasets
Phylogenies, also called evolutionary trees, are graphical models of how a set of species or genes evolved
from a common ancestor. The inference of these phylogenies from genomic data is a basic step in answering
many biological questions. Although large phylogenies are an important tool, building them from
genomic data is challenging because the most accurate methods attempt to solve NP-hard optimization problems
(typically maximum likelihood under sequence evolution models) and cannot scale to large datasets.
Under our prior Blue Waters allocations, we developed techniques that enable highly accurate estimation of multiple
sequence alignments, gene trees, and genome-scale species trees, as well as two divide-and-conquer techniques
that enable highly accurate methods to scale to large datasets. The divide-and-conquer methods have been
shown to improve scalability, reduce running time, and maintain accuracy for species tree estimation from
multi-locus datasets, and are very promising. However, recent studies have demonstrated that biological
sequence evolution is neither stationary nor homogeneous, and that maximum likelihood methods under
standard sequence evolution models (which are typically both stationary and homogeneous) can produce
inaccurate alignments and trees.
In this next Blue Waters allocation, we propose three activities: (1) further improve the divide-and-conquer
techniques for scaling phylogeny estimation methods to large datasets, (2) adapt a
recent deep learning approach for phylogeny estimation to enable estimation under non-stationary and
non-homogeneous sequence evolution models, and (3) implement the best of these methods for distributed-memory
parallel computing and compare them with respect to accuracy and scalability.