Advancing Large-Scale Multiple Sequence Alignment through Divide-and-Conquer
Tandy Warnow, University of Illinois at Urbana-Champaign
Usage Details
Jian Peng, Tandy Warnow, Vladimir Smirnov, Paul Zaharias, Minhyuk Park
Multiple sequence alignment (MSA) is a major problem in biology, and its output feeds many downstream analyses, including phylogeny estimation and the prediction of protein structure and function. Yet most MSA methods have limited scalability, and their error increases with the size of the dataset. In this project, we will work towards improving the accuracy and scalability of MSA estimation methods. Our first approach is to develop and improve methods based on Ensembles of Profile Hidden Markov Models (HMMs), a novel technique that was developed in the Warnow lab and has been shown to be more accurate than single profile HMMs. Peng and Warnow were recently awarded a three-year grant from the NSF BIO directorate to support the development of this Ensemble of Profile HMMs approach and to extend it to the prediction of protein structure and function. Our second approach is MAGUS, a novel divide-and-conquer method developed by PhD student Vladimir Smirnov and recently accepted for publication in Bioinformatics. Both the Ensemble of Profile HMMs and MAGUS employ divide-and-conquer, which makes them well suited to large-scale problems and to parallel computing. Here we request an allocation on Blue Waters to enable this research, and specifically to explore the scalability of these methods on large and ultra-large datasets. This work will form the major part of Vladimir Smirnov's PhD dissertation.