Improving Homology Detection, Gene Binning, and Multiple Sequence Alignment
This project focuses on three computational problems in biomolecular sequence analysis where analyses of large datasets are both necessary and beyond the reach of current methods:
- Metagenomics, which is the analysis of environmental samples using shotgun sequencing data. These samples can be drawn from environments such as the human gut, soil, and oceans, and accurate analyses of them can have a large impact on human health and agriculture, among other practical issues. In addition, the exploration of environmental samples can lead to the discovery and identification of novel species and genes, or more generally to the exploration of Microbial Dark Matter. Since microbes constitute a very large portion of life on Earth, this is also an exploration of biodiversity in different environments, and so is connected to ecological research.
- Phylogenomics, which is the estimation of species phylogenies using multiple genomic regions. A basic analytical challenge is robustness to phylogenies that differ across parts of the genome, due to events such as incomplete lineage sorting and horizontal gene transfer. Phylogenomics focuses on species tree estimation, but most methods depend on multiple sequence alignments and phylogenetic trees for individual genes; hence, these gene-based analyses are also essential problems.
- Proteomics, and in particular the inference of multiple sequence alignments and trees for large numbers of diverse protein sequences. These alignments and trees enable improved estimates of protein structure and function, and so advances in methods for computing them have important downstream consequences.
These three problems are inherently linked. Accurate analyses of metagenomic datasets, and the correct identification of novel species and/or genes, depend directly on having accurate species phylogenies and, in less obvious ways, on having accurate phylogenetic trees and multiple sequence alignments for different genes within the genomes of the different species. Species trees in turn depend on gene trees, which depend on multiple sequence alignments. Thus, progress on the foundational questions of how to align sequences and infer phylogenies leads directly to improved metagenomic analysis. Finally, all these problems depend on statistical methods that tend to be computationally intensive (the underlying optimization problems are often NP-hard), and so require novel algorithmic techniques to enable highly accurate analyses of large datasets. Addressing these linked problems is the aim of this proposal.
Critical improvements in methods developed by the PI's group create opportunities for transformative improvements in accuracy and scalability for these problems. This project will develop new methods that can analyze ultra-large datasets with high accuracy, and will also develop parallel implementations of these methods that can take advantage of the special architecture of Blue Waters. The result will be open-source software that can be used by biologists and clinicians, greatly advancing the state of the art in methods and enabling breakthroughs in biological understanding. The proposed project is a continuation of the current project, with focused attention on improved techniques for metagenomics, in particular on correctly identifying novel species and genes in environmental samples.