Blue Waters User Portal | Science Teams

Sequence Similarity Networks for the "Protein Universe"

John Gerlt, University of Illinois at Urbana-Champaign

Gloria Rendon, Nils Oberg, Liudmila Mainzer, John Gerlt, Daniel Davidson, Ken Yokoyama

The sequences for 88,588,026 proteins are available in the UniProt database (Release 2017_07; http://www.uniprot.org/); the number of sequences is increasing at a rate of 2.4%/month (doubling time of about 2.5 years). However, ~50% of the proteins have incorrect, uncertain, or unknown functions. This "explosion" in sequence data provides exciting opportunities for the biological community, but the data must be easily accessible so that experimental biologists, not just bioinformaticians, can use this information to devise experimental strategies for assigning functions to the uncharacterized proteins discovered in genome projects. A starting point is to place the uncharacterized proteins in the context of sequence-function space within the 16,712 protein families curated in Release 31.0 of the Pfam database, thereby providing clues for predicting their functions. To accomplish this task, we developed a community-accessible web server (Enzyme Function Initiative-Enzyme Similarity Tool, EFI-EST) to generate sequence similarity networks (SSNs) for protein families. SSNs allow the user to visualize the pairwise sequence relationships between members of a family and to segregate the family into isofunctional groups.
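The growth figures quoted above can be checked with a short calculation. The sketch below assumes the 2.4%/month rate compounds monthly, which the text does not state explicitly; under that assumption the doubling time is about 29 months (roughly 2.4 years), consistent with the rounded figure of 2.5 years.

```python
import math

# Assumption: the 2.4%/month growth rate compounds monthly.
monthly_growth = 0.024

# Solve (1 + r)^t = 2 for t, the number of months needed to double.
doubling_months = math.log(2) / math.log(1 + monthly_growth)
doubling_years = doubling_months / 12

print(f"doubling time: {doubling_months:.1f} months ({doubling_years:.2f} years)")
```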

The goal of this project is to develop a strategy and implement a pipeline for generating a frequently updated database (every eight weeks, with each release of the InterPro database) of SSNs for all Pfam families. The process for generating an SSN is straightforward: a pairwise sequence comparison using BLAST (because it is fast and familiar to biologists), followed by filtering the results with a user-specified sequence identity threshold. Rapid generation of the SSN database for this large number of protein families, many with a very large number of sequences, requires embarrassingly parallel calculations that can be performed on Blue Waters. The SSN database will be disseminated locally so that it is accessible to users of EFI-EST. We will also make the database available to the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) for inclusion in their Enzyme Portal. The SSN database will establish Illinois as a world-class center for bioinformatics.
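The filtering step described above can be illustrated with a minimal sketch. It is not the EFI-EST implementation: the function name `build_ssn`, the tuple layout of the hits, and the sample data are hypothetical, and the pairwise comparison is assumed to have already been run so that each hit carries a percent identity. Hits below the threshold are dropped, the rest become edges, and connected components stand in for the candidate isofunctional groups.

```python
from collections import defaultdict


def build_ssn(hits, min_identity):
    """Filter pairwise hits by percent identity and group sequences.

    hits: iterable of (query_id, subject_id, percent_identity) tuples
          (a hypothetical layout for already-computed pairwise results).
    Returns (edges, clusters), where clusters are connected components.
    """
    nodes = set()
    edges = set()
    neighbors = defaultdict(set)

    for query, subject, identity in hits:
        nodes.update((query, subject))
        # Skip self-hits and pairs below the user-specified threshold.
        if query == subject or identity < min_identity:
            continue
        edges.add(tuple(sorted((query, subject))))
        neighbors[query].add(subject)
        neighbors[subject].add(query)

    # Depth-first search to collect connected components.
    clusters = []
    seen = set()
    for node in sorted(nodes):
        if node in seen:
            continue
        seen.add(node)
        stack = [node]
        component = []
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in neighbors[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))

    return sorted(edges), clusters


# Made-up example data, for illustration only.
hits = [
    ("P1", "P2", 92.0),
    ("P2", "P3", 88.5),
    ("P3", "P4", 41.0),
    ("P4", "P5", 95.0),
    ("P1", "P1", 100.0),
]
edges, clusters = build_ssn(hits, min_identity=60.0)
print(edges)
print(clusters)
```

Raising `min_identity` splits a family into more, smaller clusters, while lowering it merges them. Because each family is processed independently of the others, one such call per Pfam family can be run as a separate job, which is what makes the workload embarrassingly parallel.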