Sequence Similarity Networks for the Protein Universe

John Gerlt, University of Illinois at Urbana-Champaign

John Gerlt, Daniel Davidson, Ken Yokoyama

The focus of the Enzyme Function Initiative (EFI, enzymefunction.org) is to devise bioinformatic tools and databases that facilitate the assignment of functions to uncharacterized enzymes discovered in genome projects. To help achieve that goal, this project will use the Blue Waters petascale computing facility to precompute libraries of sequence similarity networks (SSNs) for the protein universe and make them available to the community. The SSNs will be used to 1) analyze sequence-function relationships in large protein families and 2) generate strategies to assign functions to uncharacterized members of those families. The libraries of SSNs will be updated regularly and disseminated to the scientific community via a website maintained by the EFI (http://efi.igb.illinois.edu/est-precompute/).

This project will use Blue Waters to enable large-scale generation of libraries of sequence similarity networks (SSNs) for the complete set of protein families and superfamilies (the 14,831 homologous sequence-based families described by Pfam and the 2,738 homologous structure-based superfamilies described by Gene3D/CATH), thereby providing the community with easy access to sequence-function space. A conservative estimate is that no more than 50% of the proteins discovered in genome projects have reliable functional annotations that the community can access through the UniProtKB database. Without reliable functional annotations for all proteins, the biomedical, pharmaceutical, and commercial potential of genome sequences cannot be realized. The EFI's goal is to provide tools and strategies that will allow the community to efficiently determine the functions of uncharacterized enzymes. The library of SSNs to be generated and regularly updated using Blue Waters will be an invaluable bioinformatics resource for focusing and guiding those experimental efforts.