Computational mapping of DNA-binding protein affinity landscapes
Transcription factor-DNA interactions play a central role in the regulatory networks of virtually all organisms. Knowing where a particular transcription factor (TF) binds in the genome is essential for understanding its role in the transcriptional regulatory network, so mapping or predicting the binding sites of a wide array of transcription factors is a major goal of modern molecular biology. At present, researchers can use a simple model of a transcription factor's affinity for different DNA sequences (typically a position weight matrix, or PWM) to predict its binding sites throughout the genome; however, the PWMs themselves must be obtained through laborious and expensive experiments. Recent computational efforts to predict DNA-binding affinity landscapes from atomistic molecular dynamics simulations have shown early promise, but their computational cost has limited them to testing a small number of changes to the binding site.
Using atomistic molecular dynamics simulations of transcription factor-DNA complexes, we will computationally map the DNA-binding affinity landscapes of a set of four human transcription factors. The key quantity to be calculated is the change in the binding free energy of the transcription factor to the DNA as a function of changes in DNA sequence. For this purpose we will employ a novel method for non-equilibrium free energy calculations based on the Crooks-Gaussian intersection (CGI). CGI has previously been shown to yield excellent results in tests on a limited set of DNA sequence changes. In addition to its demonstrated accuracy, CGI has the benefit that most simulation time is spent sampling the (physically correct) endpoint states of the calculation rather than non-physical intermediates; thus, besides obtaining information on the changes in binding affinity, we will be able to gain insight into the biophysical basis of each transcription factor's interactions with a variety of sequences. Our new method, multistate Crooks-Gaussian intersection (mCGI), builds on the strengths of CGI while generalizing the approach to more complex biomolecules with many energetically accessible structures.
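As an illustration of the basic CGI estimate (not of mCGI, whose details are not given here), the following minimal Python sketch fits Gaussians to the forward work values and to the negated reverse work values from non-equilibrium switching runs, and takes the free energy difference as the point where the two fitted densities cross. The function names, the sign convention for the reverse work, and the rule for choosing between the two crossing points are assumptions made for this sketch.

```python
import math
import statistics


def gaussian_intersection(m1, s1, m2, s2):
    """Crossing point of two Gaussian densities N(m1, s1) and N(m2, s2).

    Setting the log-densities equal gives a*x^2 + b*x + c = 0.
    When two roots exist, the one closest to the midpoint of the means
    is returned (a heuristic assumed for this sketch).
    """
    a = 1.0 / (2 * s1**2) - 1.0 / (2 * s2**2)
    b = m2 / s2**2 - m1 / s1**2
    c = m1**2 / (2 * s1**2) - m2**2 / (2 * s2**2) + math.log(s1 / s2)
    if abs(a) < 1e-12:
        # Equal widths: the quadratic term vanishes and the equation is linear.
        if abs(b) < 1e-12:
            return m1  # identical distributions
        return -c / b
    disc = max(b * b - 4 * a * c, 0.0)  # guard against round-off
    roots = [(-b + sign * math.sqrt(disc)) / (2 * a) for sign in (1.0, -1.0)]
    mid = 0.5 * (m1 + m2)
    return min(roots, key=lambda x: abs(x - mid))


def cgi_estimate(forward_work, reverse_work):
    """Free energy difference from forward and reverse work samples.

    Assumes reverse_work holds the work done along the reverse process,
    so its negative is compared with the forward work distribution.
    """
    m_f = statistics.mean(forward_work)
    s_f = statistics.stdev(forward_work)
    m_r = statistics.mean(-w for w in reverse_work)
    s_r = statistics.stdev(reverse_work)
    return gaussian_intersection(m_f, s_f, m_r, s_r)
```

For example, `cgi_estimate([9, 10, 11], [-5, -6, -7])` fits two unit-width Gaussians centered at 10 and 6 and returns their midpoint; with unequal widths the crossing point shifts toward the mean of the narrower distribution.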
The key computational challenge requiring the use of Blue Waters is the massive number and duration of simulations needed to obtain a precise, quantitative map of the binding affinity of a TF for the relevant range of DNA sequences. Transcription factor binding sites tend to be 8-10 nucleotides in length, so in principle a comprehensive map of binding affinities would require consideration of 4^8 to 4^10 sequences (roughly 65,000 to 1,000,000). In practice, however, experimentally obtained maps show that only 40-100 sequences need to be calculated to obtain an essentially complete landscape, because every sequence in the remainder of the space contains one or more interactions so unfavorable that it makes no meaningful contribution to TF binding. Thus, for each TF we will begin from a single known binding sequence and systematically map the changes in binding free energy for a progressive series of mutations, pruning all mutational paths that reduce the binding affinity by more than a specified threshold. Even with this strategy, obtaining accurate free energies for ~100 mutations will require 10-20 microseconds of simulation on a system of several hundred thousand atoms for a single transcription factor. Such calculations are only possible using the massive resources provided by Blue Waters.
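The pruned search described above can be sketched as a breadth-first walk over single-nucleotide mutations. This is only an outline under stated assumptions: `ddg_step` is a hypothetical callable standing in for the actual free energy calculation, the threshold is assumed to apply to the cumulative change relative to the starting sequence, and `make_toy_ddg` is a placeholder scoring function included purely so the sketch runs.

```python
from collections import deque

BASES = "ACGT"


def single_mutants(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, old in enumerate(seq):
        for new in BASES:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]


def map_landscape(start, ddg_step, threshold):
    """Explore sequence space outward from a known binding sequence.

    ddg_step(parent, child) is a hypothetical callable returning the change
    in binding free energy for the mutation parent -> child (positive means
    weaker binding). A mutant whose cumulative change relative to start
    exceeds threshold is pruned and never expanded further.
    Returns a dict mapping each retained sequence to its cumulative change.
    """
    landscape = {start: 0.0}
    pruned = set()
    queue = deque([start])
    while queue:
        parent = queue.popleft()
        for child in single_mutants(parent):
            if child in landscape or child in pruned:
                continue  # already evaluated along another path
            total = landscape[parent] + ddg_step(parent, child)
            if total > threshold:
                pruned.add(child)  # too unfavorable: stop this path
                continue
            landscape[child] = total
            queue.append(child)
    return landscape


def make_toy_ddg(reference, penalty=1.0):
    """Placeholder model: each mismatch with reference costs a fixed penalty."""
    def mismatches(seq):
        return sum(a != b for a, b in zip(seq, reference))

    def ddg_step(parent, child):
        return penalty * (mismatches(child) - mismatches(parent))

    return ddg_step
```

With the toy model, `map_landscape("ACGT", make_toy_ddg("ACGT"), threshold=1.0)` retains the starting sequence and its 12 single mutants and prunes every double mutant. In the real workflow each `ddg_step` call would correspond to an expensive set of simulations, which is why recording pruned sequences to avoid re-evaluating them matters.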
Once we have obtained binding affinity maps for our initially targeted transcription factors, we will compare them with experimental results on the same systems to validate (and refine as needed) computational protocols for reliable in silico determination of affinity landscapes for DNA-binding proteins. In addition, we will analyze our trajectories to obtain new insight into the structural basis of these affinity landscapes, and catalog the effects that the binding of different transcription factors has on DNA structure, which appear likely to play a key role in the interplay between different transcription factors regulating the same gene in vivo. Our work will both yield insight into the molecular basis of gene regulation and pave the way toward increasingly accurate computational models of biological processes.