Instrumenting Human Variant Calling Workflow on Blue Waters
If whole genome sequencing and analysis become part of the standard of care in many hospitals within the next few years, human genetic variant calling will need to be performed on hundreds of incoming patients on any given day. At this scale, the standard workflow widely accepted in the research and medical communities will use thousands of nodes at a time and will encounter I/O bottlenecks that could degrade performance even on a major cluster like Blue Waters. In the previous allocation period, we identified and documented these bottlenecks at a smaller scale. Our current project seeks to design tools and methods to overcome them, and to test those tools at the scale of a computational facility able to serve the genomic data analysis needs of a state like Illinois. Specifically, we aim to resolve the bottlenecks associated with the large number of small files created by the workflow, to relieve the saturated I/O bandwidth during part of the workflow, and to ensure a balanced data load on the file system.