Petascale Application Improvement Discovery (PAID) 


Inter-job interference, network congestion and task mapping on torus networks


Abhinav Bhatele, Lawrence Livermore National Laboratory

Network interference both within and across jobs can impact the overall performance of parallel applications significantly. We have been studying inter-job interference and network congestion on torus and dragonfly networks using tools developed at LLNL. Using Boxfish, we can visualize the placement of different jobs on a 3D torus network and the bytes passing through different links. Boxfish can provide visual cues into how job placement can slow down certain jobs. Network congestion among the processes of a job can also impact performance. We have developed tools such as Rubik and Chizu to map MPI processes within a job to different nodes within the allocated partition to optimize communication. Finally, we have been using machine learning to identify the root causes of network congestion on Blue Gene/Q. We propose to use similar techniques to make sense of the performance monitoring data generated on Blue Waters using LDMS.
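Tools like Rubik reorder MPI ranks so that communicating neighbors land on nearby torus nodes. A minimal sketch of the underlying idea (illustrative mappings only, not the Rubik or Chizu API): placing a 1D communication chain onto a 4x4x4 torus with the default row-major ordering versus a boustrophedon ("snake") ordering, and comparing average hop counts.

```python
# Toy illustration of topology-aware task mapping on a 3D torus.
# Hypothetical mappings for a 4x4x4 torus; not the Rubik/Chizu API.

DIMS = (4, 4, 4)  # 64 nodes

def torus_hops(a, b, dims=DIMS):
    """Hop distance between two coordinates on a torus (with wraparound links)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def row_major(rank):
    """Default linear placement: ranks fill x, then y, then z."""
    return (rank % 4, (rank // 4) % 4, rank // 16)

def snake(rank):
    """Boustrophedon placement: reverse direction on alternate rows and
    planes so consecutive ranks are always physical neighbors."""
    x0, y0, z = rank % 4, (rank // 4) % 4, rank // 16
    y = y0 if z % 2 == 0 else 3 - y0
    x = x0 if y0 % 2 == 0 else 3 - x0
    return (x, y, z)

def avg_chain_hops(mapping):
    """Average hops for a 1D chain where rank i talks to rank i+1."""
    hops = [torus_hops(mapping(r), mapping(r + 1)) for r in range(63)]
    return sum(hops) / len(hops)

print(avg_chain_hops(row_major))  # ~1.29: row/plane boundaries cost extra hops
print(avg_chain_hops(snake))      # 1.0: every neighbor pair is adjacent
```

The same hop-count comparison, driven by a job's real communication graph, is what makes a topology-aware placement measurably better than the default.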


View presentation PDF


Application-Focused Parallel I/O with HDF5 on Blue Waters

Mike Folk, The Hierarchical Data Format (HDF) Group

Quincey Koziol, The Hierarchical Data Format (HDF) Group

HDF5 is a high-performance I/O middleware package designed to provide an object-oriented mechanism for efficiently storing science application data. We strive to help application developers maximize their I/O bandwidth, while producing files that can be analyzed within the rich HDF5 ecosystem.
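One reason chunked HDF5 datasets support efficient partial I/O is that a logical N-dimensional selection maps to a small, computable set of fixed-size chunks on disk. A toy model of that chunk-index arithmetic (illustrative only; real HDF5 indexes chunks with a B-tree and is accessed through its C library API):

```python
# Toy model of HDF5-style dataset chunking (illustration, not the HDF5 API).
# A 2D dataset is stored as fixed-size chunks; each element belongs to
# exactly one chunk, addressed by a linear chunk id.

CHUNK = (100, 100)    # chunk shape (hypothetical)
SHAPE = (1000, 1000)  # dataset shape (hypothetical)

def chunks_per_dim():
    """Number of chunks along each dimension (ceiling division)."""
    return tuple(-(-s // c) for s, c in zip(SHAPE, CHUNK))

def chunk_of(i, j):
    """Linear id of the chunk holding element (i, j)."""
    ci, cj = i // CHUNK[0], j // CHUNK[1]
    return ci * chunks_per_dim()[1] + cj

def chunks_touched(rows, cols):
    """Set of chunk ids a hyperslab selection rows x cols must read."""
    return {chunk_of(i, j) for i in range(*rows) for j in range(*cols)}

# A 100x100 read aligned to a chunk boundary touches exactly one chunk...
print(len(chunks_touched((0, 100), (0, 100))))    # 1
# ...while the same-sized read straddling boundaries touches four.
print(len(chunks_touched((50, 150), (50, 150))))  # 4
```

Aligning application read/write patterns with the chunk layout is one of the simplest ways to recover I/O bandwidth from an HDF5 file.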


View presentation PDF


Globus for research data management


Ian Foster, Argonne National Laboratory

Globus is software-as-a-service for research data management, used at dozens of institutions and national facilities for moving and sharing big data. Globus provides easy-to-use services and tools for research data management, enabling researchers to access advanced capabilities using just a web browser. Globus transfer and sharing have been deployed on Blue Waters machines and provide access to the filesystem, including HPSS. Globus has already added improvements to file recall from tape, scaled transfer concurrency, and endpoint load balancing and availability for Blue Waters, and further enhancements to meet the unique needs of such a large-scale system are planned over the next two years. Recent additions to Globus are services for data publication and discovery that enable: publication of large research data sets with appropriate policies for all types of institutions and researchers; the ability to publish data using your own storage or cloud storage that you manage, without third-party publishers; extensible metadata that describes the specific attributes of your field of research; publication and curation workflows that can be easily tailored to meet institutional requirements; public and restricted collections that give you complete control over who may access your published data; and a rich discovery model that allows others to search and use your published data.


View presentation PDF


SPIRAL for Blue Waters


Franz Franchetti, Carnegie Mellon University

The SPIRAL system is a software production system that automatically generates highly efficient software for important kernel functionality on modern processor architectures. It targets machines across the spectrum, from embedded and mobile devices through desktop and server-class machines up to supercomputers. For signal and image processing and communication functions such as the fast Fourier transform (FFT), convolution/correlation, or a Viterbi decoder, SPIRAL has proven able to automatically generate code that is tuned to the given processor and computing platform and outperforms expertly hand-tuned implementations. Performance metrics include computational efficiency and execution rate, energy efficiency, code size, or a composite of the above. In this poster we discuss how SPIRAL's approach can help Blue Waters science teams speed up the spectral solver portion of their codes. The goal is to develop drop-in replacements for currently used libraries that provide the necessary speed-up.
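As a point of reference for what a spectral-solver kernel computes, here is a minimal radix-2 Cooley-Tukey FFT checked against the direct O(n^2) DFT. This is a pedagogical sketch; SPIRAL-generated code would be a heavily optimized, platform-specific realization of the same recursion.

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # Split into even/odd index subsequences and recurse.
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Combine with the twiddle factor e^{-2*pi*i*k/n}.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dft(x):
    """Direct O(n^2) DFT, used only to check the FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

signal = [complex(i % 3, 0) for i in range(16)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(signal), dft(signal)))
```

A drop-in replacement library keeps this call interface fixed while swapping the recursion for code specialized to the target machine's vector units and memory hierarchy.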


View presentation PDF


PAID: Enhancing Scalability of Applications


William Gropp, University of Illinois at Urbana-Champaign

Many applications see an increase in MPI communication costs as they scale to greater numbers of processes. Performance analysis tools often suggest that the MPI communication routines are consuming more time, but in many cases the root cause of the problem is a load imbalance, which can be caused by factors internal to the application (e.g., an uneven decomposition of work) or external to it (e.g., OS "noise" or interference by other applications along communication paths). This project will build on existing tools, enhancing them where necessary, to make it easier to identify the root cause of these load-imbalance scaling problems. Methods to address them include hybrid MPI+threads models, locality-aware thread scheduling, and hybrid MPI+MPI approaches that take advantage of the new shared-memory features introduced in MPI-3.
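A quick way to distinguish "slow MPI" from load imbalance is to compare per-rank compute times: ranks that finish early simply wait in the next collective, and that wait gets charged to MPI by the profiler. A small sketch of the usual imbalance metric (the per-rank times here are made up; in practice they come from a profiling tool):

```python
# Illustrative load-imbalance metric from per-rank compute times.
# The timing values are hypothetical; a profiler would supply real ones.

def imbalance(times):
    """lambda = max/mean - 1: fraction of time the slowest rank spends
    beyond the average. Perfect balance gives 0."""
    mean = sum(times) / len(times)
    return max(times) / mean - 1.0

def apparent_mpi_wait(times):
    """Time each rank would spend waiting at a barrier after its compute
    phase: this shows up as 'MPI time' even though MPI is not the cause."""
    slowest = max(times)
    return [slowest - t for t in times]

compute = [9.0, 10.0, 10.5, 14.5]    # seconds per rank; one slow rank
print(round(imbalance(compute), 3))  # 0.318 -> ~32% imbalance
print(apparent_mpi_wait(compute))    # [5.5, 4.5, 4.0, 0.0]
```

When the "MPI time" reported by a tool tracks this wait pattern rather than message volume, fixing the decomposition, not the communication, is the right response.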


View presentation PDF


Effective Use of Accelerators/Highly Parallel Heterogeneous Units


Wen-mei Hwu, University of Illinois at Urbana-Champaign

An increasing portion of the top supercomputers in the world, including Blue Waters, have heterogeneous CPU-GPU computational units. While a handful of the science teams can already use GPUs in their production applications, there is still significant room for wider use. This program for enabling the science teams to make effective use of GPUs consists of two major components. The first is to make full use of vendor and community compiler technology, now defined by the OpenACC, OpenCL, and C++ AMP standards, to introduce accelerator-based library capabilities for the science teams' applications, and to provide support and enhancement for GPU-enabled performance and analysis tools. This will significantly reduce the programming effort and enhance the code maintainability associated with the use of GPUs. The second is to provide expert support to the science teams through hands-on workshops and individualized collaboration programs. The goal of these efforts is not to develop new compiler technology but rather to help science teams take advantage of the most promising, mature, or experimental compiler-associated capabilities from Cray, NVIDIA, MulticoreWare, the University of Illinois, and other institutions such as the Barcelona Supercomputing Center. Furthermore, the activity will provide detailed feedback to OpenACC compiler providers to enable their compilers to produce more efficient code. In addition to the OpenACC-compliant, OpenMP-like compiler from Cray, the PGI Fortran compiler, and the Thrust C++ template library from NVIDIA, this project will provide and improve the C++ AMP compiler and the MxPA OpenCL compiler to reduce the barrier to using the GPUs in the Cray system.


PAID: Parallel I/O Performance


William Gropp, University of Illinois at Urbana-Champaign

I/O performance is often ignored in studying code scalability and performance, yet many studies have shown that many applications achieve less than one percent of available I/O performance and are limited by the large amount of time they spend performing I/O. With the installation of Darshan on Blue Waters, it is now possible to collect information about the I/O behavior of many applications. This talk will describe some recent findings from Darshan at other supercomputer installations, plans to identify I/O performance issues, and approaches to help applications portably improve the performance of their I/O with minimum impact on the application code.
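Darshan characterizes an application's I/O with lightweight counters rather than by tracing every byte. A sketch of the kind of summary it produces from a sequence of writes (the counter names loosely mirror Darshan's POSIX-module counters, but this is an illustration, not Darshan's implementation):

```python
# Illustrative Darshan-style summary of a write workload.
# Input: (offset, size) pairs in issue order. Counter names are modeled
# loosely on Darshan's POSIX module; the code is not Darshan's.

def summarize_writes(ops):
    total_bytes = sum(size for _, size in ops)
    sequential = 0
    end = None
    for offset, size in ops:
        if end is not None and offset == end:  # starts where the last ended
            sequential += 1
        end = offset + size
    return {
        "WRITES": len(ops),
        "BYTES_WRITTEN": total_bytes,
        "AVG_WRITE_SIZE": total_bytes / len(ops),
        "SEQ_WRITES": sequential,
    }

# Many tiny sequential writes: a common pattern this kind of summary
# makes immediately visible (and a candidate for buffering/aggregation).
tiny = [(i * 64, 64) for i in range(1024)]
print(summarize_writes(tiny))
```

A 64-byte average write size in a summary like this is usually the whole diagnosis: aggregating those writes into large, contiguous operations recovers most of the lost bandwidth without restructuring the application.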


View presentation PDF


Exploiting topology-awareness and load balancing on Blue Waters


Sanjay Kale, University of Illinois at Urbana-Champaign

Blue Waters is a multi-petaflop production system with a large number of cores executing a large set of distinct jobs at any given time. The scalability of a job is impacted both by the dynamics of the jobs executing concurrently and by the characteristics of the individual job. The impact of other jobs can be minimized by identifying the communication requirements of a job and placing it appropriately in the system. At the same time, task mapping within an application can be used to improve scalability. Within a job, load balancing among CPUs, and between CPUs and GPUs, is a must for scaling a large set of applications. This talk provides a brief overview of our ongoing and planned work that will help application teams make use of the aforementioned techniques. I will also talk about the related software that will be made available for use by MPI and Charm++ programmers.
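At the heart of many load-balancing strategies is a simple greedy assignment: sort objects by measured load and repeatedly place the heaviest on the least-loaded processor. A minimal sketch in that spirit (an illustration of the greedy idea, not the Charm++ implementation):

```python
import heapq

def greedy_balance(loads, nprocs):
    """Assign task loads to processors, heaviest first onto the least
    loaded; returns per-processor totals (the makespan is the max)."""
    heap = [(0.0, p) for p in range(nprocs)]  # (current load, proc id)
    heapq.heapify(heap)
    totals = [0.0] * nprocs
    for load in sorted(loads, reverse=True):
        current, p = heapq.heappop(heap)      # least-loaded processor
        totals[p] = current + load
        heapq.heappush(heap, (totals[p], p))
    return totals

def round_robin(loads, nprocs):
    """Load-oblivious baseline: deal tasks out in order."""
    totals = [0.0] * nprocs
    for i, load in enumerate(loads):
        totals[i % nprocs] += load
    return totals

tasks = [7, 5, 4, 3, 2, 2, 1]                # hypothetical object loads
print(max(round_robin(tasks, 3)))    # 11.0: round-robin ignores load
print(max(greedy_balance(tasks, 3))) # 9.0: greedy evens things out
```

In a runtime system the loads are measured rather than assumed, and migration cost and topology are factored into the placement decision, but the greedy core is the same.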


View presentation PDF - Topology

View presentation PDF - Load Balancing


Automatic performance tuning


Robert Lucas, University of Southern California

Maximizing the performance of long-running software is critical to reducing the time to discovery in science and making efficient use of unique assets such as Blue Waters. Performance tuning can be a tedious process, and it often has to be revisited as applications are ported from one HPC system to the next. Motivated by the success of automatic performance tuning for linear algebra (PHiPAC and ATLAS) and FFT (FFTW and SPIRAL) kernels, the DOE SciDAC-3 Institute for Sustained Performance, Energy, and Resilience (SUPER) has been integrating research in performance measurement, modeling, code transformation, and multi-objective optimization into a framework that enables automatic performance tuning of scientific software. We have applied this to a variety of application kernels, demonstrating that some performance optimizations can be automated. This talk will briefly discuss the state of SUPER's automatic performance tuning research and the opportunity it affords to Blue Waters applications.
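At its simplest, empirical autotuning is a search over code variants, keeping whichever measures fastest on the target machine. A toy version of that loop (the kernel and its tunable parameter are invented for illustration; real frameworks search transformed source variants, not just a parameter):

```python
import time

def kernel(data, block):
    """Toy kernel: block-wise summation. 'block' is the tunable parameter;
    every variant computes the same result."""
    total = 0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total

def autotune(data, candidates, repeats=3):
    """Time each variant on the machine at hand and keep the fastest."""
    best, best_time = None, float("inf")
    for block in candidates:
        t0 = time.perf_counter()
        for _ in range(repeats):
            kernel(data, block)
        elapsed = (time.perf_counter() - t0) / repeats
        if elapsed < best_time:
            best, best_time = block, elapsed
    return best, best_time

data = list(range(100_000))
candidates = [16, 256, 4096, 65536]
block, secs = autotune(data, candidates)
print(f"best block size: {block} ({secs:.4f} s per run)")
```

Production autotuners replace exhaustive search with modeling and multi-objective optimization so the search stays tractable as the variant space grows.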


View presentation PDF


Scalability and Performance Advances of Particle-in-Cell Codes on Modern HPC Platforms


William Tang and Bei Wang, Princeton University

The technology in our work scope is the GTC-Princeton (GTC-P) code, a highly scalable particle-in-cell (PIC) code that solves the 5D Vlasov-Poisson equation with efficient utilization of massively parallel computer architectures at the petascale and beyond. It incorporates benefits from computer science advances such as deploying multi-threading capability on modern computing systems. GTC-P's multiple levels of parallelism, including inter-node domain and particle decomposition, intra-node shared-memory partitioning, and vectorization within each core, have enabled pushing the scalability of the PIC method to extreme computational scales. It is capable of delivering discovery science at increasing problem sizes through effective utilization of the leading HPC systems worldwide, including NSF's Stampede and Blue Waters. Building on these experiences, we will work with Blue Waters to communicate and disseminate "best practices" for new efforts, including exploration of directive-enabled GPU kernels for improving portability and lowering the effort needed to re-engineer applications for accelerators.
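The inner loop of a PIC code alternates between depositing particle charge onto a grid and pushing particles in the resulting field; the deposition step is also the source of the irregular memory access that dominates PIC performance on modern hardware. A minimal 1D linear-weighting ("cloud-in-cell") deposition, written as a self-contained sketch unrelated to GTC-P's actual source:

```python
def deposit(positions, ngrid, length):
    """Deposit unit-charge particles onto a periodic 1D grid with linear
    (cloud-in-cell) weighting: each particle's charge is shared between
    its two nearest grid points in proportion to proximity."""
    rho = [0.0] * ngrid
    dx = length / ngrid
    for x in positions:
        s = (x % length) / dx        # position in grid units (periodic)
        i = int(s)                   # left grid point
        frac = s - i                 # fractional distance to the right one
        rho[i % ngrid] += 1.0 - frac
        rho[(i + 1) % ngrid] += frac
    return rho

particles = [0.1, 2.5, 7.9, 3.14159, 9.99]   # hypothetical positions
rho = deposit(particles, ngrid=10, length=10.0)
# Linear weighting conserves charge: the grid total equals the particle count.
print(abs(sum(rho) - len(particles)) < 1e-12)  # True
```

The scatter into `rho` is where threads contend in a shared-memory or GPU version, which is why deposition strategies (replication, atomics, sorting) figure so prominently in PIC optimization work.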


View presentation PDF