Petascale Application Improvement Discovery (PAID)
Inter-job interference, network congestion and task mapping on torus networks
Abhinav Bhatele, Lawrence Livermore National Laboratory
Network interference, both within and across jobs, can significantly impact the overall performance of parallel applications. We have been studying inter-job interference and network congestion on torus and dragonfly networks using tools developed at LLNL. With Boxfish, we can visualize the placement of different jobs on a 3D torus network and the bytes passing through individual links, providing visual cues into how job placement can slow down certain jobs. Network congestion among the processes of a single job can also impact performance. We have developed tools such as Rubik and Chizu that map the MPI processes of a job to the nodes of its allocated partition so as to optimize communication. Finally, we have been using machine learning to identify the root causes of network congestion on Blue Gene/Q. We propose to use similar techniques to make sense of the performance monitoring data generated on Blue Waters using LDMS.
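The central idea behind mapping tools like Rubik is to assign MPI ranks to torus nodes in small sub-blocks rather than in row-major order, so that an application's nearest neighbors land on nearby nodes. The sketch below is a hypothetical, library-free illustration of that blocked ordering; the function name and interface are invented for this example and are not Rubik's actual API.

```python
# Hypothetical sketch of the blocked ("tiled") rank ordering behind tools
# like Rubik: instead of filling a 3D node grid row-major, assign ranks
# tile-by-tile so neighboring ranks stay physically close on the torus.
from itertools import product

def blocked_mapping(grid, tile):
    """Visit the node grid one tile at a time; the i-th coordinate in the
    returned list is the node assigned to MPI rank i."""
    nx, ny, nz = grid
    tx, ty, tz = tile
    mapping = []
    # outer loop over tile origins, inner loop over coordinates in a tile
    for ox, oy, oz in product(range(0, nx, tx), range(0, ny, ty), range(0, nz, tz)):
        for dx, dy, dz in product(range(tx), range(ty), range(tz)):
            mapping.append((ox + dx, oy + dy, oz + dz))
    return mapping

mapping = blocked_mapping((4, 4, 4), (2, 2, 2))
# ranks 0..7 fill one 2x2x2 tile before moving on, keeping them close
print(mapping[:8])
```

With this ordering, a rank's six grid neighbors are usually within the same 2x2x2 tile or an adjacent one, which shortens the network paths that carry nearest-neighbor traffic.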
Application-Focused Parallel I/O with HDF5 on Blue Waters
Mike Folk, The Hierarchical Data Format (HDF) Group
Quincey Koziol, The Hierarchical Data Format (HDF) Group
HDF5 is a high-performance I/O middleware package designed to provide an object-oriented mechanism for efficiently storing science application data. We strive to help application developers maximize their I/O bandwidth, while producing files that can be analyzed within the rich HDF5 ecosystem.
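One factor in HDF5 I/O performance is chunking: large datasets are stored as fixed-size chunks, and a read pays for every chunk its selection touches. The following library-free sketch (not the HDF5 API; the function is invented for illustration) shows how a rectangular selection maps onto chunk indices.

```python
# Library-free sketch of HDF5-style chunking: given a dataset carved into
# fixed-size chunks, list the chunk indices a rectangular selection touches.
# (Illustrative only -- real chunk lookup happens inside the HDF5 library.)
from itertools import product

def chunks_touched(start, count, chunk):
    """start/count/chunk are per-dimension tuples; returns the index
    tuples of every chunk intersected by the selection."""
    ranges = []
    for s, c, k in zip(start, count, chunk):
        first = s // k              # first chunk along this dimension
        last = (s + c - 1) // k     # last chunk along this dimension
        ranges.append(range(first, last + 1))
    return list(product(*ranges))

# a 100x100 dataset with 10x10 chunks: reading rows 5..14, cols 0..9
print(chunks_touched((5, 0), (10, 10), (10, 10)))
```

A selection that is aligned with chunk boundaries touches the minimum number of chunks, which is one reason aligning application access patterns with the chunk shape improves bandwidth.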
Globus for research data management
Ian Foster, Argonne National Laboratory
Globus is software-as-a-service for research data management, used at dozens of institutions and national facilities for moving and sharing big data. Globus provides easy-to-use services and tools for research data management, enabling researchers to access advanced capabilities using just a Web browser. Globus transfer and sharing have been deployed on Blue Waters machines and provide access to the filesystem, including HPSS. Globus has already added improvements to file recall from tape, transfer concurrency scaling, endpoint load balancing, and availability for Blue Waters, and further enhancements to meet the unique needs of such a large-scale system are planned over the next two years. Recent additions to Globus are services for data publication and discovery that enable: publication of large research data sets with appropriate policies for all types of institutions and researchers; the ability to publish data using your own storage or cloud storage that you manage, without third-party publishers; extensible metadata that describe the specific attributes for your field of research; publication and curation workflows that can be easily tailored to meet institutional requirements; public and restricted collections that give you complete control over who may access your published data; and a rich discovery model that allows others to search and use your published data.
SPIRAL for Blue Waters
Franz Franchetti, Carnegie Mellon University
The SPIRAL system (www.spiral.net, www.spiralgen.com) is a software production system that automatically generates highly efficient software for important kernel functionality on modern processor architectures. It targets machines across the spectrum, from embedded and mobile devices through desktop and server-class machines up to supercomputers. For signal and image processing and communication functions such as the fast Fourier transform (FFT), convolution/correlation, or a Viterbi decoder, SPIRAL has been shown to automatically generate code that is tuned to the given processor and computing platform and that outperforms expertly hand-tuned implementations. Performance metrics include computational efficiency and execution rate, energy efficiency, code size, or a composite of the above. In this poster we discuss how SPIRAL's approach can help Blue Waters science teams speed up the spectral solver portion of their codes. The goal is to develop drop-in replacements for currently used libraries that provide the necessary speed-up.
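To ground the discussion, the recursion below is a textbook radix-2 Cooley-Tukey FFT, the kind of kernel for which SPIRAL automatically generates tuned, platform-specific variants. This is a plain reference implementation for power-of-two lengths, not SPIRAL-generated code.

```python
# Minimal radix-2 Cooley-Tukey FFT (power-of-two lengths only). SPIRAL's
# generated code implements the same transform, but restructured and tuned
# for the target platform's vector units and memory hierarchy.
import cmath

def fft(x):
    n = len(x)
    if n == 1:
        return x[:]
    even = fft(x[0::2])   # recurse on even-indexed samples
    odd = fft(x[1::2])    # recurse on odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # twiddle factor combines the two half-size transforms
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

A spectral solver spends most of its time in exactly this kind of transform, which is why a drop-in tuned replacement can translate directly into end-to-end speed-up.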
PAID: Enhancing Scalability of Applications
William Gropp, University of Illinois at Urbana-Champaign
Many applications see an increase in MPI communication costs as they scale to greater numbers of processes. Performance analysis tools often suggest that the MPI communication routines are consuming more time. But in many cases, the root cause of the problem is a load imbalance, which can be caused by factors internal to the application (e.g., an uneven decomposition of work) or external (e.g., OS "noise", interference by other applications along communication paths). This project will build on existing tools, enhancing them where necessary, to make it easier to identify the root cause of load-imbalance scaling problems. Methods to address these problems include the use of hybrid MPI+threads models, locality-aware thread scheduling, and hybrid MPI+MPI approaches that take advantage of the new shared-memory features introduced in MPI-3.
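The diagnostic distinction above can be made quantitative with a classic imbalance metric. The sketch below (an illustrative calculation, not one of the project's tools) shows why uneven per-rank compute time shows up in profiles as "MPI time": fast ranks simply wait at the next synchronization point.

```python
# Sketch of a basic load-imbalance diagnostic: if per-rank compute times
# are uneven, much of the "MPI time" a profiler reports is fast ranks
# waiting at the next synchronization point, not actual network cost.
def imbalance(times):
    """Classic imbalance ratio max/mean: 1.0 means perfectly balanced,
    and (max - mean) approximates the average wait the slowest rank
    induces per iteration."""
    mean = sum(times) / len(times)
    return max(times) / mean

per_rank_compute = [1.0, 1.0, 1.0, 1.6]   # rank 3 has 60% more work
print(round(imbalance(per_rank_compute), 3))
```

Here three of four ranks would each spend about 0.45 time units per iteration blocked in MPI calls, even though the network itself is doing nothing wrong.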
Effective Use of Accelerators/Highly Parallel Heterogeneous Units
William Gropp, University of Illinois at Urbana-Champaign
I/O performance is often ignored in studying code scalability and performance, yet studies have shown that many applications achieve less than one percent of available I/O performance and are limited by the large amount of time they spend performing I/O. With the installation of Darshan on Blue Waters, it is now possible to collect information about the I/O behavior of many applications. This talk will describe some recent findings from Darshan at other supercomputer installations, plans to identify I/O performance issues, and approaches to help applications portably improve the performance of their I/O with minimum impact on the application code.
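As a flavor of the kind of analysis that per-application I/O logs enable, the sketch below flags a common cause of poor bandwidth: a workload dominated by tiny writes. The record format and function here are invented for illustration and are not Darshan's actual log schema.

```python
# Illustrative analysis in the spirit of Darshan log mining: from a list
# of (operation, bytes) records, compute what fraction of writes are
# "small" (under one filesystem block), a frequent bandwidth killer.
# (Record layout is made up for this sketch, not Darshan's format.)
def small_write_fraction(records, threshold=4096):
    writes = [nbytes for op, nbytes in records if op == "write"]
    if not writes:
        return 0.0
    return sum(1 for nbytes in writes if nbytes < threshold) / len(writes)

log = [("write", 128), ("write", 64), ("read", 1 << 20), ("write", 1 << 22)]
print(small_write_fraction(log))  # 2 of 3 writes are under 4 KiB
```

An application with a high small-write fraction is usually better served by aggregating output (e.g., via collective buffering in MPI-IO) than by any filesystem tuning.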
Exploiting topology-awareness and load balancing on Blue Waters
Sanjay Kale, University of Illinois at Urbana-Champaign
Blue Waters is a multi-petaflop production system with a large number of cores executing a large set of distinct jobs at any given time. The scalability of a job is affected both by the characteristics of the job itself and by the dynamics of the jobs executing concurrently. The impact of multiple jobs can be minimized by identifying the communication requirements of a job and placing the job appropriately in the system. At the same time, task mapping within an application execution can be used to improve scalability. Within a job, load balancing among CPUs, and between CPUs and GPUs, is a must to scale a large set of applications. This talk provides a brief overview of our ongoing and planned work that will help application teams make use of the aforementioned techniques. I will also talk about the related software that will be made available for use by MPI and Charm++ programmers.
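The core of measurement-based load balancing, as practiced in Charm++, can be sketched with a simple greedy strategy: assign the heaviest measured tasks first, each to the currently least-loaded processor. This toy version ignores communication and topology, which the real strategies also weigh.

```python
# Toy measurement-based load balancer in the spirit of Charm++'s greedy
# strategy: heaviest tasks first, each to the least-loaded processor.
# (Real Charm++ balancers also account for communication and topology.)
import heapq

def greedy_balance(task_loads, nprocs):
    """task_loads maps task name -> measured load; returns task -> proc."""
    heap = [(0.0, p) for p in range(nprocs)]   # (current load, proc id)
    heapq.heapify(heap)
    assignment = {}
    for task, load in sorted(task_loads.items(), key=lambda kv: -kv[1]):
        total, p = heapq.heappop(heap)         # least-loaded processor
        assignment[task] = p
        heapq.heappush(heap, (total + load, p))
    return assignment

assignment = greedy_balance({"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}, 2)
print(assignment)
```

On this input both processors end up with a load of 5.0; the same measured-load idea extends to heterogeneous CPU/GPU nodes by weighting processor speeds.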
Automatic performance tuning
Robert Lucas, University of Southern California
Maximizing the performance of long-running software is critical to reducing the time to discovery in science and making efficient use of unique assets such as Blue Waters. Performance tuning can be a tedious process, and it often has to be revisited as applications are ported from one HPC system to the next. Motivated by the success of automatic performance tuning for linear algebra (PHiPAC and ATLAS) and FFT (FFTW and SPIRAL) kernels, the DOE SciDAC-3 Institute for Sustained Performance, Energy, and Resilience (SUPER) has been integrating research in performance measurement, modeling, code transformation, and multi-objective optimization into a framework that enables automatic performance tuning of scientific software. We have applied this to a variety of application kernels, demonstrating that some performance optimizations can be automated. This talk will briefly discuss the state of SUPER's automatic performance tuning research and the opportunity it affords to Blue Waters applications.
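At its simplest, empirical autotuning is a search loop: time a parameterized kernel over a small space of variants and keep the fastest. The sketch below illustrates that loop with an invented toy kernel; SUPER's framework layers models and code transformations on top of this basic idea.

```python
# Minimal empirical autotuning loop in the spirit of ATLAS-style search:
# time each candidate parameter value on the real machine, keep the best.
import time

def tune(kernel, candidates, *args):
    """Run kernel(p, *args) for each candidate p; return the fastest p."""
    best, best_t = None, float("inf")
    for p in candidates:
        t0 = time.perf_counter()
        kernel(p, *args)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = p, dt
    return best

# toy kernel: blocked sum over a list; block size is the tuning parameter
def blocked_sum(block, data):
    return sum(sum(data[i:i + block]) for i in range(0, len(data), block))

data = list(range(100_000))
print(tune(blocked_sum, [64, 256, 1024, 4096], data))
```

In practice each candidate would be run several times to average out noise, and the search space (loop tilings, unroll factors, compiler flags) is large enough that models and pruning, not exhaustive search, do most of the work.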
Scalability and Performance Advances of Particle-in-Cell Codes on Modern HPC Platforms
William Tang and Bei Wang, Princeton University
The technology in our work scope is the GTC-Princeton (GTC-P) code, a highly scalable particle-in-cell (PIC) code that solves the 5D Vlasov-Poisson equation with efficient utilization of massively parallel computer architectures at the petascale and beyond. It incorporates benefits from computer science advances such as deploying multi-threading capability on modern computing systems. GTC-P's multiple levels of parallelism, including inter-node domain and particle decomposition, intra-node shared-memory partitioning, and vectorization within each core, have enabled pushing the scalability of the PIC method to extreme computational scales. It is capable of delivering discovery science at increasing problem sizes through effective utilization of the leading HPC systems worldwide, including NSF's Stampede and Blue Waters. Building on these experiences, we will work with Blue Waters to communicate and disseminate "best practices" for new efforts, including exploration of directives-enabled GPU kernels for improving portability and lowering the effort needed to re-engineer applications for accelerators.
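The kernel that makes PIC codes like GTC-P challenging to vectorize and thread-parallelize is the charge-deposition scatter, shown below reduced to 1D with linear weighting. This is a generic textbook sketch, not GTC-P code: the data-dependent, potentially conflicting writes to the grid are exactly what GPU and multi-threaded implementations must restructure.

```python
# Generic 1D particle-in-cell charge deposition with linear weighting:
# each particle scatters its charge to its two nearest grid points on a
# periodic grid. The irregular, data-dependent writes to rho are what
# make this step hard to vectorize and thread-parallelize safely.
def deposit_charge(positions, charges, ngrid, length):
    rho = [0.0] * ngrid
    dx = length / ngrid
    for x, q in zip(positions, charges):
        cell = int(x / dx) % ngrid
        frac = x / dx - int(x / dx)            # offset within the cell
        rho[cell] += q * (1.0 - frac)          # left grid point
        rho[(cell + 1) % ngrid] += q * frac    # right grid point (periodic)
    return rho

rho = deposit_charge([0.25, 1.5], [1.0, 1.0], 4, 4.0)
print(rho)
```

Two particles depositing into the same cell produce a write conflict when the loop is parallelized, which is why threaded and GPU PIC kernels rely on techniques such as private grid replicas or atomic updates.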