Petascale Application Improvement Discovery

Blue Waters is about more than just hardware. The Blue Waters project strives to create an ecosystem of resources, services, and activities that propel computational science and engineering toward the next decade of computing technology. One aspect of this is the Petascale Application Improvement Discovery (PAID) program.

PAID targets the introduction of new, fundamental application approaches in addition to optimization of current applications for Blue Waters, to increase the knowledge of and use of "best practices" for highly scalable computing and data analysis. PAID is based on input from and observations of dozens of Blue Waters science and engineering teams, the adjustments and improvements teams have been making to their applications, and the projected use and challenges over the next decade as derived from detailed interactions with many of the team leaders.

PAID provides funds for Improvement Method Enablers (IMEs) to work with science and engineering teams to help create and implement application improvement technologies.

PLEASE NOTE: If you do not receive a response from a Blue Waters PAID IME within one workday, please email: help+bwpaid@ncsa.illinois.edu.

Submit a Proposal to work with an Improvement Method Enabler

Improvement Method Enablers

Best Practice Identification, Dissemination and Implementation
Effective Use of Accelerators/Highly Parallel Heterogeneous Units
Globus Online
Model Based Code Refactoring and Auto Tuning
Parallel I/O Performance
Scalability and Load Balancing
Science Team Support and Improvements for HDF5 on Blue Waters
SPIRAL FFT
Topology Aware Implementation | Automatic Communication Pattern Detection

Best Practice Identification, Dissemination, and Implementation

Team Leader: William Tang, Princeton University

Contact information: Email: wtang@pppl.gov; Website: w3.pppl.gov/theory/tang.html

Based on the significant experience and lessons learned in developing the modern GTC-Princeton (GTC-P) code, we plan to provide useful information to the Blue Waters science applications community by accumulating, creating, and applying "best practices" for developing portable and efficient science codes across diverse architectures. The GTC-P code is a highly scalable particle-in-cell code used for studying micro-turbulence transport in tokamaks. It is a representative, discovery-science-capable 3D particle-in-cell code that has been successfully ported and optimized on a wide range of multi-petaflops platforms worldwide at the full or near-to-full capability of leading HPC systems, including NSF's "Blue Waters" and "Stampede." Portability also benefits from the fact that GTC-P is not critically dependent on any third-party libraries. Strategies employed to optimize performance, maximize parallelism, and utilize accelerator technology include multiple levels of decomposition for increasing scalability, choice of data layout for maximizing data reuse, leveraging GPU and Xeon Phi accelerators, and using hybrid programming models (MPI+OpenMP+optionally CUDA). In particular, we will explore the trade-off between portability and speedup when choosing between CUDA and compiler directives such as OpenACC and OpenMP 4.0 to make effective use of GPUs.
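
To make the gather-scatter and data-layout ideas concrete, here is a minimal one-dimensional sketch of the two grid operations found in a particle-in-cell code. It is not taken from GTC-P: the function names, the linear weighting, the periodic grid, and the sort-by-cell step are all assumptions chosen for illustration, and positions are assumed to be non-negative.

```python
def deposit_charge(positions, charges, num_cells, dx=1.0):
    """Scatter: spread each particle's charge onto its two nearest grid points
    using linear weights on a periodic 1D grid (illustrative only)."""
    grid = [0.0] * num_cells
    for x, q in zip(positions, charges):
        s = x / dx
        cell = int(s)                 # index of the grid point to the left
        w = s - cell                  # fractional distance from that point
        grid[cell % num_cells] += q * (1.0 - w)
        grid[(cell + 1) % num_cells] += q * w
    return grid


def gather_field(positions, field, dx=1.0):
    """Gather: interpolate a grid quantity back to each particle position."""
    n = len(field)
    values = []
    for x in positions:
        s = x / dx
        cell = int(s)
        w = s - cell
        values.append(field[cell % n] * (1.0 - w) + field[(cell + 1) % n] * w)
    return values


def sort_by_cell(positions, charges, dx=1.0):
    """Hypothetical layout step: reorder particles by cell index so that
    consecutive particles touch neighboring grid entries (better data reuse)."""
    order = sorted(range(len(positions)), key=lambda k: int(positions[k] / dx))
    return [positions[k] for k in order], [charges[k] for k in order]
```

Sorting particles by cell does not change the deposited result; it only changes the order in which grid memory is visited, which is the kind of layout choice the paragraph above refers to.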

View presentation (PDF) | Gather-Scatter Best Practices Tutorial

Effective Use of Accelerators/Highly Parallel Heterogeneous Units

Team Leader: Wen-mei Hwu, University of Illinois at Urbana-Champaign

Contact information: Email: w-hwu@illinois.edu; Website: impact.crhc.illinois.edu/People/Hwu/hwu.aspx

Additional contact(s): John Larson, jlarson1@illinois.edu

An increasing portion of the top supercomputers in the world, including Blue Waters, have heterogeneous CPU-GPU computational units. While a handful of the science teams can already use GPUs in their production applications, there is still significant room for wider use. This program for enabling the science teams to make effective use of GPUs consists of two major components. The first is to make full use of vendor and community compiler technology, now defined by the OpenACC, OpenCL, and C++ AMP standards, introduce accelerator-based library capabilities for the science teams' applications, and provide support and enhancement for GPU-enabled performance and analysis tools. This will significantly reduce the programming effort and enhance code maintainability associated with the use of GPUs.

The second component is to provide expert support to the science teams through hands-on workshops and individualized collaboration programs. The goal of these efforts is not to develop new compiler technology but rather to help science teams take advantage of the most promising, mature, or experimental compiler-associated capabilities from Cray, NVIDIA, MulticoreWare, the University of Illinois, and other institutions, such as the Barcelona Supercomputing Center.

Furthermore, the activity will provide detailed feedback to OpenACC compiler providers so that their compilers can produce more efficient code. In addition to the OpenACC-compliant, OpenMP-like compiler from Cray, the PGI Fortran compiler, and the Thrust C++ template library from NVIDIA, this project will provide and improve the C++ AMP compiler and the MxPA OpenCL compiler to lower the barrier to using the GPUs in the Cray system.

View presentation PDF

Globus Online

Team Lead: Ian Foster, Argonne National Laboratory

Contact information: Email: foster@anl.gov; Website: www.ianfoster.org

Additional contact(s): Rachana Ananthakrishnan, rachana.uc@gmail.com

Globus is software-as-a-service for research data management, used at dozens of institutions and national facilities for moving and sharing big data. Globus provides easy-to-use services and tools for research data management, enabling researchers to access advanced capabilities using just a Web browser. Globus transfer and sharing has been deployed on Blue Waters and provides access to the filesystem, including HPSS. Globus has already added improvements to file recall from tape, scaled transfer concurrency, endpoint load balancing, and availability for Blue Waters, and further enhancements to meet the unique needs of such a large-scale system are planned over the next two years. Recent additions to Globus are services for data publication and discovery that enable: publication of large research data sets with appropriate policies for all types of institutions and researchers; the ability to publish data using your own storage or cloud storage that you manage, without third-party publishers; extensible metadata that describe the specific attributes for your field of research; publication and curation workflows that can be easily tailored to meet institutional requirements; public and restricted collections that give you complete control over who may access your published data; and a rich discovery model that allows others to search and use your published data.

Model Based Code Refactoring and Auto Tuning

Team Leader: Mary Hall, University of Utah

Contact information: Email: mhall@cs.utah.edu

The Department of Energy's SciDAC-3 Institute for Sustained Performance, Energy and Resilience (SUPER) aims to ensure that computational scientists can successfully exploit the current and emerging generation of high-performance computing systems by providing application scientists with strategies and tools to productively maximize performance and by working with application teams to use and implement the tools. Dr. Mary Hall leads the SUPER research effort that focuses on compiler-based approaches to obtaining high performance on state-of-the-art architectures, including multi-cores, GPUs, and petascale platforms. This group is developing autotuning compiler technology to systematically map application code to these diverse architectures and make efficient use of heterogeneous resources in both today's and future extreme-scale systems. SUPER will assist Blue Waters teams in tuning and refactoring their applications.
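
As a rough illustration of what autotuning means in practice, the sketch below times a kernel for several values of one tunable parameter and keeps the fastest. It is an empirical search written by hand, not the SUPER compiler technology described above; the kernel (`blocked_sum`), the tuning parameter (a tile size), and the `autotune` helper are all hypothetical names introduced for this example.

```python
import time


def blocked_sum(matrix, block):
    """Example kernel: sum a square matrix while visiting it in block x block tiles."""
    n = len(matrix)
    total = 0.0
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for i in range(ii, min(ii + block, n)):
                row = matrix[i]
                for j in range(jj, min(jj + block, n)):
                    total += row[j]
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time the kernel for each candidate parameter value and return the
    fastest value together with the best observed time per candidate."""
    timings = {}
    for value in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(data, value)
            best = min(best, time.perf_counter() - start)
        timings[value] = best
    fastest = min(timings, key=timings.get)
    return fastest, timings
```

A compiler-based autotuner would typically generate the code variants itself and may use a performance model to prune the search; here the variants are simply different arguments to one function.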

View presentation video

Parallel I/O Performance

Team Lead: William Gropp, University of Illinois at Urbana-Champaign

Contact information: Email: wgropp@illinois.edu; Website: wgropp.cs.illinois.edu

Is I/O limiting your ability to do your science? A recent examination of I/O performance at several supercomputing centers showed that many applications were I/O limited; some teams were unaware that they could do much better. This project will extend some existing tools to help identify sources of I/O performance problems and provide new tools to help improve I/O performance with minimum impact on applications. This project is looking for teams interested in (a) evaluating I/O performance and choice of I/O methods, (b) providing feedback on preferred I/O workflow (how you wish that it would work, not just how you've chosen to handle I/O in the pursuit of adequate performance), and (c) exploring the use of alternative I/O approaches including automatic performance tuning methods.
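
A first step in evaluating I/O performance is simply to measure what fraction of run time is spent in I/O. The sketch below shows one way to do that by wrapping I/O calls in a timed section. It is not one of the project's tools; the `IOProfile` class and `run_demo` function are hypothetical, and a production code would normally rely on a dedicated I/O profiling tool instead.

```python
import tempfile
import time
from contextlib import contextmanager


class IOProfile:
    """Accumulates wall-clock time spent inside sections marked as I/O."""

    def __init__(self):
        self.io_seconds = 0.0
        self.io_calls = 0
        self._start = time.perf_counter()

    @contextmanager
    def io_section(self):
        begin = time.perf_counter()
        try:
            yield
        finally:
            self.io_seconds += time.perf_counter() - begin
            self.io_calls += 1

    def io_fraction(self):
        """Share of elapsed time since construction that was spent in I/O sections."""
        total = time.perf_counter() - self._start
        return self.io_seconds / total if total > 0 else 0.0


def run_demo(steps=5, values_per_step=10000):
    """Toy loop: compute a list per step, then write it to a temporary file."""
    profile = IOProfile()
    with tempfile.TemporaryFile() as out:
        for step in range(steps):
            data = [float(step + i) for i in range(values_per_step)]  # stand-in for compute
            payload = repr(data).encode()
            with profile.io_section():
                out.write(payload)
                out.flush()
    return profile
```

Calling `run_demo().io_fraction()` gives a number between 0 and 1; a value close to 1 suggests the loop is I/O limited.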

Scalability and Load Balancing

Team Lead: Sanjay Kale, University of Illinois at Urbana-Champaign

Team Member(s): Nikhil Jain

Contact information: Email: kale@illinois.edu; Website: charm.cs.uiuc.edu

Additional contact(s): Harshitha Menon, gplkrsh2@illinois.edu

Load balancing can improve the performance of many parallel applications. Irregularity in a problem causes different processors to finish their workloads at different times, leading to idle time waiting for laggards to complete. Load balancing partitions problems in an intelligent way, ideally assigning an equivalent amount of work to every processor. For complicated problems, the load can vary dynamically as a program progresses, for example if cells migrate or a wave propagates, changing location in the problem domain. For these cases, balancing load requires introspectively monitoring the program as it runs to determine how to optimally move and balance work. Additionally, work must be balanced across processing elements of varying performance and characteristics, such as between CPUs and GPUs.
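
The cost of imbalance can be quantified from per-processor timings. The following sketch computes a few common summary numbers; the function name and the returned keys are assumptions made for this example, and the input is assumed to be the compute time of each processor in one step.

```python
def imbalance_report(loads):
    """loads[p] is the time processor p spends computing in one step.
    The step finishes when the slowest processor does, so everyone else idles."""
    slowest = max(loads)
    average = sum(loads) / len(loads)
    idle_time = sum(slowest - t for t in loads)
    return {
        "step_time": slowest,                       # time until the laggard completes
        "imbalance": slowest / average,             # 1.0 means perfectly balanced
        "idle_fraction": idle_time / (slowest * len(loads)),
    }
```

For loads of 4, 2, 2, and 2 time units, the step takes 4 units even though the average is 2.5, so more than a third of the available processor time is spent waiting.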

We plan on creating a generic load balancing library based on the load balancers of Charm++. This library will provide load balancing decisions to applications, given information on object layout, current load, and communication pattern. We will analyze how to balance load across heterogeneous systems and develop new strategies to accomplish heterogeneous load balancing. Also, we are open for consultation with application teams regarding load balancing, and will provide advice, suggestions, and guidance on what and how to balance.
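
To give a flavor of what a load balancing decision looks like, here is a minimal greedy strategy: objects are placed heaviest first on whichever processor currently has the least accumulated time, with each processor's relative speed taken into account. This is a sketch of a well-known heuristic, not the planned library or the Charm++ API; the function name and arguments are hypothetical, and communication patterns are ignored.

```python
import heapq


def greedy_balance(object_loads, processor_speeds):
    """object_loads: dict mapping object id -> measured work.
    processor_speeds: relative speed of each processor (1.0 = baseline).
    Returns (assignment, finish_times), where assignment maps object id ->
    processor index and finish_times[p] is the accumulated time on processor p."""
    heap = [(0.0, p) for p in range(len(processor_speeds))]
    heapq.heapify(heap)
    finish_times = [0.0] * len(processor_speeds)
    assignment = {}
    # Heaviest objects first; ties are broken by id so the result is deterministic.
    ordered = sorted(object_loads.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for obj, work in ordered:
        current, p = heapq.heappop(heap)          # processor with least accumulated time
        current += work / processor_speeds[p]     # faster processors absorb work sooner
        finish_times[p] = current
        assignment[obj] = p
        heapq.heappush(heap, (current, p))
    return assignment, finish_times
```

With unequal speeds, for example a GPU-backed element given speed 2.0 next to a CPU-only element at 1.0, the faster element ends up with more total work, which is the heterogeneous behavior described above in its simplest form.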

For more information: http://charm.cs.uiuc.edu/research/ldbal

Science Team Support for HDF5 on Blue Waters

Team Lead: Gerd Heber, The HDF Group

Contact information: Email: gheber@hdfgroup.org

The HDF Group has a long and mutually beneficial history with NCSA and the University of Illinois helping to improve the I/O performance of large-scale applications. In order to optimize the use of the HDF software on the Blue Waters system, the Blue Waters project selected the HDF Group to:

Make assessments and improvements for serial and parallel versions of HDF5.
Provide expedited resolution of high priority HDF5 defects and performance requirements.
Provide active engagement by dedicated HDF5 experts with science teams as they enhance their applications to improve I/O performance.
Perform an assessment of an application's needs, consisting of an audit of an application's current I/O use, planning and assessments of potential improvements, and recommendations for further improvements.
Support development efforts that improve the performance of HDF software for the specific needs of science teams' applications, ranging from simple but extremely useful advice, such as better organization of HDF5 files to improve I/O, to extensive projects involving teams of developers.
Research and implement trace-based I/O autotuning support with science teams, as appropriate for their needs.

Further information about the project, including information about using HDF5 on Blue Waters, science team support, and science team highlights using HDF5, can be found at: https://ncsa-bw.atlassian.net/wiki/display/HDFBW

HDF Survey

Please sign in to the portal in order to participate in the HDF survey; otherwise you will not be able to see the survey URL. All Blue Waters Science and Engineering teams are eligible to participate.

A survey addressing the I/O usage and needs of the science and engineering teams on Blue Waters was developed by The HDF Group for the PAID project. This survey is aimed at current and future Blue Waters partners in order to assess their current level of knowledge of and satisfaction with I/O on Blue Waters. All teams are encouraged to fill out the survey. The survey can be found at http://goo.gl/forms/py2k5Xgw4d

If you experience any problems while completing the survey, please contact help@hdfgroup.org.

SPIRAL FFT

Team Lead: Franz Franchetti, Carnegie Mellon University

Contact information: Email: franzf@ece.cmu.edu; Website: users.ece.cmu.edu/~franzf/

Additional contact(s): Jason Larkin, jason.larkin@spiralgen.com

SPIRAL is an automatic program generation system that, entirely autonomously, generates platform-tuned libraries of signal processing transforms, such as the discrete Fourier transform, discrete cosine transform, and many others. The generated software performs comparably to the best hand-tuned implementations available (Intel MKL and IPP, IBM ESSL) for functionality that is available for comparison. SPIRAL addresses one of the key problems in numerical software and hardware development: how to achieve close to optimal performance with reasonable coding effort. In the domain of linear transforms, and for standard multicore platforms, SPIRAL has achieved complete automation: the computer generation of general input-size, vectorized, parallel libraries. SPIRAL has also demonstrated its applicability to large supercomputers (fastest Global FFT on 128K nodes of Blue Gene/P) and to GPUs.
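
For readers unfamiliar with how a transform can be broken down into smaller ones, the sketch below implements the discrete Fourier transform directly and via the classic radix-2 Cooley-Tukey recursion. It is a hand-written illustration of one decomposition, not code generated by SPIRAL and not tuned for any platform; the function names are chosen for this example only.

```python
import cmath


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used here as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft(x):
    """Radix-2 Cooley-Tukey recursion: a size-n transform is split into two
    size-n/2 transforms plus n/2 twiddle-factor multiplications."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        return dft(x)  # odd sizes fall back to the direct formula
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

A generator can apply decompositions like this one in many different orders and combinations, and the fastest choice depends on the target machine, which is why automating the search is attractive.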

The effort is to work with the Spiral group at Carnegie Mellon University and SpiralGen, Inc., to improve their code generation for the Cray XE/XK system, both on a node basis and with the Gemini Torus. Further, the Spiral team, facilitated by the guidance of Blue Waters staff, will engage with selected science teams to implement their technology and improve the teams' applications. The main targets we will investigate are (1) generation of optimized FFT kernels for XE6 nodes (AMD Interlagos) and XK7 nodes (NVIDIA Kepler) to serve as building blocks for high core performance, and (2) generation of convolution-like operations to minimize communication cost across nodes and between the CPU and GPU memory spaces. SpiralGen, Inc. will perform the main development while CMU will provide scientific consulting and interaction with the application teams.

Topology Aware Implementations | Automatic Communication Pattern Detection

Team Lead: Sanjay Kale, University of Illinois at Urbana-Champaign

Team Member(s): Nikhil Jain

Contact information: Email: kale@illinois.edu; Website: charm.cs.uiuc.edu

Additional contact(s): Nikhil Jain, nikhil.jain@acm.org

The goal of the topology project is to develop methods and supporting tools that can automatically and intelligently assist science teams and their users in identifying the communication characteristics of their application, and categorizing them into job-types. At a high level, the potential categories of such a classification include near-neighbor-dominant, global-communication-dominant, communication-indifferent, etc.
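
The kind of classification described above can be sketched from a simple traffic summary. In the example below, the inputs (bytes sent between rank pairs and the fraction of run time spent communicating), the thresholds, and the notion of "near" as rank distance on a ring are all assumptions made for illustration; the project's tools may use different measurements and criteria.

```python
def classify_job(traffic, num_ranks, comm_fraction,
                 radius=1, indifferent_below=0.05, neighbor_share=0.8):
    """traffic: dict mapping (src_rank, dst_rank) -> bytes sent.
    comm_fraction: share of run time spent in communication (0.0 to 1.0).
    The three threshold arguments are hypothetical tuning knobs."""
    total = sum(traffic.values())
    if comm_fraction < indifferent_below or total == 0:
        return "communication-indifferent"
    near = 0
    for (src, dst), nbytes in traffic.items():
        gap = abs(src - dst)
        ring_distance = min(gap, num_ranks - gap)   # ranks treated as a ring
        if ring_distance <= radius:
            near += nbytes
    if near / total >= neighbor_share:
        return "near-neighbor-dominant"
    return "global-communication-dominant"
```

For instance, a halo exchange in which each rank talks only to its two ring neighbors is labeled near-neighbor-dominant, while an all-to-all exchange of the same volume is labeled global-communication-dominant.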

Building upon the topology-aware mapping library developed during deployment of Blue Waters, we will add more capabilities, task mapping functions, and ease-of-use features to the tool. The most important among them is to transparently determine the scheduler job-type to which an application belongs (based on its communication behavior).

The new topology-aware scheduler on Blue Waters is expected to yield the best results when submitted jobs correctly identify their communication category, which is why tools that help users find the right job-type are valuable. For example, these tools will be useful in deciding if an application performs minimal communication and should be scheduled without regard for topology, or if the application performs near-neighbor communication and should be allocated a compact partition whose communication is not affected by other jobs. Another important use of these tools is to help application developers identify communication overheads and the performance bugs they induce in their applications. Finally, these tools will help application users perform topology-aware mapping of their codes at job startup, as well as during execution if the application supports migration.
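
One common way to compare candidate task mappings on a torus is the hop-bytes metric: each message's size multiplied by the number of network hops it travels, summed over all messages. The sketch below computes it for a given placement. It is an illustration only and does not use the Blue Waters mapping library; the function names, the traffic format, and the placement format are assumptions.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus with the
    given dimensions, allowing wraparound in every dimension."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(traffic, placement, dims):
    """traffic: dict mapping (src_rank, dst_rank) -> bytes sent.
    placement: dict mapping rank -> torus coordinate tuple.
    Lower totals mean messages travel fewer links on average."""
    return sum(
        nbytes * torus_hops(placement[src], placement[dst], dims)
        for (src, dst), nbytes in traffic.items()
    )
```

Evaluating the same traffic under two placements shows directly how much a compact, topology-aware mapping reduces the distance that data travels, and the same calculation can be repeated during a run if the application supports migration.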