Blue Waters User Portal

Wednesday, June 28, 2017

GPU Architecture and Concepts

Presenter: John Stone, University of Illinois at Urbana-Champaign

Abstract:

Basic GPU hardware intro (3 slides)
Heterogeneous computing concepts (5 slides)
GPU-accelerated application development strategies (10 slides)
- GPU-accelerated HPC application development/optimization cycle
- GPUs in the context of distributed memory message passing codes
- GPU-accelerated libraries, frameworks, domain-specific languages ...
- Directive-based parallelism, e.g., w/ OpenACC
- Programmer-provided explicit parallelism, CUDA, OpenCL, etc.
- Overview of advanced technologies: C++11, NVRTC, parallel STL, …
- Overview of profiling and debugging approaches
GPU hardware introduction, trends, futures (12 slides)
- Throughput-oriented hardware, latency hiding, occupancy, and relation to SIMT concepts
- GPU memory systems (on-board/on-chip, registers, caches, coalescing)
- Computational thinking in the context of GPU hardware, e.g., “Scatter” vs. “Gather” algorithms and application to GPUs, use of data privatization schemes (e.g., for histograms)
- GPU arithmetic hardware capabilities, mixed precision, special functions
- Host-GPU, GPU P2P, and RDMA concepts/issues:
  - General concepts
  - GPU Unified Memory
  - Interactions with host NUMA
  - Zero-copy approaches, pinned memory, GPUDirect RDMA, P2P and NVLink, …
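
The data privatization idea mentioned above for histograms can be sketched in plain C++. This is only an illustration, not material from the session: the function name `privatized_histogram`, the 256-bin layout, and the serial loop over "workers" are all assumptions. Each worker counts into its own private histogram, so no two workers ever update the same counter, and the private copies are merged in a final reduction step. On a GPU the private copies would typically live in fast on-chip memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the session): byte-value histogram built
// from per-worker private copies. The loop over workers runs serially here;
// each iteration writes only to its own private histogram, which is what
// would let the iterations run concurrently without atomic updates.
std::vector<unsigned long> privatized_histogram(
    const std::vector<unsigned char>& data, std::size_t num_workers) {
    assert(num_workers > 0);
    const std::size_t num_bins = 256;  // one bin per possible byte value

    // One private histogram per worker, all bins initialized to zero.
    std::vector<std::vector<unsigned long>> priv(
        num_workers, std::vector<unsigned long>(num_bins, 0));

    // Split the input into contiguous chunks, one per worker.
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(data.size(), begin + chunk);
        for (std::size_t i = begin; i < end; ++i) {
            ++priv[w][data[i]];  // update touches this worker's copy only
        }
    }

    // Reduction: merge the private histograms into the final result.
    std::vector<unsigned long> hist(num_bins, 0);
    for (std::size_t w = 0; w < num_workers; ++w) {
        for (std::size_t b = 0; b < num_bins; ++b) {
            hist[b] += priv[w][b];
        }
    }
    return hist;
}
```

The trade-off is extra memory and a merge pass in exchange for removing contention on shared counters, which tends to pay off when many inputs map to the same few bins.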

OpenACC

Presenter: Justin Luitjens, NVIDIA

Abstract:

Why Use OpenACC?
Basic Profiling with PGProf
Parallelizing Loops with OpenACC
Controlling Data Movement
Simple Loop Optimizations
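
As a rough illustration of the loop parallelization and data movement topics listed above, the sketch below annotates a SAXPY loop with an OpenACC directive. It is not taken from the session; the function name `saxpy` and the choice of clauses are assumptions. `copyin` asks that `x` be copied to the device only, while `copy` asks that `y` be copied to the device and back. A compiler without OpenACC support ignores the pragma and runs the loop serially.

```cpp
#include <cassert>

// Hypothetical example (not from the session): y = a * x + y.
void saxpy(int n, float a, const float* x, float* y) {
    assert(n >= 0 && x != nullptr && y != nullptr);

    // parallel loop: the iterations are independent and may run in parallel.
    // copyin(x[0:n]): x is read-only on the device, so copy it in only.
    // copy(y[0:n]):   y is read and written, so copy it in and back out.
#pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

When several such loops run back to back, an enclosing data region is the usual way to keep arrays resident on the device between loops rather than copying them for each one.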

CUDA Programming

Presenter: John Stone, University of Illinois at Urbana-Champaign

Abstract:

Part 1
- Introduction to CUDA programming model, key abstractions and terminology (5 slides)
- CUDA thread model, differences w/ other programming systems (5 slides)
- CUDA resource management intro (malloc/free/memcpy, etc.) (5 slides)
- Mapping parallelism to grids/blocks/warps/threads, indexing work by thread IDs (5 slides)
- Anatomy of basic CUDA kernels, comparison with serial code, loop nests, and so on, work through simple examples (10 slides)

Part 2
- Execution of grids/blocks/warps/threads, divergence, etc. … (5 slides)
- Memory-bandwidth-bound kernels vs. arithmetic bound kernels, concepts and strategies (5 slides)
- Memory systems, performance traits and requirements, optimizations (10 slides)
  - Global memory, coalescing, SOA vs. AOS, broadcasts of reads to multiple threads, use of vector intrinsic types for higher bandwidth
  - Shared memory, bank conflicts, use for AOS to SOA conversion
  - Collective operations and synchronization basics, use of shared memory
  - Other memory systems: constant cache, 1D/2D/3D textures, host-mapped memory over PCIe/NVLink, Peer-to-Peer memory accesses and the like
  - Atomic operations
- Quick overview of GPU occupancy, register usage, launch configurations, and other kernel tuning concepts (5 slides)
- Exciting new features in CUDA 9 (5 slides)
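
The Part 1 topic of indexing work by thread IDs can be sketched without a GPU. The code below is a serial C++ emulation, not CUDA and not material from the session; the function name `scale_by_grid` and its parameters are assumptions. The two nested loops stand in for a 1D grid of blocks, and the index expression mirrors the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`, including the bounds check needed when the element count is not a multiple of the block size.

```cpp
#include <cassert>
#include <vector>

// Hypothetical serial emulation (not from the session) of a 1D CUDA launch:
// every (block, thread) pair handles at most one array element.
std::vector<float> scale_by_grid(const std::vector<float>& in, float factor,
                                 int block_dim) {
    assert(block_dim > 0);
    const int n = static_cast<int>(in.size());

    // Round up so the grid covers all n elements.
    const int grid_dim = (n + block_dim - 1) / block_dim;

    std::vector<float> out(in.size());
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            // Global index of the element owned by this emulated thread.
            const int i = block_idx * block_dim + thread_idx;
            if (i < n) {  // threads past the end of the array do nothing
                out[i] = factor * in[i];
            }
        }
    }
    return out;
}
```

In a real kernel the loop bodies would run concurrently, with the loop counters replaced by the built-in block and thread index variables.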

GPU Application Optimization and Scaling with Profiling and Debugging

Presenter: Fernanda Foertter, Oak Ridge Leadership Computing Facility

Abstract:

This session will cover lessons learned from porting applications to GPU-accelerated machines such as Titan and Blue Waters. Best practices covered include profile-driven development, what to look for in a profile, and analysis of data structures and call stacks. The session will also look ahead to future multi-GPU architectures such as Summit and how data movement will behave on such systems.


Scaling To Petascale Institute
