Wednesday, June 28, 2017

GPU Architecture and Concepts

Presenter: John Stone, University of Illinois at Urbana-Champaign


  • Basic GPU hardware intro (3 slides)
  • Heterogeneous computing concepts (5 slides)
  • GPU-accelerated application development strategies (10 slides)
    • GPU-accelerated HPC application development/optimization cycle
    • GPUs in the context of distributed memory message passing codes
    • GPU-accelerated libraries, frameworks, domain-specific languages, …
    • Directive-based parallelism, e.g., w/ OpenACC
    • Programmer-provided explicit parallelism, CUDA, OpenCL, etc.
    • Overview of advanced technologies: C++11, NVRTC, parallel STL, …
    • Overview of profiling and debugging approaches
  • GPU hardware introduction, trends, futures (12 slides)
    • Throughput-oriented hardware, latency hiding, occupancy, and relation to SIMT concepts
    • GPU memory systems (on-board/on-chip, registers, caches, coalescing)
    • Computational thinking in the context of GPU hardware, e.g., “Scatter” vs. “Gather” algorithms and application to GPUs, use of data privatization schemes (e.g., for histograms)
    • GPU arithmetic hardware capabilities, mixed-precision, special functions
    • Host-GPU, GPU P2P, and RDMA concepts/issues:
      • General concepts
      • GPU Unified Memory
      • Interactions with host NUMA
      • Zero-copy approaches, pinned memory, GPUDirect RDMA, P2P and NVLink, …
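The data-privatization bullet above (e.g., for histograms) follows a well-known pattern; a minimal CUDA sketch of it, with illustrative names (`histo_kernel`, `NUM_BINS`) that are assumptions rather than taken from the slides:

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256

__global__ void histo_kernel(const unsigned char *data, int n,
                             unsigned int *global_histo)
{
    // Privatization: one histogram per thread block in shared memory,
    // so most atomic updates hit fast on-chip memory instead of DRAM.
    __shared__ unsigned int local[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Grid-stride loop: consecutive threads read consecutive bytes,
    // so global-memory loads coalesce.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // Merge the block-private histogram into the global result.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&global_histo[i], local[i]);
}
```

This also illustrates the "gather vs. scatter" tension mentioned above: a histogram is inherently a scatter, and privatization reduces the cost of the contended scatter updates.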


OpenACC Programming

Presenter: Justin Luitjens, NVIDIA


  • Why Use OpenACC?
  • Basic Profiling with PGProf
  • Parallelizing Loops with OpenACC
  • Controlling Data Movement
  • Simple Loop Optimizations
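The loop-parallelization and data-movement bullets above can be sketched with a SAXPY example (illustrative code, assumed rather than taken from the slides). Built without an OpenACC compiler the pragmas are ignored and the loop runs serially; with an OpenACC compiler such as pgcc -acc, it is offloaded:

```c
/* SAXPY with OpenACC directives: y = a*x + y.
 * The copyin/copy clauses make data movement explicit: x is only
 * transferred to the device, y is transferred in both directions. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Without the explicit data clauses, a naive port can transfer the arrays on every kernel launch, which is why controlling data movement gets its own bullet above.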

CUDA Programming

Presenter: John Stone, University of Illinois at Urbana-Champaign


  • Part 1
    • Introduction to CUDA programming model, key abstractions and terminology (5 slides)
    • CUDA thread model, differences w/ other programming systems (5 slides)
    • CUDA resource management intro (malloc/free/memcpy etc) (5 slides)
    • Mapping parallelism to grids/blocks/warps/threads, indexing work by thread IDs (5 slides)
    • Anatomy of basic CUDA kernels, comparison with serial code, loop nests, and so on, work through simple examples (10 slides)
  • Part 2
    • Execution of grids/blocks/warps/threads, divergence, etc … (5 slides)
    • Memory-bandwidth-bound kernels vs. arithmetic bound kernels, concepts and strategies (5 slides)
    • Memory systems, performance traits and requirements, optimizations (10 slides)
      • Global memory, coalescing, SOA vs. AOS, broadcasts of reads to multiple threads,  use of vector intrinsic types for higher bandwidth
      • Shared memory, bank conflicts, use for AOS to SOA conversion 
      • Collective operations and synchronization basics, use of shared memory
      • Other memory systems: constant cache, 1D/2D/3D textures, host-mapped memory over PCIe/NVLink, Peer-to-Peer memory accesses and the like
      • Atomic operations
    • Quick overview of GPU occupancy, register usage, launch configurations, and other kernel tuning concepts (5 slides)
    • Exciting new features in CUDA 9 (5 slides)
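The Part 1 bullets (thread indexing, resource management, kernel anatomy vs. serial code) share one standard pattern; a minimal sketch with illustrative names, not the actual workshop example:

```cuda
#include <cuda_runtime.h>

// Serial reference: for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
// The CUDA kernel replaces the loop with one thread per element.
__global__ void vecadd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n)              // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;

    // cudaMallocManaged: Unified Memory, accessible from host and device.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Round the grid size up so every element gets a thread.
    int block = 256;
    int grid = (n + block - 1) / block;
    vecadd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```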

GPU Application Optimization and Scaling with Profiling and Debugging

Presenter: Fernanda Foertter, Oak Ridge Leadership Computing Facility


This session will cover lessons learned from porting applications to GPU-accelerated machines such as Titan and Blue Waters. Best practices include profile-driven development, knowing what to look for in a profile, and analysis of data structures and call stacks. The session will also look ahead to future multi-GPU architectures such as Summit and how data movement will behave on such systems.