Petascale Computing Institute
CUDA. Part 1/3
Presenter: John E. Stone, UIUC
  • Introduction to CUDA programming model, key abstractions and terminology
  • CUDA thread model, differences from other programming systems
  • CUDA resource management intro (malloc/free/memcpy, etc.)
  • Mapping parallelism to grids/blocks/warps/threads, indexing work by thread IDs
  • Anatomy of basic CUDA kernels: comparison with serial code and loop nests, working through simple examples
CUDA. Part 2/3 (CUDA and OpenACC)
Presenter: John E. Stone, UIUC
  • Execution of grids/blocks/warps/threads, divergence, etc.
  • Memory-bandwidth-bound kernels vs. arithmetic-bound kernels, concepts and strategies
  • Memory systems, performance traits and requirements, optimizations
    • Global memory, coalescing, SOA vs. AOS, broadcasts of reads to multiple threads, use of vector intrinsic types for higher bandwidth
    • Shared memory, bank conflicts, use for AOS to SOA conversion
    • Collective operations and synchronization basics, use of shared memory
    • Other memory systems: constant cache, 1D/2D/3D textures, host-mapped memory over PCIe/NVLink, peer-to-peer memory accesses
    • Atomic operations
  • Quick overview of GPU occupancy, register usage, launch configurations, and other kernel tuning concepts
  • Exciting new features in CUDA 10.x and beyond

Resources at NERSC
Presenter: Woo-Sun Yang, NERSC

CUDA. Part 3/3 (Hands-on session)
Presenter: Dmitry Liakh, OLCF
In this course we will develop a reduced, simplified version of the CUDA BLAS library by implementing CUDA kernels for a few frequently used BLAS functions. Starting from a baseline, unoptimized kernel implementation, we will gradually introduce optimizations to improve efficiency, and we will compare our implementation against the state-of-the-art reference cuBLAS library.
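
To give a sense of what a baseline, unoptimized kernel might look like, here is a minimal sketch of SAXPY (y = a*x + y), a level-1 BLAS operation. The choice of SAXPY is an assumption made for illustration; the session itself may use different BLAS functions. The names `saxpy_naive` and `saxpy_host` are hypothetical and are not part of cuBLAS or of the course materials.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Naive SAXPY kernel: each thread updates one element, y[i] = a * x[i] + y[i].
// (Illustrative baseline only; no tuning has been applied.)
__global__ void saxpy_naive(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the partial last block
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical host wrapper: copies x and y to the GPU, launches the kernel,
// and copies the updated y back. Returns the first CUDA error encountered.
cudaError_t saxpy_host(float a, const std::vector<float> &x, std::vector<float> &y) {
    if (x.size() != y.size()) return cudaErrorInvalidValue;
    const int n = static_cast<int>(x.size());
    if (n == 0) return cudaSuccess;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *d_x = nullptr;
    float *d_y = nullptr;
    cudaError_t err = cudaMalloc(&d_x, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_y, bytes);
    if (err != cudaSuccess) { cudaFree(d_x); return err; }

    err = cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threads = 256;                         // threads per block (arbitrary choice)
        const int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
        saxpy_naive<<<blocks, threads>>>(n, a, d_x, d_y);
        err = cudaGetLastError();                        // catches launch configuration errors
    }

    // This blocking copy also waits for the kernel on the default stream to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    return err;
}
```

Calling `saxpy_host(3.0f, x, y)` with every element of `x` set to 1 and every element of `y` set to 2 should leave each element of `y` equal to 5. The corresponding cuBLAS routine for comparison is `cublasSaxpy`.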