Petascale Computing Institute
Wednesday


CUDA. Part 1/3
Presenter: John E. Stone, UIUC
CUDA Slides
  • Introduction to CUDA programming model, key abstractions and terminology
  • CUDA thread model, differences w/ other programming systems
  • CUDA resource management intro (cudaMalloc/cudaFree/cudaMemcpy, etc.)
  • Mapping parallelism to grids/blocks/warps/threads, indexing work by thread IDs
  • Anatomy of basic CUDA kernels, comparison with serial code, loop nests, and so on; work through simple examples (see the vector-add sketch after this list)
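
A minimal sketch of the kernel anatomy, thread indexing, and resource management covered above, using a simple vector-add example (the names vecAdd, ha/da, and the problem size are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Each thread computes one element: the serial loop index i becomes
// an index derived from the block and thread IDs.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                           // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host allocation and initialization
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device resource management: cudaMalloc / cudaMemcpy / cudaFree
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch configuration: map the 1-D problem onto a grid of blocks
    int block = 256;
    int grid = (n + block - 1) / block;  // round up to cover all n
    vecAdd<<<grid, block>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);        // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```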
CUDA. Part 2/3 (CUDA and OpenACC)
Presenter: John E. Stone, UIUC
  • Execution of grids/blocks/warps/threads, divergence, etc.
  • Memory-bandwidth-bound kernels vs. arithmetic-bound kernels, concepts and strategies
  • Memory systems, performance traits and requirements, optimizations
    • Global memory, coalescing, SOA vs. AOS, broadcasts of reads to multiple threads, use of vector intrinsic types for higher bandwidth (see the SOA/AOS sketch after this list)
    • Shared memory, bank conflicts, use for AOS to SOA conversion
    • Collective operations and synchronization basics, use of shared memory (see the reduction sketch after this list)
    • Other memory systems: constant cache, 1D/2D/3D textures, host-mapped memory over PCIe/NVLink, peer-to-peer memory accesses, and the like
    • Atomic operations
  • Quick overview of GPU occupancy, register usage, launch configurations, and other kernel tuning concepts
  • Exciting new features in CUDA 10.x and beyond
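
A hedged sketch of the SOA-vs-AOS point above: when consecutive threads read consecutive words, the loads coalesce into wide memory transactions, and vector types such as float4 widen each thread's access further. The struct and kernel names here are illustrative assumptions:

```cuda
// AOS: thread i reads p[i].x, so consecutive threads touch addresses
// 12 bytes apart and the loads coalesce poorly.
struct ParticleAOS { float x, y, z; };

__global__ void scaleX_aos(ParticleAOS *p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;              // strided access pattern
}

// SOA: consecutive threads read consecutive floats from x[], which
// coalesces into full-width memory transactions.
struct ParticleSOA { float *x, *y, *z; };

__global__ void scaleX_soa(ParticleSOA p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;              // unit stride, coalesced
}

// Vector intrinsic type: one float4 load/store moves 16 bytes per
// thread, raising achievable bandwidth on bandwidth-bound kernels.
__global__ void scale4(float4 *v, float s, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 t = v[i];
        t.x *= s; t.y *= s; t.z *= s; t.w *= s;
        v[i] = t;
    }
}
```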
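And a small sketch of the collective-operation and atomics items: a block-level tree sum in shared memory, ordered by __syncthreads(), with one atomicAdd per block folding the partial sums into a global total (the kernel name and block size are assumptions; out must be zeroed before launch):

```cuda
#define BLOCK 256  // launch with blockDim.x == BLOCK

__global__ void sumReduce(const float *in, float *out, int n) {
    __shared__ float s[BLOCK];           // per-block scratch space
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : 0.0f;     // coalesced load into shared memory
    __syncthreads();                     // all loads visible before reducing

    // Tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();                 // barrier between reduction steps
    }

    // One atomic per block combines the partial sums in global memory
    if (tid == 0)
        atomicAdd(out, s[0]);
}
```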


Resources at NERSC
Presenter: Woo-Sun Yang, NERSC
Slides


CUDA. Part 3/3 (Hands-on session)
Presenter: Dmitry Liakh, OLCF
Abstract
In this hands-on session, we will develop a reduced, simplified version of the CUDA BLAS library by implementing CUDA kernels for a few frequently used BLAS functions. We will start from a base, unoptimized kernel implementation, gradually introduce optimizations to improve its efficiency, and compare our implementation against the state-of-the-art cuBLAS reference library.
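
As a hedged illustration of the session's starting point (not its actual code), here is a base, unoptimized kernel for one frequently used BLAS function, single-precision GEMM; optimizations such as shared-memory tiling and better coalescing would start from something like this and close the gap to cuBLAS:

```cuda
// Naive SGEMM: C = alpha*A*B + beta*C for row-major n x n matrices.
// One thread computes one element of C; no tiling, no shared memory.
__global__ void sgemm_naive(int n, float alpha, const float *A,
                            const float *B, float beta, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; k++)      // dot of A's row and B's column
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = alpha * acc + beta * C[row * n + col];
    }
}
```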