Presenter: John E. Stone, UIUC
CUDA Slides
Introduction to CUDA programming model, key abstractions and terminology
CUDA thread model, differences from other programming systems
CUDA resource management intro (malloc/free/memcpy, etc.)
Mapping parallelism to grids/blocks/warps/threads, indexing work by thread IDs
Anatomy of basic CUDA kernels, comparison with serial code, loop nests, and so on; work through simple examples
Presenter: John E. Stone, UIUC
Execution of grids/blocks/warps/threads, divergence, etc.
Memory-bandwidth-bound kernels vs. arithmetic-bound kernels, concepts and strategies
Memory systems, performance traits and requirements, optimizations
Global memory, coalescing, SOA vs. AOS, broadcasts of reads to multiple threads, use of vector intrinsic types for higher bandwidth
Shared memory, bank conflicts, use for AOS to SOA conversion
Collective operations and synchronization basics, use of shared memory
Other memory systems: constant cache, 1D/2D/3D textures, host-mapped memory over PCIe/NVLink, peer-to-peer memory accesses, and the like
Atomic operations
Quick overview of GPU occupancy, register usage, launch configurations, and other kernel tuning concepts
Exciting new features in CUDA 10.x and beyond
Presenter: Woo-Sun Yang, NERSC
Slides
Presenter: Dmitry Liakh, OLCF
In this course we will develop a reduced and simplified version of the CUDA
BLAS library by implementing CUDA kernels for a few frequently used BLAS
functions. We will start from a basic, unoptimized kernel implementation,
gradually introduce optimizations to improve its efficiency, and compare our
implementation against the state-of-the-art reference cuBLAS library.
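
To give a sense of what a basic, unoptimized starting point can look like, here is a minimal sketch of a BLAS level-1 SAXPY (y = a*x + y) kernel. The choice of SAXPY is an assumption made for illustration, since the outline does not say which BLAS functions the session covers. The names saxpy_naive, check, and saxpy_on_gpu are hypothetical and not taken from the course materials.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Baseline kernel: one thread per element, no tuning of any kind.
__global__ void saxpy_naive(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) {                                    // guard the partial last block
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper: turn any CUDA runtime error into an exception.
static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical host wrapper: copies x and y to the device, runs the kernel,
// and copies the updated y back. Device buffers leak if a call throws,
// which is tolerable for a sketch but not for production code.
void saxpy_on_gpu(float a, const std::vector<float>& x, std::vector<float>& y) {
    if (x.size() != y.size()) throw std::invalid_argument("size mismatch");
    const int n = static_cast<int>(x.size());
    if (n == 0) return;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* d_x = nullptr;
    float* d_y = nullptr;
    check(cudaMalloc(&d_x, bytes));
    check(cudaMalloc(&d_y, bytes));
    check(cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // round up to cover n
    saxpy_naive<<<blocks, threads>>>(n, a, d_x, d_y);
    check(cudaGetLastError());                       // catch launch errors

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_x));
    check(cudaFree(d_y));
}
```

Calling saxpy_on_gpu(3.0f, x, y) with every element of x set to 1.0f and every element of y set to 2.0f leaves every element of y equal to 5.0f. The corresponding routine in the reference library is cublasSaxpy, which makes it a natural baseline to compare a kernel like this against.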