Petascale Computing Institute

Wednesday, August 21, 2019

OpenACC (continued)

Presenter: Matt Norman, ORNL


The OpenACC presentation will focus on practical aspects of the OpenACC specification and how to implement a GPU code with directives. It will cover data movement, GPU routines with the parallel loop and kernels directives, asynchronicity, and more advanced topics. The emphasis, however, will be on the practical process of moving from a typical CPU-based code to a refactored code that runs efficiently on GPUs, including managing CPU threading, exposing threading, efficient test-driven development, portability concerns, and debugging. A small OpenACC code will be used for hands-on training, and a larger code will be made available for those who want to see OpenACC in a slightly more realistic application.
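As a rough illustration of the directive-based approach described above, the following is a minimal sketch in C, not material from the session. The function name `saxpy_acc` and the choice of a SAXPY loop are assumptions made for this example. The `parallel loop` directive asks an OpenACC compiler to offload the loop, and the `copyin` and `copy` clauses describe the data movement between host and device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
 * With an OpenACC compiler, the directive requests that the loop be
 * offloaded. copyin(x) sends x to the device; copy(y) sends y to the
 * device and brings the result back. A compiler without OpenACC
 * support ignores the pragma and runs the loop serially on the CPU. */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);  /* host-side precondition */

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the loop body is unchanged from the CPU version, the same source can be built with and without OpenACC enabled and the results compared, which fits the test-driven refactoring process the abstract mentions.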


Presenter: John Stone, UIUC


The CUDA session will introduce the principal abstractions in the CUDA programming model that allow programmers to harness the throughput-oriented computing hardware of GPUs to process hundreds of thousands of data-parallel work items concurrently with high performance. The session will describe how to use CUDA to manage GPU memory and computing resources, execute work on the GPU, and transfer input data and results between the host and the GPU. The session will emphasize the use of profiling and software analysis tools to inform software refactoring and GPU algorithm design decisions.

Part 1

  • Introduction to CUDA programming model, key abstractions and terminology
  • CUDA thread model, differences w/ other programming systems
  • CUDA resource management intro (malloc/free/memcpy etc)
  • Mapping parallelism to grids/blocks/warps/threads, indexing work by thread IDs
  • Anatomy of basic CUDA kernels, comparison with serial code, loop nests, and so on, work through simple examples
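The mapping from grids, blocks, and threads to work items listed in Part 1 can be sketched on the CPU. The following is a host-side emulation in plain C, not CUDA code and not taken from the slides. The function names and the `block_dim` parameter are hypothetical. The two outer loops stand in for the hardware scheduling of blocks and threads, and the innermost body is what a single CUDA thread would execute.

```c
#include <assert.h>

/* Serial reference: one loop iteration per element. */
void scale_serial(int n, float s, float *data)
{
    for (int i = 0; i < n; ++i) {
        data[i] *= s;
    }
}

/* Host-side emulation of a 1D launch with block_dim threads per block. */
void scale_grid_emulated(int n, float s, float *data, int block_dim)
{
    assert(block_dim > 0);
    /* Enough blocks to cover n elements: ceil(n / block_dim). */
    int grid_dim = (n + block_dim - 1) / block_dim;

    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            /* In a CUDA kernel: blockIdx.x * blockDim.x + threadIdx.x */
            int i = block_idx * block_dim + thread_idx;
            if (i < n) {  /* guard: the last block may be only partly used */
                data[i] *= s;
            }
        }
    }
}
```

On a GPU the two outer loops disappear: each thread computes its own `i` from its block and thread IDs, and the bounds guard remains because the launch is rounded up to a whole number of blocks.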

Part 2

  • Execution of grids/blocks/warps/threads, divergence, etc.
  • Memory-bandwidth-bound kernels vs. arithmetic bound kernels, concepts and strategies
  • Memory systems, performance traits and requirements, optimizations
    • Global memory, coalescing, SOA vs. AOS, broadcasts of reads to multiple threads, use of vector intrinsic types for higher bandwidth
    • Shared memory, bank conflicts, use for AOS to SOA conversion
    • Collective operations and synchronization basics, use of shared memory
    • Other memory systems: constant cache, 1D/2D/3D textures, host-mapped memory over PCIe/NVLink, Peer-to-Peer memory accesses and the like
    • Atomic operations
  • Quick overview of GPU occupancy, register usage, launch configurations, and other kernel tuning concepts
  • Exciting new features in CUDA 10.x and beyond
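The SOA vs. AOS item in the Part 2 outline refers to two ways of laying out the same data. The following is a small sketch in plain C, with hypothetical type and function names, that shows both layouts and a conversion between them. In the structure-of-arrays form, element `i` and element `i + 1` of a given field sit at adjacent addresses, which is the access pattern that coalesced global memory loads favor when consecutive threads read consecutive elements.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures (AOS): the fields of one particle are adjacent. */
typedef struct {
    float x, y, z;
} ParticleAoS;

/* Structure of arrays (SOA): each field is contiguous across particles. */
typedef struct {
    float *x;
    float *y;
    float *z;
} ParticlesSoA;

/* Copy n particles from AOS layout into caller-allocated SOA arrays. */
void aos_to_soa(size_t n, const ParticleAoS *in, ParticlesSoA out)
{
    assert(in != NULL && out.x != NULL && out.y != NULL && out.z != NULL);
    for (size_t i = 0; i < n; ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
    }
}
```

This version runs on the host. The outline also lists shared memory as a place to perform the same conversion on the GPU, which is not shown here.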

CUDA Slides

Resources at NERSC
Presenter: Woo-Sun Yang, NERSC


CUDA and OpenACC

Presenter: John Stone, UIUC


  • GPU-accelerated HPC application development/optimization cycle
  • GPUs in the context of distributed memory message passing codes
  • Directive-based parallelism, e.g., w/ OpenACC
  • Programmer-provided explicit parallelism, CUDA, OpenCL, etc.
  • Overview of advanced technologies: C++11, NVRTC, parallel STL, etc.

Practical CUDA kernel optimization exercise

Presenter: Dmitry Liakh, OLCF


In this course we will develop a reduced and simplified version of the CUDA BLAS library by implementing CUDA kernels for a few frequently used BLAS functions. We will start from a base, unoptimized kernel implementation and gradually introduce optimizations to improve its efficiency, comparing our implementation against the state-of-the-art reference cuBLAS library along the way.