In this course we will develop a reduced and simplified version of the CUDA
BLAS library by implementing CUDA kernels for a few frequently used BLAS
functions. We will start from a basic, unoptimized kernel implementation,
gradually introduce optimizations to improve its efficiency, and compare our
implementation against cuBLAS, the state-of-the-art reference library.
Blue Waters is supported by the National Science Foundation (ACI-0725070 and ACI-1238993), the State of Illinois and the University of Illinois.