Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100

GTC 2020 S21745
Presenters: Andrew Kerr, NVIDIA
Abstract
The NVIDIA Ampere GPU architecture pushes the performance envelope by doubling the math throughput of Tensor Cores for mixed precision, and it adds support for double precision, TensorFloat-32 (TF32), and bfloat16 data types. We’ll describe how to implement high-performance CUDA kernels using Tensor Cores on A100, applying techniques such as register blocking, software pipelining, and carefully constructed memory layouts that avoid bank conflicts. Then we’ll describe the abstractions for programming Tensor Cores available in CUTLASS, as well as other new features. This talk is intended for advanced CUDA C++ programmers who are eager to write kernels that push Tensor Cores to peak performance. We recommend reviewing previous presentations on this topic, such as the introduction to CUTLASS (GTC 2018) and Programming Volta Tensor Cores in CUTLASS (GTC 2019).
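For readers unfamiliar with register blocking, the idea can be sketched in plain host-side C++. This is not code from the talk or from CUTLASS. It is a minimal illustration in which the function name `gemm_register_blocked` and the tile sizes `kTileM` and `kTileN` are hypothetical. Each output tile is accumulated in a small local array that the compiler can keep in registers, and each value loaded from A or B is reused across a whole row or column of that tile. A CUDA kernel would apply the same pattern per thread, with Tensor Core instructions replacing the inner multiply-add loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes chosen for illustration; not taken from the talk.
constexpr std::size_t kTileM = 4;
constexpr std::size_t kTileN = 4;

// Computes C = A * B for row-major matrices: A is M x K, B is K x N, C is M x N.
// Assumes M is a multiple of kTileM and N is a multiple of kTileN.
void gemm_register_blocked(const std::vector<float>& A,
                           const std::vector<float>& B,
                           std::vector<float>& C,
                           std::size_t M, std::size_t N, std::size_t K) {
    assert(M % kTileM == 0 && N % kTileN == 0);
    for (std::size_t i0 = 0; i0 < M; i0 += kTileM) {
        for (std::size_t j0 = 0; j0 < N; j0 += kTileN) {
            // Accumulator tile, small enough to be held in registers.
            float acc[kTileM][kTileN] = {};
            for (std::size_t k = 0; k < K; ++k) {
                // Load one column fragment of A and one row fragment of B once.
                float a_frag[kTileM];
                float b_frag[kTileN];
                for (std::size_t i = 0; i < kTileM; ++i) {
                    a_frag[i] = A[(i0 + i) * K + k];
                }
                for (std::size_t j = 0; j < kTileN; ++j) {
                    b_frag[j] = B[k * N + (j0 + j)];
                }
                // Outer product: each loaded value is reused across the tile.
                for (std::size_t i = 0; i < kTileM; ++i) {
                    for (std::size_t j = 0; j < kTileN; ++j) {
                        acc[i][j] += a_frag[i] * b_frag[j];
                    }
                }
            }
            // Write the finished tile back to memory once.
            for (std::size_t i = 0; i < kTileM; ++i) {
                for (std::size_t j = 0; j < kTileN; ++j) {
                    C[(i0 + i) * N + (j0 + j)] = acc[i][j];
                }
            }
        }
    }
}
```

With a 4 x 4 tile, each step of the inner loop performs 8 loads for 16 multiply-adds, compared with 2 loads per multiply-add in an unblocked loop. That ratio is what register blocking is meant to improve.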
