Matrix Multiplication (GEMM) Register Tiling and Shared Memory Bandwidth Bound

Hi, I’m studying the SGEMM algorithm on CUDA, but I couldn’t figure out how shared memory bandwidth bottleneck is alleviated. I try to calculate the tiling size needed to achieve the theoretical peak floating-point performance (ignoring memory latency and only look at bandwidth).

Using V100 as the example (1.38GHz, 897GB/s DRAM, 80 SMs, 14.13 TFLOPs FP32), we can use shared memory tiling to enable the reuse of elements and fully alleviate the global memory bandwidth bottleneck by using a shared memory tile width bn >127:

897 GB/s > 14.13 TFlop/s * (4 byte/word * 2 word) / Flop * (1/127)

However, because the shared memory bandwidth is also a bottleneck, we need to use register blocking to alleviate it in a similar fashion. The shared memory bandwidth on V100 is:

1.41312 TB/s = 4 byte/bank * 32 bank/instruction * 1 instruction/cycle * 1.38GHz * 80

Then, we want to calculate the register tiling width needed. Denote register tiling width as rn, we want to satisfy:

14.13 TFlop/s * (4 byte/word * 2 word) / flop * (1/rn) <= 1.41312 TB/s

Which implies rn needs to be greater than 81. However this is clearly impossible, as a thread would need thousands of registers per thread to store outer product results. So the computation would be severly bottlenecked by shared memory bandwidth.

On the other hand, libraries like cuDNN can achieve 90% of the peak performance of a V100. How does it achieve such performance? Is there an error in my calculation or logic, or is there other optimization to reduce bandwidth requirement?

Sorry, I am not following this derivation. To the best of my knowledge, your description is generally correct, though: There are multiple levels of blocking, with the innermost level of blocking designed to make best possible use of the register file.

Best I know, at least some of the GEMM kernels in CUBLAS use handwritten SASS code for the major GPU architectures used in HPC to optimize register file access (the very large register file consists of multiple banks) and instruction selection and scheduling. NVIDIA does not make tools for programming at the SASS level publicly available. Conventional wisdom, applicable to all hardware platforms, says that compiled code is unlikely to achieve greater than 80% to 85% of theoretical perak throughput on GEMM.

Presumably CUDNN follows the same approach as CUBLAS of partially relying on hand-written kernels, and might even share code with it. It also makes use of Tensor Cores (specialized execution units for matrix products and dot products) where applicable. Programming at the SASS level may enable more efficient use of these than is available by other means, but I have no specific knowledge of that.

Below is how I derived the equations. For simplicity I was using square tiling. A tiling with tile width rn would enable an element to be reused rn times, which means it reduces the bandwidth requirment by rn times.

For single precision matrix multiplication computation, each multiplication in the dot products takes two multiplicants, and each of those multicants is a single-precision float. So if we want to achieve peak device of 14.13 TFlops and each operation needs to read two 4bytes operands, then the required bandwidth would be 14.13 * 2 * 4 = 113.04 TB/s. Tiling reduce this required bandwidth to (113.04 / rn) TB/s. Since the shared memory bandwidth is 1.413TB/s, we need rn to be larger than 81.

But these optimizations can only hide latency better and reduce bank conflicts to approach theoretical maximum right? They cannot break through the peak theoretical shared mem bandwidth. Tensor core also don’t do anything to alliviate bandwidth constraints… So my derivation must be wrong, but where is it wrong?

If the blocking is done correctly, GEMM peformance should be limited by register file bandwidth or instruction throughput, whichever comes first.

Tensor core use should reduce register bandwidth requirements because short dot-products are accumulated without requiring register file accesses for the temporary variables needed by a conventional discrete replacement.

Apologies if you’ve already seen this, but a good explanation of the various blocking levels, as they pertain to Cutlass, can be seen here.