Question about Roofline of TensorCore GEMM

I am benchmarking the cudaTensorCoreGemm sample (half precision) on an A100 GPU (PCIe, 40 GB version).

GPU Device 0: "Ampere" with compute capability 8.0

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
Time: 4.865024 ms
TFLOPS: 226.00
  • But when I used Nsight Compute to get the roofline of this kernel, the Floating Point Operation Roofline (Tensor Core) reported a very low achieved value of 7.085 TFLOP/s (clock-control=base). The peak of that roofline is 84 TFLOP/s.
  • I then set clock-control=none and raised the GPU SM clock to its maximum with nvidia-smi -ac=1215,1410. The achieved value became 9.95 TFLOP/s and the roofline peak 125 TFLOP/s (the actual SM frequency was 1.14 cycle/ns).

I would like to know why the achieved Tensor Core TFLOPS in the Nsight Compute roofline is so much lower than the value the application calculates.
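For reference, the application-side figure is consistent with counting 2·M·N·K floating-point operations for the GEMM and dividing by the measured time. The short Python sketch below reproduces it; the function name `gemm_tflops` is my own and not part of the sample, and the 2·M·N·K operation count is an assumption about how the sample computes its TFLOPS line.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """TFLOP/s for an MxNxK GEMM, assuming 2*M*N*K floating-point operations.

    Each output element needs K multiplies and K adds, hence the factor 2.
    """
    flop = 2.0 * m * n * k
    seconds = time_ms * 1e-3
    return flop / seconds / 1e12


if __name__ == "__main__":
    # Sizes and time taken from the program output above.
    print(f"TFLOPS: {gemm_tflops(8192, 8192, 8192, 4.865024):.2f}")
```

With the sizes and time printed above, this gives about 226 TFLOP/s, matching the value the sample reports.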

I checked the Roofline section (Tensor Core) and found the formula that Nsight Compute uses:
# Tensor Roofline

# SM freq
 sm__cycles_elapsed.avg.per_second                                        cycle/usecond                         764.99

# Theoretical Tensor Instructions Executed
 sm__inst_executed_pipe_tensor.sum.peak_sustained                            inst/cycle                            216

 # derived__sm__inst_executed_pipe_tensor_x512

 1e6 * 764.99 * 216 * 512 = 84.6 TFLOP/s (764MHz)
 1e9 * 1.14 * 216 * 512 = 126.07 TFLOP/s  (1.14GHz)
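The same arithmetic as a small Python sketch, in case it helps to check other clock settings. The function name and argument names are hypothetical; the 216 inst/cycle and the x512 factor are simply the values read from the section above, not something I derived independently.

```python
def tensor_roofline_peak_tflops(sm_freq_hz: float,
                                inst_per_cycle: float = 216,
                                flop_per_inst: float = 512) -> float:
    """Peak of the tensor roofline as computed from the section's metrics.

    sm_freq_hz     -- SM clock in Hz (sm__cycles_elapsed.avg.per_second)
    inst_per_cycle -- sm__inst_executed_pipe_tensor.sum.peak_sustained
    flop_per_inst  -- the x512 factor from the derived metric name
    """
    return sm_freq_hz * inst_per_cycle * flop_per_inst / 1e12


if __name__ == "__main__":
    # 764.99 cycle/usecond is 764.99 MHz; 1.14 cycle/nsecond is 1.14 GHz.
    print(f"base clock : {tensor_roofline_peak_tflops(764.99e6):.2f} TFLOP/s")
    print(f"1.14 GHz   : {tensor_roofline_peak_tflops(1.14e9):.2f} TFLOP/s")
```

This reproduces the two peak values I see in the roofline chart, roughly 84.6 and 126.07 TFLOP/s.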

But both 84.6 and 126 TFLOP/s are far below the 312 TFLOPS Tensor Core peak of the A100.
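For comparison, here is a rough sketch of where the 312 TFLOPS figure comes from. The hardware numbers are assumptions taken from NVIDIA's published A100 specifications as I understand them (108 SMs, 4 Tensor Cores per SM, 256 dense FP16 FMAs per Tensor Core per clock, 1410 MHz boost clock); none of them come from the Nsight Compute output above, and the function name is my own.

```python
def a100_dense_fp16_peak_tflops(num_sms: int = 108,
                                tensor_cores_per_sm: int = 4,
                                fma_per_tc_per_clk: int = 256,
                                boost_hz: float = 1.41e9) -> float:
    """Theoretical dense FP16 Tensor Core peak under the assumed A100 figures."""
    # One fused multiply-add counts as 2 floating-point operations.
    flop_per_clk = num_sms * tensor_cores_per_sm * fma_per_tc_per_clk * 2
    return flop_per_clk * boost_hz / 1e12


if __name__ == "__main__":
    print(f"assumed A100 peak: {a100_dense_fp16_peak_tflops():.2f} TFLOP/s")
    # FLOP per clock under these assumptions versus the roofline's 216 * 512.
    print(108 * 4 * 256 * 2, "vs", 216 * 512)
```

Under these assumptions the per-clock operation count is twice the 216 x 512 used by the roofline section, so the chart's peak would stay well below 312 TFLOPS even at the full boost clock.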

Hi, @weipenghui_666

Sorry for the late response. We checked internally and found that the tensor roofline chart currently only works on GV100. We’ll correct this to make it clear ASAP.
