Question about Roofline of TensorCore GEMM

I am benchmarking cudaTensorCoreGemm (half precision) on an A100 GPU (PCIe, 40 GB version).

GPU Device 0: "Ampere" with compute capability 8.0

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
Time: 4.865024 ms
TFLOPS: 226.00
  • But when I used Nsight Compute to get the roofline of this kernel, the Floating Point Operation Roofline (Tensor Core) reported a very low achieved value of 7.085 TFLOP/s (clock-control=base). The peak FLOPS of the roofline was 84 TFLOP/s.
  • I then set clock-control=none and set the GPU SM to its maximum frequency with nvidia-smi -ac=1215,1410. The achieved value became 9.95 TFLOP/s and the peak FLOPS of the roofline became 125 TFLOP/s (the actual SM frequency was 1.14 cycle/ns).

I would like to know why the achieved TFLOPS (Tensor Core) in the Nsight roofline is so much lower than the value calculated by the application.
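For reference, the application's figure can be reproduced from the printed time. This is a minimal sketch that assumes the sample counts 2·M·N·K floating-point operations (one multiply and one add per multiply-accumulate). The function name `gemm_tflops` is only illustrative and is not part of the sample.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """TFLOP/s of an M x N x K GEMM, assuming 2*M*N*K floating-point operations."""
    flops = 2 * m * n * k            # one multiply + one add per MAC (assumption)
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


# Values taken from the program output above.
app_tflops = gemm_tflops(8192, 8192, 8192, 4.865024)
print(f"application: {app_tflops:.2f} TFLOP/s")   # application: 226.00 TFLOP/s

# Ratio against the two achieved values reported by the Nsight roofline.
for achieved in (7.085, 9.95):
    print(f"Nsight achieved {achieved} TFLOP/s -> {app_tflops / achieved:.1f}x lower")
```

The result matches the 226.00 TFLOPS printed by the application, so the gap comes from how the roofline section counts operations, not from the application's arithmetic.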

I checked the Roofline section (Tensor Core) and found the formula Nsight uses:
# Tensor Roofline

# SM freq
 sm__cycles_elapsed.avg.per_second                                        cycle/usecond                         764.99

# Theoretical Tensor Instructions Executed
 sm__inst_executed_pipe_tensor.sum.peak_sustained                            inst/cycle                            216

 # derived__sm__inst_executed_pipe_tensor_x512

 1e6 * 764.99 * 216 * 512 = 84.6 TFLOP/s (764MHz)
 1e9 * 1.14 * 216 * 512 = 126.07 TFLOP/s  (1.14GHz)

But both 84.6 and 126 TFLOP/s are lower than the 312 TFLOPS peak of the A100 Tensor Cores.
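The roofline peak computed from the metrics above can be checked with a short script. This is only a sketch of the formula as I read it from the section file: the factor 512 is taken from the metric name derived__sm__inst_executed_pipe_tensor_x512, and the helper name `tensor_roofline_peak_tflops` is my own.

```python
FLOPS_PER_TENSOR_INST = 512   # assumed from the "_x512" suffix of the derived metric
A100_FP16_TENSOR_PEAK = 312   # TFLOPS, the A100 figure quoted above


def tensor_roofline_peak_tflops(sm_freq_hz: float, inst_per_cycle: float = 216) -> float:
    """Peak = SM frequency * peak tensor instructions per cycle * FLOPs per instruction."""
    return sm_freq_hz * inst_per_cycle * FLOPS_PER_TENSOR_INST / 1e12


# SM frequencies from the two profiling runs: 764.99 cycle/us and 1.14 GHz.
for label, hz in (("764.99 MHz", 764.99e6), ("1.14 GHz", 1.14e9)):
    peak = tensor_roofline_peak_tflops(hz)
    share = peak / A100_FP16_TENSOR_PEAK
    print(f"{label}: {peak:.2f} TFLOP/s ({share:.0%} of {A100_FP16_TENSOR_PEAK} TFLOPS)")
# 764.99 MHz: 84.60 TFLOP/s (27% of 312 TFLOPS)
# 1.14 GHz: 126.07 TFLOP/s (40% of 312 TFLOPS)
```

Even at 1.14 GHz the formula reaches only about 40% of the 312 TFLOPS figure, so the SM clock alone does not explain the difference.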

Hi, @weipenghui_666

Sorry for the late response. We checked internally and found that the tensor roofline chart currently only works on GV100. We'll correct this to make it clear ASAP.

