I am benchmarking cudaTensorCoreGemm (half precision) on an A100 GPU (PCIe, 40 GB version).
- cudaTensorCoreGemm is from the NVIDIA "cuda-samples" repository. It reports a performance of 226 TFLOP/s, which is close to the 312 TFLOP/s peak given in the A100 white paper. As far as I can see, the sample computes TFLOPS = 2 * M * N * K / time_used.
Example: cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm at master · NVIDIA/cuda-samples (github.com)
The run log is below:
Initializing...
GPU Device 0: "Ampere" with compute capability 8.0
M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 4.865024 ms
TFLOPS: 226.00
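As a sanity check, the TFLOPS figure in the log can be reproduced from the formula and the logged values. This is only a minimal Python sketch; the function name `gemm_tflops` is my own and is not part of the sample.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOP/s of an M x N x K GEMM, counting 2 FLOPs (multiply + add) per MAC."""
    return 2.0 * m * n * k / seconds / 1e12


# Values taken from the run log: M = N = K = 8192, Time = 4.865024 ms
tflops = gemm_tflops(8192, 8192, 8192, 4.865024e-3)
print(f"{tflops:.2f} TFLOP/s")  # about 226.00, matching the log
```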
- But when I used Nsight Compute to get the roofline of this kernel, the Floating Point Operation Roofline (Tensor Core) shows a very low achieved value of 7.085 TFLOP/s (clock-control=base). The peak FLOPS of the roofline is 84 TFLOP/s.
- I then set clock-control=none and set the GPU SM clock to its maximum frequency with
nvidia-smi -ac=1215,1410
With this, the achieved value is 9.95 TFLOP/s and the peak FLOPS of the roofline is 125 TFLOP/s (the actual SM frequency is 1.14 cycle/ns).
Why is the achieved TFLOP/s (Tensor Core) in the Nsight Compute roofline so much lower than the value the application calculates?
I checked the Roofline section (Tensor Core) and found the formula Nsight Compute uses:
# Tensor Roofline
# SM freq
sm__cycles_elapsed.avg.per_second cycle/usecond 764.99
# Theoretical Tensor Instructions Executed
sm__inst_executed_pipe_tensor.sum.peak_sustained inst/cycle 216
# derived__sm__inst_executed_pipe_tensor_x512
1e6 * 764.99 * 216 * 512 = 84.6 TFLOP/s (764MHz)
1e9 * 1.14 * 216 * 512 = 126.07 TFLOP/s (1.14GHz)
But both 84.6 and 126 TFLOP/s are well below the 312 TFLOP/s quoted for the A100 Tensor Cores.
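To make the arithmetic explicit, here is a small Python sketch of the roofline peak as I understand it from the section file: SM frequency times peak tensor instructions per cycle times 512. The function name and the reading of 512 as "FLOPs per tensor instruction" are my assumptions, not something confirmed by the Nsight Compute documentation.

```python
def tensor_roofline_peak_tflops(sm_hz: float, inst_per_cycle: float,
                                flop_per_inst: int = 512) -> float:
    """Peak TFLOP/s as sm_freq * tensor inst/cycle * assumed FLOPs per instruction."""
    return sm_hz * inst_per_cycle * flop_per_inst / 1e12


# 764.99 cycle/usecond = 764.99e6 Hz (base clock); 1.14 GHz = 1.14e9 Hz
peak_base = tensor_roofline_peak_tflops(764.99e6, 216)
peak_max = tensor_roofline_peak_tflops(1.14e9, 216)

print(f"base clock: {peak_base:.2f} TFLOP/s")  # about 84.60
print(f"max clock:  {peak_max:.2f} TFLOP/s")   # about 126.07

# Gap between the white paper figure and the roofline peak at 1.14 GHz
print(f"312 / peak_max = {312 / peak_max:.2f}")
```

Note that the frequency must be converted to Hz consistently: the 764.99 value is in cycle/usecond (MHz), while 1.14 is in GHz.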