Question about Roofline of TensorCore GEMM

weipenghui_666 · December 1, 2023, 8:06am

I am benchmarking cudaTensorCoreGemm (half precision) on an A100 GPU (PCIe, 40 GB version).

The cudaTensorCoreGemm sample is from the NVIDIA “cuda-samples” repository, and it reaches 226 TFLOP/s, which is reasonably close to the 312 TFLOP/s peak quoted in the A100 white paper.
The sample computes TFLOPS = 2 * M * N * K / time_used.
Example: cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm at master · NVIDIA/cuda-samples (github.com)
The run log is below:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.0

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
Time: 4.865024 ms
TFLOPS: 226.00
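
As a quick sanity check, the reported figure can be reproduced from the logged problem size and kernel time. This is a minimal sketch in Python; the function name `gemm_tflops` is made up for illustration, and it assumes the usual convention of counting one multiply and one add per inner-product term.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return GEMM throughput in TFLOP/s, counting 2 FLOP (multiply + add)
    for each of the m * n * k inner-product terms."""
    return 2 * m * n * k / seconds / 1e12


# Values taken from the run log above: M = N = K = 8192, Time = 4.865024 ms.
tflops = gemm_tflops(8192, 8192, 8192, 4.865024e-3)
print(f"{tflops:.2f} TFLOP/s")
```

The result agrees with the `TFLOPS: 226.00` line in the log, so the application-side number is at least internally consistent.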

But when I used Nsight Compute to get the roofline of this kernel, the Floating Point Operation Roofline (Tensor Core) reported a very low achieved value of 7.085 TFLOP/s (with clock-control=base). The peak FLOPS of the roofline was 84 TFLOP/s.
I then set clock-control=none and raised the GPU SM clock to its maximum with nvidia-smi -ac 1215,1410. The achieved value became 9.95 TFLOP/s and the roofline peak became 125 TFLOP/s (the actual SM frequency was 1.14 cycle/ns).

I would like to know why the achieved Tensor Core TFLOPS in the Nsight Compute roofline is so much lower than the value calculated by the application.

I checked the Roofline section (Tensor Core) and found the formula that Nsight Compute uses:
# Tensor Roofline

# SM freq
 sm__cycles_elapsed.avg.per_second                                        cycle/usecond                         764.99

# Theoretical Tensor Instructions Executed
 sm__inst_executed_pipe_tensor.sum.peak_sustained                            inst/cycle                            216

# derived__sm__inst_executed_pipe_tensor_x512

 1e6 * 764.99 * 216 * 512 = 84.6 TFLOP/s (764.99 MHz)
 1e9 * 1.14 * 216 * 512 = 126.07 TFLOP/s (1.14 GHz)

But both 84.6 and 126 TFLOP/s are lower than the 312 TFLOP/s quoted for the A100 Tensor Cores.
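
The arithmetic above can be replayed in a few lines, which also shows how much of the gap is due to clock speed alone. This is only a sketch: the helper name `roofline_peak_tflops` is hypothetical, and it assumes that the 312 TFLOP/s white paper figure is quoted at the 1410 MHz maximum SM clock (the same value passed to nvidia-smi above) and that peak throughput scales linearly with SM frequency.

```python
FLOP_PER_TENSOR_INST = 512  # multiplier implied by the metric name ..._pipe_tensor_x512


def roofline_peak_tflops(sm_hz: float, inst_per_cycle: float,
                         flop_per_inst: int = FLOP_PER_TENSOR_INST) -> float:
    """Peak = SM frequency * peak tensor instructions per cycle * FLOP per instruction."""
    return sm_hz * inst_per_cycle * flop_per_inst / 1e12


# Inputs copied from the profiler output quoted in the post.
peak_base = roofline_peak_tflops(764.99e6, 216)  # clock-control=base
peak_fast = roofline_peak_tflops(1.14e9, 216)    # clocks unlocked, 1.14 GHz measured

# Assumption: the white paper peak is stated at 1410 MHz; rescale it to 1.14 GHz.
WHITEPAPER_TFLOPS = 312.0
WHITEPAPER_SM_HZ = 1.41e9
scaled_whitepaper = WHITEPAPER_TFLOPS * 1.14e9 / WHITEPAPER_SM_HZ
ratio = scaled_whitepaper / peak_fast

print(f"roofline peak at base clock: {peak_base:.2f} TFLOP/s")
print(f"roofline peak at 1.14 GHz:   {peak_fast:.2f} TFLOP/s")
print(f"white paper peak rescaled to 1.14 GHz: {scaled_whitepaper:.1f} TFLOP/s")
print(f"remaining factor: {ratio:.2f}x")
```

Under these assumptions, the roofline peak stays roughly a factor of two below the white paper number even after matching clocks, so the difference cannot be explained by SM frequency alone.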

veraj · May 6, 2024, 10:00am

Hi, @weipenghui_666

Sorry for the late response. We checked internally and found that the tensor roofline chart currently only works on GV100. We’ll correct this to make it clear ASAP.

veraj · August 7, 2024, 4:34am

We have shipped the improved version of the Tensor Core Roofline. If needed, please update to the latest version to get full coverage of all supported Tensor Core formats on all supported chips.

Thanks!
