Why is the Compute Throughput value different from the actual Performance / Peak Performance?

TherLf · September 10, 2022, 2:46pm

I want to build a roofline model for my kernels, so I launch ncu with the following command:

ncu --csv --target-processes all --set roofline mpirun -n 1 ./run_pselinv_linux_release_v2.0 -H H3600.csc -file ./tmpfile

The roofline set collects enough data to build the roofline model, but I can’t work out exactly what each metric means.

The Compute (SM) Throughput comes from the metric sm__throughput.avg.pct_of_peak_sustained_elapsed, which reports 0.64%. I assumed this was the percentage of Peak Performance, but when I divide the Performance (6855693348.37) by the Peak Work (5080428410372), I get 0.13%, which is much lower than 0.64%.
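The division described above can be reproduced with a short Python sketch. The two numbers are the ones quoted in this post; the variable names are only illustrative.

```python
# Values quoted in the post above, as reported by ncu's roofline set.
achieved_performance = 6855693348.37  # "Performance"
peak_work = 5080428410372.0           # "Peak Work"
reported_sm_throughput_pct = 0.64     # sm__throughput.avg.pct_of_peak_sustained_elapsed

# Achieved fraction of the roofline compute ceiling, expressed as a percentage.
achieved_pct = 100.0 * achieved_performance / peak_work

print(f"Performance / Peak Work = {achieved_pct:.3f}%")
print(f"Reported SM Throughput  = {reported_sm_throughput_pct:.2f}%")
```

The two percentages disagree by roughly a factor of five, which is the discrepancy this thread is about.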

Besides, I want to collect the FLOPS and memory usage in my kernel, not their throughput.

So my questions are:

1. What is the real meaning of SM Throughput and Memory Throughput? Are they percentages of Peak Work and Peak Traffic? By the way, Peak Work and Peak Traffic are the peak performance and the peak DRAM bandwidth, respectively, right?
2. To get the actual FLOPs and memory usage of my kernel, I want to multiply the Compute (SM) Throughput by Peak Work to get the achieved performance, and then multiply that by the elapsed time to get the FLOPs. The same approach would apply to memory usage. Is this method correct?

I have been searching for an answer to this question for two days but still can’t find a clear one.

Sincerely thank you for your help!

TherLf · September 14, 2022, 2:58pm

I found the answer in this thread: Terminology used in Nsight Compute
In short, SM Throughput and Memory Throughput are each the maximum of a series of underlying metrics. I had tried to infer their meaning from their names alone, which was wrong.
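A minimal Python sketch of that "maximum of a series of metrics" idea follows. The breakdown names and values are hypothetical and made up for illustration; they are not real Nsight Compute metric names or measurements from my kernel.

```python
# Hypothetical per-unit utilization percentages (made-up values, not ncu output).
breakdown_pct = {
    "hypothetical_fp_pipe": 0.13,
    "hypothetical_issue_slots": 0.64,
    "hypothetical_other_pipe": 0.40,
}

# The headline throughput is the largest entry, not the FLOP utilization.
limiter, sm_throughput_pct = max(breakdown_pct.items(), key=lambda kv: kv[1])

print(f"Headline throughput: {sm_throughput_pct}% (set by {limiter})")
```

Under this reading, the headline number can exceed Performance / Peak Work whenever some unit other than the floating-point pipe is the busiest one.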

TherLf · September 15, 2022, 7:15am

By the way, the correct way to collect the FLOPs and memory usage for your model is described in this lab: Roofline Model on NVIDIA GPUs :-)
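As a rough sketch of that counter-based approach, the Python below turns raw ncu counters into a roofline point. It uses only the metric names that appear later in this thread, counts each FMA as 2 FLOPs, and applies the 512 FLOPs per tensor instruction factor discussed further down; these weights are assumptions here, add/mul counters are omitted, and the sample counter values are hypothetical.

```python
def roofline_point(counters, tensor_flops_per_inst=512):
    """Return (flops, seconds, flop_per_s, arithmetic_intensity) from raw counters."""
    # Kernel time = elapsed cycles / clock rate.
    seconds = counters["sm__cycles_elapsed.avg"] / counters["sm__cycles_elapsed.avg.per_second"]
    # Assumed weights: 2 FLOPs per FMA, tensor_flops_per_inst per tensor instruction.
    flops = (
        2 * counters.get("sm__sass_thread_inst_executed_op_ffma_pred_on.sum", 0)
        + 2 * counters.get("sm__sass_thread_inst_executed_op_hfma_pred_on.sum", 0)
        + tensor_flops_per_inst * counters.get("sm__inst_executed_pipe_tensor.sum", 0)
    )
    dram_bytes = counters["dram__bytes.sum"]
    flop_per_s = flops / seconds
    # Arithmetic intensity in FLOPs per DRAM byte; guard against zero traffic.
    intensity = flops / dram_bytes if dram_bytes else float("inf")
    return flops, seconds, flop_per_s, intensity


# Hypothetical counter values, for illustration only.
sample = {
    "sm__cycles_elapsed.avg": 1000,
    "sm__cycles_elapsed.avg.per_second": 1e9,
    "sm__sass_thread_inst_executed_op_ffma_pred_on.sum": 1000,
    "sm__sass_thread_inst_executed_op_hfma_pred_on.sum": 0,
    "sm__inst_executed_pipe_tensor.sum": 10,
    "dram__bytes.sum": 3560,
}
flops, seconds, flop_per_s, intensity = roofline_point(sample)
print(flops, seconds, flop_per_s, intensity)
```

The tensor factor is a parameter because, as the replies below show, the right value appears to depend on the GPU architecture.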

jmarusarz · September 19, 2022, 1:57pm

I’m glad you were able to find the information you needed. If you have any other questions, feel free to submit another forum question.

yupei1 · October 4, 2022, 10:32pm

Thank you for pointing to the Roofline Model on NVIDIA GPUs lab; it was very helpful. I am curious about the “512” in the 512 x sm__inst_executed_pipe_tensor.sum FLOP calculation. I think this is specific to V100, since the h884 instruction performs 512 FLOPs per instruction. For A100, it should be 4096 for h16816. Is this correct?
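The arithmetic behind those two numbers can be sketched in Python, assuming the instruction names encode the tile shape (h884 as m8n8k4, h16816 as m16n8k16) and that each fused multiply-add counts as 2 FLOPs. Whether sm__inst_executed_pipe_tensor.sum increments exactly once per such instruction is a separate assumption that this sketch does not verify.

```python
def mma_flops(m, n, k):
    """FLOPs for one m x n x k matrix multiply-accumulate, counting each FMA as 2 FLOPs."""
    return 2 * m * n * k


v100_h884 = mma_flops(8, 8, 4)       # assumed tile shape m8n8k4
a100_h16816 = mma_flops(16, 8, 16)   # assumed tile shape m16n8k16

print(v100_h884, a100_h16816)  # prints: 512 4096
```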

Another issue: when I profile a single kernel in cutlass_profiler with the command below, the reported DRAM read is less than the memory needed to store A, B, and C. With m = n = k = 224, the three FP16 matrices need 224^2 × 3 × 2 bytes = 294 KB, but the output shows 225 KB. For larger problems, the read is always larger than the matrix storage. Is this a measurement error, an ncu version mismatch, or am I misreading the output (attached)? Thank you for your help!
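The storage estimate can be checked with a few lines of Python. This assumes 2 bytes per element (FP16, as the h16816 kernel name suggests) and reads "KB" as 1024 bytes.

```python
m = n = k = 224
bytes_per_element = 2  # assumption: FP16 operands and output

# A is m x k, B is k x n, C is m x n.
elements = m * k + k * n + m * n
footprint_bytes = elements * bytes_per_element
footprint_kib = footprint_bytes / 1024

print(footprint_bytes, footprint_kib)  # prints: 301056 294.0
```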

I am using NCU Version 2021.2.2.0 (build 30282580) (public-release)
NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7
Using NVIDIA A100-SXM4-80GB GPU

allout.txt (25.1 KB)

sudo /usr/local/cuda/bin/ncu --target-processes all --metrics "sm__cycles_elapsed.avg,sm__cycles_elapsed.avg.per_second,sm__sass_thread_inst_executed_op_ffma_pred_on.sum,sm__sass_thread_inst_executed_op_hfma_pred_on.sum,sm__inst_executed_pipe_tensor.sum,dram__bytes.sum" cutlass_profiler --profiling-iterations=1 --verification-enabled=False --kernels=cutlass_tensorop_h16816gemm_256x128_32x3_nn --m=224 --n=224 --k=224 > allout.txt 2>&1

jmarusarz · October 13, 2022, 9:35pm

Thanks for reaching out. I need to investigate both of these issues and I will get back to you with more information.

jmarusarz · October 27, 2022, 8:38pm

Thanks for your patience. We’ve been discussing this internally. For the first issue, we believe this is probably a bug and the A100 roofline should be using a larger factor. However, we aren’t positive that 4096 is the exact number, and we are continuing to work on this.

For the second part, this does seem to be inconsistent with what we would expect, but there isn’t enough data here to determine the root cause. Would it be possible to collect the detailed set of sections using the "--set detailed" flag for a small and a large matrix and share the results? If you can’t share the full results, screenshots of the Memory Workload Analysis section would be most useful. We would like to try to identify where this difference could be occurring.

m_ali102 · October 28, 2022, 5:16am

@yupei1 I have the same doubts about the 512 x sm__inst_executed_pipe_tensor.sum FLOP calculation on Ampere GPUs. I think 2048 is the right factor for FP16 and 4096 is the right factor for INT8.

My calculations came from the following analysis:

The A100 peak performance is 1024 and 2048 FMA/cycle/SM for FP16 and INT8, respectively.
This leads to 2048 and 4096 OP/cycle/SM for FP16 and INT8, respectively, since 1 FMA = 2 OP.
It would be great if @jmarusarz can confirm these numbers.
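The conversion above can be sketched in Python. The per-SM FMA rates are the ones stated in this post; the SM count (108) and clock (1.41 GHz) are assumptions taken from published A100 specifications and are not stated anywhere in this thread.

```python
# Per-SM rates from the post above.
FMA_PER_CYCLE_PER_SM = {"FP16": 1024, "INT8": 2048}
OPS_PER_FMA = 2  # one multiply plus one add

ops_per_cycle_per_sm = {
    precision: rate * OPS_PER_FMA for precision, rate in FMA_PER_CYCLE_PER_SM.items()
}

# Assumptions not stated in this thread: published A100 SM count and boost clock.
NUM_SMS = 108
CLOCK_HZ = 1.41e9

# Device-wide peak in tera-operations per second.
peak_tops = {
    precision: ops * NUM_SMS * CLOCK_HZ / 1e12
    for precision, ops in ops_per_cycle_per_sm.items()
}

for precision in ops_per_cycle_per_sm:
    print(precision, ops_per_cycle_per_sm[precision], round(peak_tops[precision], 1))
```

Under those assumptions the FP16 figure works out to roughly 312 TFLOP/s. Whether 2048 is also the right multiplier for sm__inst_executed_pipe_tensor.sum is a separate question that this arithmetic does not settle.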
