Why is the Compute Throughput value different from the actual Performance / Peak Performance?

I want to build a roofline model for my kernels, so I launch ncu with the following command:

ncu --csv --target-processes all --set roofline mpirun -n 1 ./run_pselinv_linux_release_v2.0 -H H3600.csc -file ./tmpfile

The roofline set collects enough data to build the roofline model, but I can’t clearly figure out the meaning of each metric.

The Compute (SM) Throughput is collected by the metric sm__throughput.avg.pct_of_peak_sustained_elapsed, which reports 0.64%. I assumed this is the percentage of Peak Performance. But when I divide the Performance (6855693348.37) by the Peak Work (5080428410372), I get 0.13%, which is much lower than 0.64%.
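For reference, the division above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the two inputs are the values quoted in this post, and treating them as being in the same unit (e.g. FLOP/s) is an assumption.

```python
# Values copied from the post above; the shared unit (e.g. FLOP/s) is assumed.
performance = 6855693348.37   # "Performance" from the roofline section
peak_work = 5080428410372     # "Peak Work" from the roofline section

# Achieved fraction of the roofline ceiling, as a percentage.
ratio_pct = performance / peak_work * 100
print(f"Performance / Peak Work = {ratio_pct:.2f}%")

# Compare with the reported SM Throughput percentage.
sm_throughput_pct = 0.64
gap = sm_throughput_pct / ratio_pct
print(f"SM Throughput is about {gap:.1f}x the Performance / Peak Work ratio")
```

The two percentages differ by a factor of several, so they cannot be the same quantity with a rounding difference.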

Besides, I want to collect the FLOPS and memory usage in my kernel, not their throughput.

So my question is:

  1. What is the real meaning of SM Throughput and Memory Throughput? Are they percentages of Peak Work and Peak Traffic? And Peak Work and Peak Traffic are the Peak Performance and the Peak Bandwidth of DRAM respectively, right?

  2. To get the real FLOPS and memory usage of my kernel, I want to multiply the Compute (SM) Throughput by Peak Work to get the real-time Performance, then multiply the real-time Performance by the elapsed time to get the FLOPS. The same goes for memory usage. Is my method correct?

I have been searching for an answer to this question for two days but still can’t find a clear one.

Sincerely thank you for your help!

I found the answer in this question: Terminology used in Nsight Compute
In short, SM Throughput and Memory Throughput are each the maximum of a series of metrics. I had tried to understand their meanings from their names alone, which was totally wrong.
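To make the "maximum of a series of metrics" idea concrete, here is a minimal sketch. The sub-metric names and values are invented for illustration (they are not real Nsight Compute metric names or data from my run); they are chosen only so that the maximum matches the 0.64% from my question.

```python
# Hypothetical breakdown: names and values are made up for illustration.
sm_breakdown_pct = {
    "fp64_pipe_utilization": 0.13,
    "instruction_issue": 0.64,
    "instructions_executed": 0.41,
}

def throughput_pct(breakdown):
    """Return the limiting sub-metric and its value (the maximum entry)."""
    name = max(breakdown, key=breakdown.get)
    return name, breakdown[name]

limiter, sm_throughput = throughput_pct(sm_breakdown_pct)
print(f"SM Throughput = {sm_throughput}% (set by {limiter})")
```

Under this reading, the headline percentage can come from a sub-metric that has nothing to do with floating-point work, so multiplying it by Peak Work would not give the kernel's FLOP rate.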


By the way, the correct way to collect the FLOPS and memory usage of your kernel is described in this lab: Roofline Model on NVIDIA GPUs :-)
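As I understand it, that kind of methodology counts executed floating-point instructions and DRAM bytes directly instead of starting from a throughput percentage. The sketch below shows only the arithmetic for double precision; the function name and inputs are hypothetical, and the metric names in the comments are the ones I believe Nsight Compute exposes, so please check them with ncu --query-metrics on your version.

```python
def roofline_point(dadd, dmul, dfma, dram_bytes, time_s):
    """Compute FLOP count, FLOP/s and arithmetic intensity for one kernel.

    Assumed sources of the inputs (verify on your ncu version):
      dadd       <- sm__sass_thread_inst_executed_op_dadd_pred_on.sum
      dmul       <- sm__sass_thread_inst_executed_op_dmul_pred_on.sum
      dfma       <- sm__sass_thread_inst_executed_op_dfma_pred_on.sum
      dram_bytes <- dram__bytes.sum
      time_s     <- kernel duration in seconds
    """
    # A fused multiply-add counts as two floating-point operations.
    flops = dadd + dmul + 2 * dfma
    performance = flops / time_s       # FLOP/s
    intensity = flops / dram_bytes     # FLOP per DRAM byte
    return flops, performance, intensity

# Toy numbers, not measurements.
flops, perf, ai = roofline_point(dadd=100, dmul=50, dfma=25,
                                 dram_bytes=400, time_s=2.0)
print(flops, perf, ai)
```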

I’m glad you were able to find the information you needed. If you have any other questions, feel free to submit another forum question.