I want to build a roofline model for my kernels, so I launched ncu with the following command:
ncu --csv --target-processes all --set roofline mpirun -n 1 ./run_pselinv_linux_release_v2.0 -H H3600.csc -file ./tmpfile
The roofline set collects enough data to build the roofline model, but I can’t work out exactly what each metric means.
Compute (SM) Throughput is collected by the metric sm__throughput.avg.pct_of_peak_sustained_elapsed, which is 0.64%. I think it is the percentage of peak performance. But when I divide Performance (6855693348.37) by Peak Work (5080428410372), I get 0.13%, which is much lower than 0.64%.
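The division described above can be reproduced with a minimal Python sketch. The two numbers are the ones quoted in the post; the variable names are illustrative, and the FLOP/s unit is an assumption about how the roofline section reports them.

```python
# Values quoted in the post; names are illustrative, units assumed to be FLOP/s.
performance = 6855693348.37   # achieved Performance reported by ncu
peak_work = 5080428410372     # Peak Work reported by ncu

# Achieved performance as a percentage of Peak Work.
pct_of_peak = 100.0 * performance / peak_work
print(f"{pct_of_peak:.2f}% of Peak Work")  # prints 0.13% of Peak Work
```

This confirms that the 0.13% figure is a plain ratio of the two roofline values, so the 0.64% reported by sm__throughput must be measuring something else.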
Besides, I want to collect the memory usage of my kernel, not its throughput.
So my questions are:
What is the real meaning of SM Throughput and Memory Throughput? Are they the percentages of Peak Work and Peak Traffic? By the way, Peak Work and Peak Traffic are the peak performance and the peak DRAM bandwidth respectively, right?
To get the real memory usage of my kernel, I want to multiply Compute (SM) Throughput by Peak Work to get the real-time Performance, then multiply the real-time Performance by the elapsed time to get the number of FLOPs. The same goes for memory usage. Is my method correct?
I have been searching for an answer to this question for two days but still can’t find a clear one.
Sincerely thank you for your help!
I found the answer in this question: Terminology used in Nsight Compute
In short, SM Throughput and Memory Throughput are each the maximum of a series of metrics. I had tried to understand their meaning from their names alone, which was totally wrong.
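To illustrate the "maximum of a series of metrics" idea, here is a minimal Python sketch. The breakdown labels and values are hypothetical placeholders, not real Nsight Compute metric names or measured data; only the max-over-breakdown step reflects what the linked answer describes.

```python
# Hypothetical breakdown values (percent of peak). The labels are placeholders,
# not real Nsight Compute metric names, and the numbers are made up.
sm_breakdown_pct = {
    "unit_a": 0.64,
    "unit_b": 0.21,
    "unit_c": 0.05,
}

def throughput_pct(breakdown):
    """Return the largest breakdown value, i.e. the busiest sub-unit."""
    return max(breakdown.values())

sm_throughput = throughput_pct(sm_breakdown_pct)
print(f"SM Throughput: {sm_throughput:.2f}%")
```

Under this reading, the headline percentage reflects whichever sub-unit is closest to its own peak, which is why it need not match achieved FLOP/s divided by Peak Work.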
By the way, the correct way to collect the FLOPS and memory usage of your model is described in this lab: Roofline Model on NVIDIA GPUs :-)
I’m glad you were able to find the information you needed. If you have any other questions, feel free to submit another forum question.
Thank you for pointing to the Roofline Model on NVIDIA GPUs lab. That was very helpful. But I am curious about the “512” in the 512 x sm__inst_executed_pipe_tensor.sum FLOP calculation. I think this is specific to V100, since the h884 instruction does 512 FLOPs per instruction. For A100, it should be 4096 for h16816. Is this correct?
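The 512 and 4096 figures can be reproduced with a short Python sketch. It assumes the instruction suffix encodes the tile shape (884 as m=8, n=8, k=4 and 16816 as m=16, n=8, k=16) and that each multiply-add counts as 2 FLOPs. The helper names are hypothetical. This only checks the arithmetic; it does not confirm which factor ncu should apply per counted instruction.

```python
def mma_flops(m, n, k):
    """FLOPs for one m x n x k matrix multiply-accumulate tile:
    m*n*k multiply-adds, each counted as 2 floating-point operations."""
    return 2 * m * n * k

def tensor_flops(inst_count, flops_per_inst):
    """Total tensor FLOPs, assuming every counted instruction does the same work."""
    return inst_count * flops_per_inst

h884_flops = mma_flops(8, 8, 4)       # assumed shape for h884 (V100)
h16816_flops = mma_flops(16, 8, 16)   # assumed shape for h16816 (A100)
print(h884_flops, h16816_flops)       # prints 512 4096
```

With an instruction count from sm__inst_executed_pipe_tensor.sum, `tensor_flops(count, h16816_flops)` would give the A100 estimate under these assumptions.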
Another issue: when I profile a single kernel with cutlass_profiler as shown below, the DRAM read is less than the memory needed to store A, B and C (m, n and k are all 224, so this needs 224^2 x 3 x 2 bytes = 294KB). But the output shows a value of 225KB. For larger problems, the read is always larger than the matrix storage. So is this a measurement error, an ncu version mismatch, or am I reading the output incorrectly (output attached)? Thank you for your help!
I am using NCU Version 2021.2.2.0 (build 30282580) (public-release)
NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7
Using NVIDIA A100-SXM4-80GB GPU
allout.txt (25.1 KB)
sudo /usr/local/cuda/bin/ncu --target-processes all --metrics "sm__cycles_elapsed.avg,sm__cycles_elapsed.avg.per_second,sm__sass_thread_inst_executed_op_ffma_pred_on.sum,sm__sass_thread_inst_executed_op_hfma_pred_on.sum,sm__inst_executed_pipe_tensor.sum,dram__bytes.sum" cutlass_profiler --profiling-iterations=1 --verification-enabled=False --kernels=cutlass_tensorop_h16816gemm_256x128_32x3_nn --m=224 --n=224 --k=224 > allout.txt 2>&1
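The storage estimate in this post can be checked with a minimal Python sketch. It assumes all three matrices hold 2-byte FP16 elements (suggested by the "h" in the kernel name, but an assumption nonetheless); the function name is hypothetical.

```python
def gemm_storage_bytes(m, n, k, bytes_per_element=2):
    """Bytes needed to hold A (m x k), B (k x n) and C (m x n) once.
    bytes_per_element=2 assumes FP16 for all three matrices."""
    elements = m * k + k * n + m * n
    return elements * bytes_per_element

total_bytes = gemm_storage_bytes(224, 224, 224)
print(total_bytes, total_bytes / 1024)  # prints 301056 294.0
```

So the 294KB figure corresponds to 224^2 elements x 3 matrices x 2 bytes, and the reported 225KB of DRAM traffic is indeed below it.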
Thanks for reaching out. I need to investigate both of these issues and I will get back to you with more information.
Thanks for your patience. We’ve been discussing this internally. For the first issue, we believe this is probably a bug and the A100 Roofline should be using a larger factor. However, we aren’t positive that 4096 is the exact number, and we are continuing to work on this.
For the second part, this does seem to be inconsistent with what we would expect, but there isn’t enough data here to determine the root cause. Would it be possible to collect the detailed set of sections using the "--set detailed" flag for a small and large matrix and share the results? If you can’t share the full results, screenshots of the Memory Workload Analysis section would be most useful. We would like to try and identify where this difference could be occurring.
@yupei1 I have the same doubts about the 512 x sm__inst_executed_pipe_tensor.sum FLOP calculation on Ampere GPUs. I think 2048 is the right factor for FP16 and 4096 is the right factor for INT8.
My calculations came from the following analysis:
The A100 peak performance is 1024 and 2048 FMA/cycle/SM for FP16 and INT8, respectively.
Since 1 FMA = 2 OP, this gives 2048 and 4096 OP/cycle/SM for FP16 and INT8, respectively.
It would be great if @jmarusarz can confirm these numbers.
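The FMA-to-OP conversion in this post can be sketched in Python as follows. The per-SM FMA rates are the ones quoted in the post. The SM count (108) and clock (1.41 GHz) are assumptions taken from the public A100 specification and are only used to scale the per-SM figure to a whole-GPU peak. The function names are hypothetical.

```python
def ops_per_cycle_per_sm(fma_per_cycle_per_sm):
    """1 FMA = 2 operations (one multiply, one add)."""
    return 2 * fma_per_cycle_per_sm

def peak_ops_per_second(fma_per_cycle_per_sm, sm_count, clock_hz):
    """Whole-GPU peak: per-SM ops per cycle x number of SMs x clock rate."""
    return ops_per_cycle_per_sm(fma_per_cycle_per_sm) * sm_count * clock_hz

A100_SM_COUNT = 108      # assumption: public A100 spec
A100_CLOCK_HZ = 1.41e9   # assumption: public A100 boost clock

fp16_ops = ops_per_cycle_per_sm(1024)   # 2048 OP/cycle/SM
int8_ops = ops_per_cycle_per_sm(2048)   # 4096 OP/cycle/SM
fp16_peak = peak_ops_per_second(1024, A100_SM_COUNT, A100_CLOCK_HZ)
print(fp16_ops, int8_ops, f"{fp16_peak / 1e12:.1f} TOP/s")
```

Note that these are per-cycle, per-SM rates. Whether they are also the right per-instruction factor for sm__inst_executed_pipe_tensor.sum is exactly the question left open in this thread.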