Is the peak traffic value in the Roofline Model the peak bandwidth of the GPU alone, or the peak bandwidth achieved by the program?
In general, peak traffic in the Roofline Model is the peak bandwidth, which depends only on the hardware. The ProfilingGuide also explains it as the GPU memory transfer speed.
However, the peak traffic in the default section file is defined in terms of lts__t_bytes.sum.peak_sustained/lts__cycles_elapsed.avg.per_second, which looks like the peak bandwidth that the program reached. In practice, different programs do show different peak traffic.
So how should peak traffic be understood?
If we need a peak traffic value that depends only on the GPU, what can we do?
The peak is only related to the GPU and does not vary with the activity of the workload. The names of the metrics may be confusing, but the peak_sustained metric is basically the “peak value that the GPU could possibly sustain regardless of workload”. It’s hardcoded per GPU and is not related to anything sustained by the application. The per_second metric is just used to calculate the cycles/sec (clock speed) of the GPU, which is needed for the various ratios.
At the top of the report details page there is a “SM Frequency” metric. Can you check what that value is for the 2 results? It’s likely they are different and the roof is calculated based on the observed frequency during the run.
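To illustrate why the roof can move between runs even though the per-cycle peak is fixed, here is a minimal Python sketch. It assumes the roof is the hardcoded per-cycle peak multiplied by the observed clock rate; the function name and the numbers are hypothetical and chosen only for illustration, not taken from any real GPU or from the section file.

```python
def roofline_peak_bandwidth(peak_bytes_per_cycle, cycles_per_second):
    """Return a bandwidth roof in bytes/second.

    Assumption: roof = (hardcoded bytes per cycle) * (observed clock rate).
    """
    return peak_bytes_per_cycle * cycles_per_second


# Hypothetical per-cycle peak; a real value would come from a *.peak_sustained metric.
peak_bytes_per_cycle = 2048.0

# Two hypothetical runs observed at different clock rates (cycles per second).
roof_run_a = roofline_peak_bandwidth(peak_bytes_per_cycle, 1.0e9)
roof_run_b = roofline_peak_bandwidth(peak_bytes_per_cycle, 1.5e9)

print(f"run A roof: {roof_run_a / 1e9:.1f} GB/s")
print(f"run B roof: {roof_run_b / 1e9:.1f} GB/s")
```

The per-cycle peak is identical in both runs; only the clock rate differs, which would be enough to give two applications different roofs.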
Yes, I see that the “SM Frequency” and “DRAM Frequency” differ between applications. Is the “DRAM Frequency” equal to dram_cycles_elapsed.avg.per_second?
Got it! Is there any way to get the maximum frequency of the SM, L1 cache, L2 cache, and DRAM? Which of these frequencies is the GPU Boost Clock given in the white paper?
You can use the nvidia-smi utility to query your GPU's possible clock rates for SM and memory; see nvidia-smi clocks -h for more details. You can use nvidia-smi --lock-gpu-clocks/--reset-gpu-clocks in combination with ncu --clock-control none to have the clocks set externally to Nsight Compute.
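As a rough sketch of that workflow, the Python below queries the maximum SM and memory clocks through nvidia-smi's CSV query interface and builds the lock, profile, reset command sequence. The helper names, the 1500 MHz value, and ./my_app are hypothetical; it assumes the clocks.max.sm and clocks.max.mem query fields are available on your driver, and locking clocks usually requires administrator rights.

```python
import subprocess

# nvidia-smi query for the maximum SM and memory clocks, in MHz, one line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=clocks.max.sm,clocks.max.mem",
         "--format=csv,noheader,nounits"]


def parse_max_clocks(csv_text):
    """Parse lines like '1980, 9501' into (sm_mhz, mem_mhz) tuples."""
    clocks = []
    for line in csv_text.strip().splitlines():
        sm, mem = (int(field.strip()) for field in line.split(","))
        clocks.append((sm, mem))
    return clocks


def query_max_clocks():
    """Run nvidia-smi and parse its output; needs an NVIDIA driver installed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_max_clocks(out)


def locked_profile_commands(sm_mhz, app):
    """Build the commands to pin the SM clock, profile with ncu, then reset."""
    return [
        ["nvidia-smi", f"--lock-gpu-clocks={sm_mhz},{sm_mhz}"],
        ["ncu", "--clock-control", "none", app],
        ["nvidia-smi", "--reset-gpu-clocks"],
    ]


# Hypothetical usage: print the command sequence without executing it.
for cmd in locked_profile_commands(1500, "./my_app"):
    print(" ".join(cmd))
```

Calling query_max_clocks() on a machine with a GPU would return one tuple per device; the parser assumes numeric fields and would need extra handling if a field is reported as unavailable.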