How to profile overall SM utilization of a program with Nsight Compute?

First, could I get the time for each kernel by sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second?
Second, since I can profile sm__throughput.avg.pct_of_peak_sustained_elapsed, how could I compute the program-level utilization? Should I use kernel_time (from the first question) * sm__throughput.avg.pct_of_peak_sustained_elapsed, then divide by the total program duration?


Are you looking for some type of value representing how much the SMs were used compared to the entire application (which may include CPU time etc…)? For example, if the program took 10 seconds, the kernel took 5 of those seconds, and during the kernel there was an average of 50% SM utilization, then the value you’re looking for is (0.5 x 5)/10 = 25%

If that’s the case, then you’re on the right track. The only thing I would mention is that instead of calculating kernel time with the formula above, you could use gpu__time_duration.sum, which reports it directly.
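As a rough sketch of the calculation discussed above (not an official Nsight Compute workflow), the following Python snippet weights each kernel's SM throughput by its duration and divides by the total program time. The function names and the input layout are hypothetical. It assumes the per-kernel durations have already been converted to seconds, and that the total program duration was measured separately, since the profiler does not report it.

```python
from typing import Iterable, Tuple


def kernel_time_seconds(cycles_elapsed: float, cycles_per_second: float) -> float:
    """Kernel duration as elapsed cycles divided by clock rate,
    i.e. sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second."""
    return cycles_elapsed / cycles_per_second


def program_sm_utilization(
    kernels: Iterable[Tuple[float, float]],
    program_seconds: float,
) -> float:
    """Time-weighted SM utilization over the whole program, in percent.

    kernels: (kernel_seconds, sm_throughput_pct) pairs, one per profiled
        kernel (hypothetical layout; fill it from your own profiler export).
    program_seconds: duration of the whole program, assumed to be measured
        outside the profiler.
    """
    if program_seconds <= 0:
        raise ValueError("program_seconds must be positive")
    # Each kernel contributes its duration scaled by its SM throughput.
    busy_seconds = sum(seconds * pct / 100.0 for seconds, pct in kernels)
    return 100.0 * busy_seconds / program_seconds


# The example from the reply: a 10 s program, one 5 s kernel at 50% SM throughput.
print(program_sm_utilization([(5.0, 50.0)], 10.0))  # 25.0
```

With several kernels, pass one pair per kernel launch; time spent outside kernels counts as 0% utilization because it only appears in the denominator.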

If you’re looking for something else, please clarify. Thanks.

Hi jmarusarz.
Thanks, that is what I am looking for.
Also, if I plan to profile the memory bandwidth usage of the program, should I use gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed or gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed?


gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed is what you’re looking for if you want to know about the GPU DRAM usage. The other metric includes other levels of the memory hierarchy, like L1 and L2.

Hi jmarusarz,
I have a new question.
How could I have the SM occupancy?

I see there is a metric, Achieved Occupancy, which is the ratio of the average active warps to the maximum active warps allocated for the kernel.

Does this mean that if the kernel is allocated 10 warps, and 6 warps are used for computation on average, then Achieved Occupancy is 60%?

Or should I use Speed of Light SM [%] x Achieved Occupancy to get the warps (threads) that are actually occupied?


Achieved Occupancy is active warps/active cycles. The value represents how many warps were active on average for a given cycle. For example, on GA100 this would be between 0 and 16. This occupancy can be impacted by the way your application divides the work and also hardware resource limitations like the register file, shared memory etc…

Speed of Light SM [%] x Achieved Occupancy isn’t really a calculation we would use.

Do you mean there is no way to get the kernel SM occupancy?

I’m not sure what you mean by “kernel SM occupancy”. Could you expand on that term to explain exactly what you are looking for?

Hi Jmarusarz,
Do you mean the max_warps_per_sm for A100 is 16?

Also, I found a formula for Achieved Occupancy, which is equal to (Active_warps / Active_cycles) / max_warps_per_sm.

As you said before, the current metric, achieved_occupancy, is equal to (active_warps / active_cycles), right?

No, max_warps_per_sm for A100 is 64. The 16 is warps per warp scheduler, and there are 4 warp schedulers per SM.
On A100, (Active_warps / Active_cycles) / max_warps_per_sm is going to be a percentage, while (active_warps / active_cycles) will be a fractional value between 1 and 16.
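To make the difference between the two forms concrete, here is a minimal Python sketch of the formula quoted in the question. The function name and the counter values are made up for illustration. It assumes the warp count and the limit are expressed at the same scope (per warp scheduler with a limit of 16, or per SM with a limit of 64 on A100).

```python
def occupancy_percent(active_warps: float, active_cycles: float, max_warps: float) -> float:
    """Achieved occupancy as a percentage of the warp limit.

    active_warps / active_cycles is the average number of warps active
    per active cycle; dividing by max_warps turns it into a ratio.
    max_warps must use the same scope as the counters (assumption:
    16 per warp scheduler or 64 per SM on A100, as stated in the thread).
    """
    if active_cycles <= 0 or max_warps <= 0:
        raise ValueError("active_cycles and max_warps must be positive")
    average_active_warps = active_warps / active_cycles
    return 100.0 * average_active_warps / max_warps


# Made-up per-SM counters: 32 warps active on average against a limit of 64.
print(occupancy_percent(3_200_000, 100_000, 64))  # 50.0
```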