First, could I get the time for each kernel as `sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second`?

Second, since I can profile `sm__throughput.avg.pct_of_peak_sustained_elapsed`, how could I compute the program-level utilization? Should I use `kernel_time (from first question) * sm__throughput.avg.pct_of_peak_sustained_elapsed`, then divide by `total program duration`?

Are you looking for a value representing how much the SMs were used relative to the entire application (which may include CPU time, etc.)? For example, if the program took 10 seconds, the kernel took 5 of those seconds, and during the kernel there was an average of 50% SM utilization, then the value you're looking for is (0.5 x 5)/10 = 25%.

If that's the case, then you're on the right track. The only thing I would mention is that, instead of calculating kernel time with the formula above, you could use `gpu__time_duration.sum`, which reports it directly.
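For anyone following along, the calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the per-kernel durations are assumed to have already been converted to seconds (check the unit your report uses for `gpu__time_duration.sum`), and the percentages are the per-kernel `sm__throughput.avg.pct_of_peak_sustained_elapsed` values.

```python
def kernel_time_seconds(cycles_elapsed, cycles_per_second):
    """Kernel duration from elapsed SM cycles and the SM clock rate.

    Mirrors sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second.
    """
    return cycles_elapsed / cycles_per_second


def program_sm_utilization(kernels, total_program_seconds):
    """Time-weighted SM utilization over the whole program, as a fraction.

    kernels: iterable of (kernel_seconds, sm_throughput_pct) pairs,
             one pair per profiled kernel launch (hypothetical layout).
    total_program_seconds: wall-clock duration of the whole application,
             including CPU-only time.
    """
    # Each kernel contributes its duration scaled by how busy the SMs were.
    busy_seconds = sum(seconds * pct / 100.0 for seconds, pct in kernels)
    return busy_seconds / total_program_seconds


# The example from the reply above: a 10 s program with one 5 s kernel
# that averaged 50% SM throughput.
print(program_sm_utilization([(5.0, 50.0)], 10.0))  # 0.25
```

With several kernels, pass one pair per launch; time spent outside kernels counts as zero utilization because it only appears in the denominator.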

If you're looking for something else, please clarify. Thanks.

Hi jmarusarz.

Thanks, that is what I am looking for.

Also, if I plan to profile the memory bandwidth usage of the program, should I use `gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed` or `gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed`?

`gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed` is what you're looking for if you want to know about GPU DRAM usage. The other metric includes other levels of the memory hierarchy, like L1 and L2.

Hi jmarusarz,

I have a new question.

How can I get the SM occupancy?

I see there is a metric, **Achieved Occupancy**, which is the ratio of the average active warps to the maximum active warps allocated for the kernel.

Does this mean that if the kernel is allocated 10 warps, and 6 warps are used for computation on average, then **Achieved Occupancy** is 60%?

Or should I use **Speed of Light SM [%]** x **Achieved Occupancy** to get the actually occupied warps (threads)?

Achieved Occupancy is active warps/active cycles. The value represents how many warps were active on average for a given cycle. For example, on GA100 this would be between 0 and 16. This occupancy can be impacted by the way your application divides the work and also by hardware resource limitations like the register file, shared memory, etc.

**Speed of Light SM [%]** x **Achieved Occupancy** isn't really a calculation we would use.

Do you mean there is no way to get the kernel SM occupancy?

I'm not sure what you mean by "kernel SM occupancy". Could you expand on that term to explain exactly what you are looking for?

Hi Jmarusarz,

Do you mean the max_warps_per_sm for A100 is 16?

Also, I found a formula for Achieved Occupancy, which is equal to (Active_warps / Active_cycles) / max_warps_per_sm.

As you said before, the current metric, achieved_occupancy, is equal to (active_warps / active_cycles), right?

No, max_warps_per_sm for A100 is 64. The 16 is warps per warp scheduler, and there are 4 schedulers per SM.

On A100, *(Active_warps / Active_cycles) / max_warps_per_sm* is going to be a percentage, while *(active_warps / active_cycles)* will be a raw warp count between 0 and 16.
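To make the difference between the two forms concrete, here is a minimal Python sketch of the normalization. The function and constant names are hypothetical; the A100 limits (16 warps per scheduler, 4 schedulers per SM, 64 warps per SM) are the ones quoted in this thread. One assumption to check against your own report: the limit you divide by has to match the scope of the warp count, so a per-scheduler average is divided by 16 and a per-SM average by 64.

```python
# A100 limits as quoted in this thread.
A100_WARPS_PER_SCHEDULER = 16
A100_SCHEDULERS_PER_SM = 4
A100_MAX_WARPS_PER_SM = A100_WARPS_PER_SCHEDULER * A100_SCHEDULERS_PER_SM  # 64


def occupancy_percent(active_warps, active_cycles, max_warps):
    """Convert an active-warp total into an occupancy percentage.

    active_warps:  active warps summed over the sampled cycles
    active_cycles: number of cycles in which the unit was active
    max_warps:     warp limit for the same unit the counts refer to
                   (16 per scheduler or 64 per SM on A100)
    """
    avg_active_warps = active_warps / active_cycles  # raw value, 0..max_warps
    return 100.0 * avg_active_warps / max_warps


# Hypothetical counts: an average of 8 active warps per cycle on one scheduler.
print(occupancy_percent(8_000, 1_000, A100_WARPS_PER_SCHEDULER))  # 50.0
```

Dividing that same per-scheduler average by 64 instead of 16 would understate the occupancy by a factor of four, which is why the scope of the denominator matters.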