Achieved occupancy reported in Nsight Compute

I recently profiled an application and was puzzled by the achieved occupancy number reported in Nsight Compute.

Nsight Compute reports active warps per scheduler in the Scheduler Statistics section and achieved occupancy in the Occupancy section. My understanding is that dividing the active warps per scheduler by the maximum warps per scheduler should give an occupancy roughly equal to the achieved occupancy reported in the Occupancy section.
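To make the conversion I have in mind concrete, here is a minimal sketch. The function name and the numbers (8 active warps, a limit of 16 warps per scheduler) are assumptions for illustration only; the real limit depends on the GPU architecture.

```python
def occupancy_from_active_warps(active_warps_per_scheduler, max_warps_per_scheduler):
    """Convert average active warps per scheduler into an occupancy percentage."""
    if max_warps_per_scheduler <= 0:
        raise ValueError("max_warps_per_scheduler must be positive")
    return 100.0 * active_warps_per_scheduler / max_warps_per_scheduler


# Hypothetical values, not taken from a real profile.
estimate = occupancy_from_active_warps(8.0, 16)
print(f"estimated occupancy: {estimate:.1f}%")
```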

However, this is not the case for the application I am looking at, and I am wondering what causes the difference. Checking the metrics behind the two values, I found that active warps per scheduler uses smsp__warps_active.avg.per_cycle_active and achieved occupancy uses sm__warps_active.avg.pct_of_peak_sustained_active. Is the first one the average over all warp schedulers of a particular SM, and the latter the average over all warp schedulers of all SMs? If so, would very different occupancy numbers here suggest a load imbalance across SMs?

The table shows a timeline, in elapsed cycles from 0 to 24, of a single SM with 4 SM sub-partitions (SMSPs). A thread block of 256 threads (8 warps) is launched, and each SMSP is allocated 2 warps. The higher warps exit early, resulting in an imbalance between the SMSPs.

Nsight Compute is focused on single kernel profiling. The assumption is that the GPU is active for 100% of the elapsed cycles, because the PM system measures from the launch of the grid to the completion of the grid. As a result, Nsight Compute tends to use cycles_active rather than cycles_elapsed as the denominator when converting to a percentage. Timeline-based tools such as Nsight Systems and Nsight Graphics GPU Trace use cycles_elapsed, as there is no expectation that the GPU will be active.

sm__cycles_active increments if the SM has at least 1 active warp.
smsp__cycles_active increments if the SMSP has at least 1 active warp.
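The two counter definitions above can be illustrated with a small simulation. This is only a simplified model: the warp lifetimes, the `w % 4` assignment of warps to SMSPs, and the function name are assumptions made for this sketch, not the values from the table and not how the hardware counters are implemented.

```python
NUM_SMSP = 4
ELAPSED_CYCLES = 24

# Hypothetical (start_cycle, end_cycle) per warp; NOT the values from the table.
# Warp w is assumed to run on SMSP w % NUM_SMSP, so each SMSP gets 2 of the 8 warps.
warp_lifetimes = [(0, 24), (0, 16), (0, 8), (0, 4),
                  (0, 24), (0, 16), (0, 8), (0, 4)]


def count_cycles(lifetimes):
    """Return (sm_cycles_active, smsp_cycles_active, smsp_warps_active)."""
    smsp_cycles_active = [0] * NUM_SMSP
    smsp_warps_active = [0] * NUM_SMSP  # active warps summed over all cycles
    sm_cycles_active = 0
    for cycle in range(ELAPSED_CYCLES):
        active = [0] * NUM_SMSP
        for w, (start, end) in enumerate(lifetimes):
            if start <= cycle < end:
                active[w % NUM_SMSP] += 1
        for s in range(NUM_SMSP):
            smsp_warps_active[s] += active[s]
            if active[s] > 0:
                smsp_cycles_active[s] += 1  # SMSP has at least 1 active warp
        if sum(active) > 0:
            sm_cycles_active += 1  # SM has at least 1 active warp
    return sm_cycles_active, smsp_cycles_active, smsp_warps_active


sm_active, smsp_active, smsp_warps = count_cycles(warp_lifetimes)
for s in range(NUM_SMSP):
    idle = 1.0 - smsp_active[s] / sm_active
    print(f"SMSP{s}: active {smsp_active[s]} cycles, idle {idle:.0%} of SM active cycles")

# Warps per active cycle, computed at SMSP level and at SM level.
print("per SMSP active cycle:", sum(smsp_warps) / sum(smsp_active))
print("per SM active cycle:  ", sum(smsp_warps) / sm_active)
```

In this model every SMSP holds 2 warps whenever it is active, so the per-SMSP figure looks fully occupied, while the SM-level figure is much lower because some SMSPs sit idle while the SM is still active.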

From the right side of the time table it can be observed that SMSP3 is idle >75% of the active cycles and 75% of the sm__cycles_active.

It is useful to compare SM level statistics to SMSP level statistics to determine load balancing issues.

This can be done using numerous comparisons.

First - Try to make sure all SMs are active during the measurement period. This may not always be possible if the grid is small. In this case you would want to try to overlap additional independent work.
sm__cycles_active.avg.pct_of_peak_sustained_elapsed (how often were SMs active)

Second - Try to make sure all SMSPs are active.
smsp__cycles_active.avg.pct_of_peak_sustained_elapsed (how often were SMSPs active)
smsp__cycles_active.avg vs. sm__cycles_active.avg
smsp__cycles_active.min vs. sm__cycles_active.max

Third - Compare the occupancy between SMSP and SM to see how imbalanced the work may be.
smsp__warps_active.avg.pct_of_peak_sustained_elapsed vs. sm__warps_active.avg.pct_of_peak_sustained_elapsed
You can also compare .min vs. .max
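As a rough sketch of how such comparisons could be turned into numbers once the metrics are collected: the helper name and all metric values below are hypothetical, chosen only to show the arithmetic.

```python
def balance_ratio(smsp_value, sm_value):
    """Return smsp_value / sm_value; values well below 1.0 hint at SMSP imbalance."""
    if sm_value <= 0:
        raise ValueError("sm_value must be positive")
    return smsp_value / sm_value


# Hypothetical metric values, made up for illustration (not from a real report).
metrics = {
    "smsp__cycles_active.avg": 13.0,
    "smsp__cycles_active.min": 4.0,
    "sm__cycles_active.avg": 24.0,
    "sm__cycles_active.max": 24.0,
}

avg_ratio = balance_ratio(metrics["smsp__cycles_active.avg"],
                          metrics["sm__cycles_active.avg"])
worst_ratio = balance_ratio(metrics["smsp__cycles_active.min"],
                            metrics["sm__cycles_active.max"])

print(f"smsp avg / sm avg: {avg_ratio:.2f}")
print(f"smsp min / sm max: {worst_ratio:.2f}")
```

A ratio near 1.0 would suggest the SMSPs are busy for about as long as their SM is; the lower the ratio, the more time some SMSPs spend idle while the SM is still counted as active.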

Many thanks for the updates. This is very helpful to understand the discrepancy in the earlier observations.