Theoretical and Achieved Occupancy metrics

Hi,

I’m trying to profile my application using nsys and ncu to create a performance overview. I export the trace as a SQLite database, which gives me some information on each kernel’s occupancy and the device’s limits. For example, it gives me the number of threads used and the shared memory per kernel (CUPTI_ACTIVITY_KIND_KERNEL), which I can compare against the device’s limits (TARGET_GPU_INFO).
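As a minimal sketch of what I mean, here is how the launch parameters can be pulled out of the SQLite export with Python. The column names (blockX, registersPerThread, …) follow the CUPTI kernel-activity schema but can vary between Nsight Systems versions, so check your own export with `.schema CUPTI_ACTIVITY_KIND_KERNEL` first; an in-memory stand-in table with made-up values is used here so the snippet runs on its own:

```python
import sqlite3

# Stand-in for an nsys export; replace ":memory:" with your exported .sqlite
# file and drop the CREATE/INSERT lines when querying a real trace.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
    blockX INT, blockY INT, blockZ INT,
    registersPerThread INT,
    staticSharedMemory INT, dynamicSharedMemory INT)""")
con.execute("INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (128, 1, 1, 200, 0, 0)")

query = """
    SELECT blockX * blockY * blockZ AS threads_per_block,
           registersPerThread,
           staticSharedMemory + dynamicSharedMemory AS shared_mem_bytes
    FROM CUPTI_ACTIVITY_KIND_KERNEL
"""
for threads, regs, smem in con.execute(query):
    print(threads, regs, smem)  # prints: 128 200 0
```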

Now I saw that ncu also provides all occupancy-related metrics pre-computed:

| Metric | Explanation |
| --- | --- |
| launch__occupancy_limit_barriers | Occupancy limit due to the number of barriers used. |
| launch__occupancy_limit_blocks | Occupancy limit due to the maximum number of blocks manageable per SM. |
| launch__occupancy_limit_registers | Occupancy limit due to register usage. |
| launch__occupancy_limit_shared_mem | Occupancy limit due to shared memory usage. |
| launch__occupancy_limit_warps | Occupancy limit due to block size. |
| launch__occupancy_per_barrier_count | Number of active warps for a given barrier count. |
| launch__occupancy_per_block_size | Number of active warps for a given block size. |
| launch__occupancy_per_cluster_size | Number of active clusters for a given cluster size. |
| launch__occupancy_per_register_count | Number of active warps for a given register count. |
| launch__occupancy_per_shared_mem_size | Number of active warps for a given shared memory size. |

I had the idea to compute the theoretical occupancy as a percentage via:
min(launch__occupancy_limit_warps, launch__occupancy_limit_registers, launch__occupancy_limit_shared_mem, launch__occupancy_limit_blocks) / maxWarpsPerSM * 100

For one of my kernels, this results in 3.1 percent, built up from the following values:

| Metric | Value |
| --- | --- |
| launch__occupancy_limit_blocks | 32.0 |
| launch__occupancy_limit_registers | 2.0 |
| launch__occupancy_limit_shared_mem | 6.0 |
| launch__occupancy_limit_warps | 16.0 |

This coincides with the values I find in the SQL table for register usage (which is the limiter according to the data above). I see that my block size is 128 threads with 200 registers per thread. The device has 65536 registers per SM and supports 64 warps per SM. Thus, the occupancy limit from registers would be:

floor(65536 / ( 128 * 200)) / 64 * 100 = 3.1

Now, to collect the achieved occupancy, I thought to use sm__maximum_warps_per_active_cycle_pct. But this reports 12.5%, which confuses me, as it is more than four times larger than my theoretical limit. I also compared it to sm__warps_active.avg.pct_of_peak_sustained_active, but that is 12.1% and thus much larger as well.

Thus I’m wondering what I’m not understanding, or whether the metrics I’m using are incorrect and I should use different ones. I initially used sm__warps_active.avg.pct_of_peak_sustained_elapsed to measure device load, but I’m now wondering if that could also be wrong.

Occupancy, as defined here, is:

“the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. Another way to view occupancy is the percentage of the hardware’s ability to process warps that is actively in use.”

Yes, correct! And the number of active warps is limited by registers, shared memory, the number of threads, etc. So how would you profile these metrics (including or excluding theoretical and achieved occupancy as a whole)?

Using the definition I quoted above and using your hardware and kernel example:

maximum number of possible active warps = 64.

number of active warps per multiprocessor = 2 blocks (limited by registers) × 4 warps per block = 8 warps

So the theoretical occupancy is 12.5%. The achieved occupancy is additionally impacted by possible warp stalls/latency and so can be somewhat less.
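The calculation above can be sketched in a few lines of Python, using the hardware and kernel values from the example (65536 registers per SM, 64 warps per SM, 128 threads per block, 200 registers per thread). Note this simple sketch ignores register allocation granularity, which on real hardware can round register usage up slightly:

```python
from math import floor

# Device limits (values from the example in this thread)
regs_per_sm = 65536
max_warps_per_sm = 64
warp_size = 32

# Kernel launch configuration from the example
threads_per_block = 128
regs_per_thread = 200

# Registers limit the number of resident *blocks*, not warps directly.
warps_per_block = threads_per_block // warp_size                               # 4
blocks_by_regs = floor(regs_per_sm / (threads_per_block * regs_per_thread))    # 2 blocks
active_warps = blocks_by_regs * warps_per_block                                # 8 warps
theoretical_occupancy = active_warps / max_warps_per_sm * 100                  # 12.5 %
print(theoretical_occupancy)
```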

Ah thank you!
I wasn’t aware that the occupancy limit metrics are expressed in blocks; I interpreted them directly as numbers of warps.
This makes a lot of sense, and now all my numbers are coinciding.
Thanks!
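For anyone landing here later: with the corrected interpretation, the theoretical occupancy also falls out of the launch__occupancy_limit_* metrics directly, once they are all read as block counts per SM. A short Python sketch using the metric values from the table above:

```python
# ncu limit metrics from the example above -- all expressed in blocks per SM.
limits_in_blocks = {
    "launch__occupancy_limit_blocks": 32.0,
    "launch__occupancy_limit_registers": 2.0,
    "launch__occupancy_limit_shared_mem": 6.0,
    "launch__occupancy_limit_warps": 16.0,
}
warps_per_block = 128 // 32   # block size 128 threads, warp size 32
max_warps_per_sm = 64

# Tightest block limit, converted from blocks to warps before normalizing.
theoretical = min(limits_in_blocks.values()) * warps_per_block / max_warps_per_sm * 100
print(theoretical)  # 12.5
```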
