Hi,
I’m trying to profile my application using nsys and ncu to create a performance overview. I export the nsys trace as a SQL database, which gives me some information on each kernel’s resource usage and on the device’s limits. For example, it gives me the number of threads and the amount of shared memory used (CUPTI_ACTIVITY_KIND_KERNEL), which I can compare to the device’s limits (TARGET_GPU_INFO).
Now I saw that ncu also provides all occupancy-related metrics pre-computed:
| Metric | Explanation |
|---|---|
| launch__occupancy_limit_barriers | Occupancy limit due to the number of used barriers. |
| launch__occupancy_limit_blocks | Occupancy limit due to maximum number of blocks manageable per SM. |
| launch__occupancy_limit_registers | Occupancy limit due to register usage. |
| launch__occupancy_limit_shared_mem | Occupancy limit due to shared memory usage. |
| launch__occupancy_limit_warps | Occupancy limit due to block size. |
| launch__occupancy_per_barrier_count | Number of active warps for given barrier count. |
| launch__occupancy_per_block_size | Number of active warps for given block size. |
| launch__occupancy_per_cluster_size | Number of active clusters for given cluster size. |
| launch__occupancy_per_register_count | Number of active warps for given register count. |
| launch__occupancy_per_shared_mem_size | Number of active warps for given shared memory size. |
I had the idea to compute the theoretical occupancy as a percentage via:
min(launch__occupancy_limit_warps, launch__occupancy_limit_registers, launch__occupancy_limit_shared_mem, launch__occupancy_limit_blocks) / maxWarpsPerSM * 100
For one of my kernels, this results in 3.1 percent, made up of the following values:
| Metric | Value |
|---|---|
| launch__occupancy_limit_blocks | 32.0 |
| launch__occupancy_limit_registers | 2.0 |
| launch__occupancy_limit_shared_mem | 6.0 |
| launch__occupancy_limit_warps | 16.0 |
This coincides with the values that I find in the SQL table regarding register usage (which is the limiter according to the data above). I see that my block size is 128 threads with 200 registers per thread. The device supports 65536 registers per SM and 64 warps per SM. Thus, the occupancy limited by registers would be:
floor(65536 / ( 128 * 200)) / 64 * 100 = 3.1
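The two calculations above can be reproduced with a short Python sketch. The function and variable names are made up for illustration. It treats the launch__occupancy_limit_* values as directly comparable to the maximum warps per SM, which is the assumption my formula makes and may not match the unit ncu actually reports, and it ignores any register allocation granularity on the hardware.

```python
# Values copied from the ncu output and the nsys SQL tables quoted above.
limits = {
    "launch__occupancy_limit_blocks": 32.0,
    "launch__occupancy_limit_registers": 2.0,
    "launch__occupancy_limit_shared_mem": 6.0,
    "launch__occupancy_limit_warps": 16.0,
}
MAX_WARPS_PER_SM = 64      # device limit
REGISTERS_PER_SM = 65536   # device limit
THREADS_PER_BLOCK = 128    # kernel launch configuration
REGISTERS_PER_THREAD = 200


def occupancy_from_limits(limit_values, max_warps_per_sm):
    """Smallest limit divided by max warps per SM, as a percentage.

    Assumption: the limits are in the same unit as max_warps_per_sm.
    """
    return min(limit_values) / max_warps_per_sm * 100.0


def register_limit(registers_per_sm, threads_per_block, registers_per_thread):
    """How many whole blocks' worth of registers fit on one SM."""
    return registers_per_sm // (threads_per_block * registers_per_thread)


occ_from_ncu = occupancy_from_limits(limits.values(), MAX_WARPS_PER_SM)
reg_limit = register_limit(REGISTERS_PER_SM, THREADS_PER_BLOCK, REGISTERS_PER_THREAD)
occ_from_sql = reg_limit / MAX_WARPS_PER_SM * 100.0

print(f"from ncu limits: {occ_from_ncu:.3f} %")
print(f"from SQL values: {occ_from_sql:.3f} % (register limit = {reg_limit})")
```

Both paths give the same number, so the ncu limit and my hand calculation agree with each other; the open question is only why they disagree with the achieved occupancy.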
Now, to collect the achieved occupancy, I thought I could use sm__maximum_warps_per_active_cycle_pct. But this reports 12.5%, which confuses me, as it is about four times my theoretical limit. I also compared it to sm__warps_active.avg.pct_of_peak_sustained_active, but this is 12.1% and thus much larger as well.
So I’m wondering what I’m misunderstanding, or whether the metrics I use are not the right ones and I should use different ones. I initially used sm__warps_active.avg.pct_of_peak_sustained_elapsed to measure the device load, but I’m now wondering whether that could be wrong as well.