Understanding degraded kernel performance with higher occupancy

Apologies in advance if this is too long of a read.

I have been working on addressing register pressure in a GPU kernel, using Nsight Compute to understand its performance. For a kernel I am working with on A100 GPUs with CUDA 12.2, my code modifications reduced the register pressure by 4x and increased the occupancy 4x. However, the “register optimized” version still ended up 60% slower than the baseline kernel. I have exported the Nsight Compute data and attached it.

Major code changes going from baseline to “register optimized”:

  1. Moved data from thread private memory to shared memory
  2. The baseline evaluates 1 data point per thread, whereas the optimized version evaluates 1 data point per warp. The CUDA launch grid sizes therefore differ between the two versions of the code, but both kernel versions evaluate the same number of data points. (A hedged sketch of what this restructuring might look like is included after this list.)
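
Since I cannot post the real code, here is a heavily simplified, hypothetical sketch of the kind of restructuring I mean. All names (NDIM, the square-and-sum math, etc.) are placeholders, not the actual kernel; the point is only the thread-per-point vs. warp-per-point layout and the move from thread-private arrays to shared memory:

```cuda
#define NDIM 32  // assumed per-point working-set size, one element per lane

// Baseline: one data point per thread, working set kept thread-private
// (registers / local memory), which is where the register pressure came from.
__global__ void baselineKernel(const double* __restrict__ in,
                               double* __restrict__ out, int nPoints)
{
    int point = blockIdx.x * blockDim.x + threadIdx.x;
    if (point >= nPoints) return;

    double work[NDIM];                       // thread-private working set
    for (int i = 0; i < NDIM; ++i)
        work[i] = in[point * NDIM + i];

    double acc = 0.0;
    for (int i = 0; i < NDIM; ++i)
        acc += work[i] * work[i];            // stand-in for the real per-point computation
    out[point] = acc;
}

// “Register optimized”: one data point per warp, working set staged in shared memory.
// Launch with dynamic shared memory = (blockDim.x / 32) * NDIM * sizeof(double).
__global__ void warpPerPointKernel(const double* __restrict__ in,
                                   double* __restrict__ out, int nPoints)
{
    int warpsPerBlock = blockDim.x / 32;
    int warpId = threadIdx.x / 32;
    int lane   = threadIdx.x % 32;
    int point  = blockIdx.x * warpsPerBlock + warpId;
    if (point >= nPoints) return;            // whole warp exits together

    extern __shared__ double smem[];
    double* work = smem + warpId * NDIM;     // this warp's slice of shared memory

    if (lane < NDIM)
        work[lane] = in[point * NDIM + lane];
    __syncwarp();

    double partial = (lane < NDIM) ? work[lane] * work[lane] : 0.0;
    for (int offset = 16; offset > 0; offset >>= 1)
        partial += __shfl_down_sync(0xffffffff, partial, offset);
    if (lane == 0) out[point] = partial;
}
```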

My understanding is that my code changes hurt global memory coalescing, causing stalls long enough to outweigh the advantage gained from the increased occupancy. However, I am not an expert yet, and I wanted to lean on the expertise in this forum to better comprehend the profiling results.

I have the following questions:

  1. The instructions increased across the board, but some warp stalls increased while others decreased. Are there other metrics I should be looking at to better understand why my “register optimized” version of the code did worse despite the better occupancy?

  2. Considering that some warp stalls increased and others decreased, I am guessing there is a notion of “some warp stalls are worse than others”? Is that true, and if so, could someone elaborate on this?

  3. Considering that my occupancy increased 4x, what would be the best way to make mathematical sense of what I am seeing? Something along the lines of: instructions/warp stalls need to change by a specific amount for me to theoretically break even with the baseline kernel (I have tried to write this down as a rough formula after this list). Does this make sense?

  4. One of my code changes assumes there is enough work per data point for it to be distributed across a warp (as opposed to evaluating one data point per thread). Is there anything in the profiler that could confirm or reject this assumption?
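
To make question 3 concrete, the rough back-of-envelope model I have in mind (assuming an issue-bound kernel; the symbols below are my own shorthand, not Nsight Compute metric names) is:

```latex
% T: kernel runtime, I: total warp instructions issued,
% IPC: achieved instructions issued per cycle across the GPU
T \propto \frac{I}{\mathrm{IPC}}
\qquad\Longrightarrow\qquad
\frac{T_{\mathrm{opt}}}{T_{\mathrm{base}}} \approx
\frac{I_{\mathrm{opt}}}{I_{\mathrm{base}}} \cdot
\frac{\mathrm{IPC}_{\mathrm{base}}}{\mathrm{IPC}_{\mathrm{opt}}}
```

In other words, for the optimized version to break even, the achieved issue rate (which is what the extra occupancy can buy, up to the hardware limit) would have to grow by at least the same factor as the executed instruction count. Is that roughly the right way to frame it?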

Please note, I am unable to share the code here, so I am not expecting a solution for speeding up my code. I am only trying to understand the profiling data in light of the above questions, and hopefully pinpoint the new bottleneck.

The Nsight Compute diff between the two versions of the kernel:
diff.pdf (1.9 MB)

Memory table diffs:
(memory table screenshots attached)

Comments from a somewhat limited perspective (mostly gathered across a small number of kernels), and without being able to view the stall distribution across instructions (Source View).

Your revised kernel is memory limited, to a large degree because of the added shared memory component. One area worth targeting would be shared memory bank conflicts, which will be contributing to the short scoreboard stalls that are now the dominant stall reason.
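
As a generic reference (since the actual kernel isn’t visible here), the classic fix when a shared-memory tile is read column-wise is to pad the inner dimension so that a warp’s accesses spread across the 32 banks. The tiled transpose below only illustrates the pattern; TILE, the matrix layout, and the launch configuration (blockDim = (TILE, TILE), n a multiple of TILE) are assumptions, not your code:

```cuda
#define TILE 32   // shared memory has 32 banks, 4 bytes wide

__global__ void transposePadded(const float* __restrict__ in,
                                float* __restrict__ out, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // the +1 column skews each row across banks

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load, conflict-free store
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;              // swap block indices for the output tile
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read: conflict-free only with the +1
}
```

The same padding idea is commonly applied to double tiles as well; the bank arithmetic just differs because each 8-byte element spans two 4-byte banks.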

The memory table indicates that there are no shared memory bank conflicts. The short scoreboard stalls probably show that there is not enough computation between two loads to effectively hide the shared memory latency.
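
If that is the case, one generic way to put more independent work between a shared-memory load and its consumer is manual software pipelining, i.e. prefetching the next value into a register while computing on the current one. The compiler often does this on its own, so treat the sketch below as an illustration of the idea rather than a guaranteed win; processRow and the arithmetic are placeholders:

```cuda
// Prefetch the element for iteration i+1 so its load overlaps with the
// arithmetic of iteration i. Assumes n >= 1 and smemRow points into shared memory.
__device__ double processRow(const double* smemRow, int n)
{
    double acc = 0.0;
    double cur = smemRow[0];              // first load issued up front
    for (int i = 0; i < n - 1; ++i)
    {
        double next = smemRow[i + 1];     // load for the next iteration...
        acc += cur * cur;                 // ...overlaps with this iteration's compute
        cur = next;
    }
    acc += cur * cur;                     // last element
    return acc;
}
```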

The optimized kernel executes far more instructions (+650%), especially memory instructions. This over-utilizes the memory pipelines: the instruction queues fill up, so a warp has to stall until it can issue its instruction (MIO Throttle and LG Throttle). You could try switching to wider loads, for example 16-byte instead of 8-byte (double2 instead of double); see the sketch below.
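
A hedged sketch of the wider-load suggestion (the buffer, the math, and the pairing of values are placeholders; it assumes the data can be viewed as double2, i.e. the pointer is 16-byte aligned, which cudaMalloc allocations are):

```cuda
// One 128-bit load per thread instead of two 64-bit loads, halving the number of
// load instructions that enter the LSU/MIO queues for the same number of bytes.
__global__ void wideLoadExample(const double2* __restrict__ in,
                                double* __restrict__ out, int nPairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPairs) return;

    double2 v = in[i];        // single vectorized load
    out[i] = v.x + v.y;       // stand-in for the real computation
}
```

An existing double* can be reinterpret_cast to const double2* at the call site, provided the alignment and an even element count hold.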

Am I misinterpreting the “Shared Memory” Bank Conflicts column, with a total of 67,878?

I think they meant that, compared to the number of instructions and requests, the bank conflicts are minimal? I didn’t think much of the conflicts because they are so small relative to the total requests.
