Apologies in advance if this is too long a read.
I have been working on reducing register pressure in a GPU kernel, using Nsight Compute to understand its performance. For a kernel running on A100 GPUs with CUDA 12.2, my code changes reduced register pressure by 4x and increased occupancy by 4x. However, the “register optimized” version still ended up 60% slower than the baseline kernel. I have exported the Nsight Compute data and attached it below.
Major code changes going from baseline to “register optimized”:
- Moved data from thread-private memory to shared memory
- The baseline evaluates 1 data point per thread, whereas the optimized version evaluates 1 data point per warp. The CUDA launch grid sizes therefore differ between the two versions of the code, but both kernel versions evaluate the same number of data points. (A simplified sketch of the two strategies follows below.)
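To make the structural change concrete, here is a minimal sketch of the two strategies. This is not my actual code; NPOINTS, FEATURES, and evalTerm() are placeholders I made up for illustration, and the real per-point work is much heavier.

```cuda
#include <cuda_runtime.h>

constexpr int NPOINTS  = 1 << 20;   // number of data points (placeholder)
constexpr int FEATURES = 32;        // per-point working set size (placeholder)
constexpr int WARP     = 32;

__device__ float evalTerm(float x, int f) { return x * (f + 1); }  // stand-in for the real math

// Baseline: one data point per thread, working set kept thread-private
// (registers, spilling to local memory under pressure).
__global__ void baselineKernel(const float* __restrict__ in, float* __restrict__ out)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= NPOINTS) return;

    float scratch[FEATURES];          // source of the register pressure
    float x = in[p];                  // consecutive threads read consecutive points -> coalesced
    for (int f = 0; f < FEATURES; ++f)
        scratch[f] = evalTerm(x, f);

    float acc = 0.f;
    for (int f = 0; f < FEATURES; ++f)
        acc += scratch[f];
    out[p] = acc;                     // coalesced store
}

// "Register optimized": one data point per warp, working set in shared memory.
// Launched with dynamic shared memory of (blockDim.x / WARP) * FEATURES * sizeof(float).
__global__ void perWarpKernel(const float* __restrict__ in, float* __restrict__ out)
{
    int warpsPerBlock = blockDim.x / WARP;
    int warpId = threadIdx.x / WARP;
    int lane   = threadIdx.x % WARP;
    int p = blockIdx.x * warpsPerBlock + warpId;   // one point per warp
    if (p >= NPOINTS) return;                      // uniform across the warp

    extern __shared__ float scratch[];
    float* myScratch = scratch + warpId * FEATURES;

    float x = in[p];                               // all 32 lanes read the same address
    for (int f = lane; f < FEATURES; f += WARP)    // lanes split the per-point work
        myScratch[f] = evalTerm(x, f);
    __syncwarp();

    if (lane == 0) {                               // simplified reduction; only 1 store per warp
        float acc = 0.f;
        for (int f = 0; f < FEATURES; ++f)
            acc += myScratch[f];
        out[p] = acc;
    }
}
```

Even in this toy version the memory pattern changes: in the baseline each warp issues 32 coalesced loads and stores, while in the per-warp version all lanes read the same address and only one lane writes the result, which feeds the coalescing concern I describe below.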
My current understanding is that my code changes hurt global memory coalescing, causing stalls long enough to outweigh the gains from the increased occupancy. However, I am not an expert yet, and I wanted to lean on the expertise in this forum to better interpret the profiling results.
I have the following questions:
- The instruction count increased across the board, but some warp stall reasons increased while others decreased. Are there other metrics I should be looking at to better understand why the “register optimized” version performs worse despite the higher occupancy?
- Given that some warp stalls increased and others decreased, I am guessing there is a notion of “some warp stall reasons are worse than others”? Is that true, and if so, could someone elaborate on it?
- Given that my occupancy increased 4x, what would be the best way to reason about this mathematically? Something along the lines of: instructions/warp stalls need to change by a specific amount for me to theoretically break even with the baseline kernel. Does this make sense? (A rough break-even model I have in mind is sketched after this list.)
- One of my code changes assumes there is enough work per data point for it to be distributed across a warp (as opposed to evaluating one data point per thread). Is there anything in the profiler that could confirm or reject this assumption?
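For the third question, the rough model I have in mind is the following; please correct me if this is off base. Assuming the clock rates are comparable between the two runs, and using total executed instructions (sm__inst_executed.sum, if I am reading the metric names right) together with the achieved per-SM IPC reported by the profiler:

```latex
T \approx \frac{I_{\text{total}}}{\text{IPC}_{\text{per SM}} \times N_{\text{SM}} \times f_{\text{clock}}}
\quad\Longrightarrow\quad
\frac{T_{\text{opt}}}{T_{\text{base}}} \approx \frac{I_{\text{opt}}}{I_{\text{base}}} \cdot \frac{\text{IPC}_{\text{base}}}{\text{IPC}_{\text{opt}}}
```

So to break even, the IPC improvement that the extra occupancy buys (by giving the schedulers more eligible warps to hide stalls) would have to at least match the growth in executed instructions; the 4x occupancy only helps to the extent it actually raises achieved IPC. Is that a reasonable way to frame it?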
Please note that I am unable to share the code here, so I am not expecting a solution for speeding up my code. I am only trying to understand the profiling data in terms of the questions above and, hopefully, pinpoint the new bottleneck.
The Nsight Compute diff between the two versions of the kernel:
diff.pdf (1.9 MB)
Memory table diffs: