I have some kernels with large argument lists. I got tired of wrestling with them during refactoring, so I bundled most of the const arguments into a single struct and passed that in instead. Kernel performance tanked, with grid duration increasing by almost 8000%.
I put 5 pointers to int arrays and 3 constants into one struct, for a total of 56 bytes. As best I can tell from Nsight Compute, the problem seems to be mostly LG Throttle stalls, which increased by 15,000%.
Nsight Compute blurb:
“On average, each warp of this kernel spends 149812.3 cycles being stalled waiting for the L1 instruction queue for local and global (LG) memory operations to be not full. Typically, this stall occurs only when executing local or global memory instructions extremely frequently. Avoid redundant global memory accesses. Try to avoid using thread-local memory by checking if dynamically indexed arrays are declared in local scope, or if the kernel has excessive register pressure caused by spills. If applicable, consider combining multiple lower-width memory operations into fewer wider memory operations and try interleaving memory operations and math instructions. This stall type represents about 54.1% of the total average of 276752.7 cycles between issuing two instructions.”
I’m working on a minimal reproducible example that can show this behavior, but in the meantime I’m trying to understand why I am seeing so much memory stall. I didn’t explicitly request any more data than before, but clearly something about the struct is causing far more memory requests. My assumption is that each member of the struct is being fetched individually, with the thread throwing away the rest of the struct and having to re-request the whole thing for the next member, but I don’t have a solid understanding of how CUDA fetches kernel arguments.
My GPU is an A2000 and this is my nvcc version info:
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Putting things in a struct may prevent the compiler from applying certain optimizations, such as routing loads through the read-only (RO) cache.
If you did something unexpected and increased the number of dereferences needed to access the data, that could be a contributor. For example, passing a pointer to a struct of pointers instead of passing a struct of pointers by value.
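For illustration, a minimal sketch of the difference (hypothetical struct and member names, not taken from your code):

// Struct of pointers passed by value: the struct lands in the kernel's
// parameter (constant) bank at launch, so p.a[i] is a single global load.
struct PtrBundle
{
    const int *a;
    const int *b;
    int n;
};

__global__ void by_value(PtrBundle p, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) out[i] = p.a[i] + p.b[i];
}

// Pointer to a struct in global memory: each use of p->a first requires a
// global load of the struct member itself, then the data load -- two
// dependent memory operations instead of one.
__global__ void by_pointer(const PtrBundle *p, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p->n) out[i] = p->a[i] + p->b[i];
}

Independently of how the pointers are packaged, declaring them const T * __restrict__ gives the compiler the best chance of routing the reads through the read-only cache.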
Sanity check: All performance work must be based on optimized release builds. That is the case here, for both versions, correct?
I cannot formulate a hypothesis of how such a massive slowdown would be caused by simply packaging kernel arguments into a struct. I suspect other contributing factors.
Are you able to simplify your actual code down to a small kernel that reproduces the performance issue? If so, I would suggest posting that reproducer here for reference. Massive inefficiencies introduced by the fairly innocent code modification described should be reflected in the generated machine code.
Fun fact: In very early CUDA, kernels were limited to a single kernel argument, so kernel arguments for pretty much all kernels had to be passed in a struct. I do not recall any massive performance improvements when this limitation was lifted prior to CUDA 1.0.
I did a cmake -DCMAKE_BUILD_TYPE=Release and set O-1 for my executable and still had this performance hit. Is there anything else I need to do to make an optimized release build?
Couldn’t replicate the problem with a toy kernel; I’m looking at options to clean up my kernel and write a minimal driver for it.
I don’t know what that means. I’m not a CMake expert, and CMake is not an NVIDIA product. If I were looking for this sort of help, I would provide the exact commands used to build the module in question, which can be obtained from CMake’s verbose output. Do as you wish, of course; just making a suggestion.
If O-1 means optimization level 1, that doesn’t sound right to me, but I don’t know if CMake applies that only to host code, or device code, or both. Or even if it has anything to do with optimization level.
I built a minimal example project with my kernel but can’t reproduce the issue, even with identical PTX between my problem kernel, the test kernel with the struct, and the test kernel with args. Because I am using randomly generated dummy data instead of my actual feed data, it may be input-data dependent or some other knock-on effect from my driving code.
Don’t know what other steps I can take to investigate at this point.
Can you provide the kernel prototypes for the struct-based and the argument-based versions? There can be a big performance difference between passing values, passing a struct by value, and passing a struct by pointer.
struct Parameters
{
    int a;
    int b;
    // … more members
};

__global__ void param_by_arg(int a, int b, …);
__global__ void param_by_value(Parameters params);
__global__ void param_by_pointer(Parameters *pparams);
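For completeness, a host-side sketch of launching the last two variants (grid/block sizes and the initializer are placeholders):

Parameters params = { /* fill in members */ };
dim3 grid(256), block(256);

// By value: the struct is copied into the kernel parameter buffer at launch,
// and the kernel reads the members from the constant parameter bank.
param_by_value<<<grid, block>>>(params);

// By pointer: the struct must live in device-accessible memory, and every
// member access inside the kernel becomes a global memory load.
Parameters *d_params = nullptr;
cudaMalloc(&d_params, sizeof(Parameters));
cudaMemcpy(d_params, &params, sizeof(Parameters), cudaMemcpyHostToDevice);
param_by_pointer<<<grid, block>>>(d_params);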
149812.3 cycles being stalled waiting for the L1 instruction queue for local and global (LG) memory operations to be not full
L2 hit latency is 200-400 cycles.
L2 miss to device memory latency is 400-800 cycles.
L2 miss to system memory latency is > 1000 cycles.
A UVM-based replayable page fault can be > 50,000 cycles.
Either you have tremendously divergent memory accesses (e.g. 32-way divergent), moved something from constant memory to system memory, or you are using UVM and faulting at a high rate. There is no other good explanation for the increase in cycles between instructions.
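To make the divergent case concrete, here is a toy sketch (not the kernels from this thread) of the two access patterns:

// Coalesced: adjacent threads in a warp touch adjacent elements, so a warp's
// loads are served by a handful of memory transactions.
__global__ void coalesced(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// "32-way divergent": each thread in a warp touches a different cache line,
// so one warp-wide load can require up to 32 transactions, which fills the
// L1 load/store queue and shows up as LG Throttle stalls.
__global__ void divergent(const int *in, int *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = ((long long)i * stride) % n;   // scattered index per thread
    if (i < n) out[i] = in[j];
}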
__global__ void testKer_Struct(const binningInputs input, int ** outputSections)
{
auto littleIndex = input.littleIndex;
auto sequence = input.sequence;
auto bigIndex = input.bigIndex;
auto split = input.split;
auto valA = input.valA;
auto valB = input.valB;
auto width = input.width;
auto max = input.max;
unsigned int sampleIndex = blockIdx.x * blockDim.x + threadIdx.x;
auto splitSel = split < sampleIndex;
auto outputSection_0 = outputSections[splitSel];
auto outputSection_1 = outputSections[splitSel+2];
if (sampleIndex < max)
{
// find which transition state the current sample is in by looking ahead to next chip, play with bits and assign an index
unsigned int selection = getBigSection(bigIndex[sampleIndex], sequence);
auto bigIndexVal = bigIndex[sampleIndex] + 1;
auto sequenceVal = -sequence[bigIndexVal];
auto val = sequenceVal * valA[sampleIndex];
auto bin = selection * width + littleIndex[sampleIndex];
atomicAdd(&outputSection_0[bin], val);
val = sequenceVal * valB[sampleIndex];
atomicAdd(&outputSection_1[bin], val);
}
}
__global__ void testKer_Args(
const short * valA,
const short * valB,
const unsigned short * bigIndex,
const unsigned short * littleIndex,
const short * sequence,
const unsigned int width,
const unsigned int split,
const unsigned int max, int ** outputSections)
{
unsigned int sampleIndex = blockIdx.x * blockDim.x + threadIdx.x;
auto splitSel = split < sampleIndex;
auto outputSection_0 = outputSections[splitSel];
auto outputSection_1 = outputSections[splitSel+2];
if (sampleIndex < max)
{
// find which transition state the current sample is in by looking ahead to next chip, play with bits and assign an index
unsigned int selection = getBigSection(bigIndex[sampleIndex], sequence);
// get the index used to lookup sequence value
auto bigIndexVal = bigIndex[sampleIndex] + 1;
auto sequenceVal = -sequence[bigIndexVal];
auto val = sequenceVal * valA[sampleIndex];
auto bin = selection * width + littleIndex[sampleIndex];
atomicAdd(&outputSection_0[bin], val);
val = sequenceVal * valB[sampleIndex];
atomicAdd(&outputSection_1[bin], val);
}
}
These are the struct-input and arg-input versions side by side. I can share the driver code I wrote if desired, however…
I just tested these side by side in my full program, and the arg and struct versions both have the LG stall problem, even though the identical arg-based kernel performs normally in an older branch. So I probably put the blame in the wrong place. Looking through the commit history, I can’t think of what would cause it, so maybe this branch is just haunted.
I wouldn’t ask for anyone else’s time on this until I can replicate the issue clearly.
You could compare the generated PTX and/or SASS code for the two branches (see njuffa’s advice from yesterday about generated machine code). Then you could see which code difference is responsible for the slowdown. Or even pinpoint the locations of the LG throttle stalls with Nsight Compute (it has a feature to jump to the hotspots in the code for a chosen stall reason).
Also check whether the new version uses more local memory (or any at all) due to register pressure.
This kind of stuff happens from time to time, and there is always a rational explanation. The usual approach is to do a binary search on the commits until a particular commit can be identified that introduced the regression. Then drill down on all the changes in that commit. My personal preference is to commit small incremental changes to the maximum extent feasible, so I don’t have to wade through hundreds of lines of changes when something blows up unexpectedly.
Another thing to consider is adding a performance test to your per-commit “smoke” testing, so performance regressions are caught right away.
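As a sketch of such a check (the baseline, tolerance, and kernel under test are placeholders, not taken from this thread), a cudaEvent-based timer around the kernel is usually enough to catch a regression of this magnitude:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical smoke test: times one launch of the kernel under test and
// returns a non-zero exit code if it exceeds a recorded baseline by a margin.
int main()
{
    const float baseline_ms = 1.0f;   // assumed: measured on a known-good commit
    const float tolerance   = 1.5f;   // allow 50% noise before flagging

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // ... allocate and initialize inputs here ...

    cudaEventRecord(start);
    // testKer_Args<<<grid, block>>>(...);   // kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms (baseline %.3f ms)\n", ms, baseline_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (ms > baseline_ms * tolerance) ? 1 : 0;
}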