I'm trying to optimize a large CUDA kernel and I'm running into instruction cache misses. The kernel is already highly optimized, achieving about 70% SOL of compute and over 32% of peak FLOPS on an RTX 4090. A little background: the kernel has the following generic form:
Load data from global to shared memory
__syncthreads()
for coordinate direction = 1,2,3
large computation using shared memory
end for
__syncthreads()
store data in global memory
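For reference, here is a minimal sketch of that structure (the tile size, names, and the computation itself are placeholders, not the real kernel):

#define NTHREADS  512          // baseline launch configuration
#define TILE_SIZE NTHREADS     // placeholder shared-memory tile size

__global__ void baseline_kernel(const float* __restrict__ in,
                                float* __restrict__ out)
{
    __shared__ float tile[TILE_SIZE];
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // load data from global to shared memory
    tile[threadIdx.x] = in[gid];
    __syncthreads();

    // for each coordinate direction: large computation using shared memory
    float result = 0.0f;
    for (int dir = 0; dir < 3; ++dir) {
        // stand-in for the real (much larger) per-direction computation
        result += tile[(threadIdx.x + dir) % TILE_SIZE];
    }
    __syncthreads();

    // store data in global memory
    out[gid] = result;
}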
The issue I'm trying to address is that the working set for the large computation is very large and I'm getting register spilling, which lowers performance. The baseline code uses 512 threads, 64 registers per thread, and two blocks per SM. This results in a stack size of 64 bytes, with 252 bytes of spill stores and 236 bytes of spill loads. This amount of spilling is not terrible for performance on a 4090 thanks to the large L2 cache, but it is much worse on GPUs with smaller L2 caches.
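(For anyone reproducing this: the register, stack, and spill numbers above come from ptxas, e.g. compiling with

nvcc -arch=sm_89 -Xptxas -v mykernel.cu

which prints the per-kernel register count, stack frame size, and spill store/load bytes; mykernel.cu is just a placeholder name.)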
The idea is to halve the thread count and use thread-level blocking (or "ILP"), which allows registers to be reused between the multiple blocked outputs and thus reduces register pressure.
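Roughly, the change looks like this (heavily simplified, with BLOCK_FACTOR = 2 and placeholder names; the real computation is of course much larger):

#define BLOCK_FACTOR 2                            // outputs per thread
#define NTHREADS_B   256                          // half the baseline thread count
#define TILE_SIZE_B  (NTHREADS_B * BLOCK_FACTOR)

__global__ void blocked_kernel(const float* __restrict__ in,
                               float* __restrict__ out)
{
    __shared__ float tile[TILE_SIZE_B];
    const int base = blockIdx.x * TILE_SIZE_B + threadIdx.x;

    // each thread now loads BLOCK_FACTOR elements
    for (int k = 0; k < BLOCK_FACTOR; ++k)
        tile[threadIdx.x + k * NTHREADS_B] = in[base + k * NTHREADS_B];
    __syncthreads();

    float result[BLOCK_FACTOR] = {0.0f, 0.0f};
    for (int dir = 0; dir < 3; ++dir) {
        // operands are pulled from shared memory into registers once...
        float coeff = tile[(threadIdx.x + dir) % TILE_SIZE_B];
        // ...and reused across both blocked outputs, which is where the
        // register savings come from
        for (int k = 0; k < BLOCK_FACTOR; ++k)
            result[k] += coeff * tile[(threadIdx.x + k * NTHREADS_B) % TILE_SIZE_B];
    }
    __syncthreads();

    for (int k = 0; k < BLOCK_FACTOR; ++k)
        out[base + k * NTHREADS_B] = result[k];
}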
Sure enough, this works exactly as expected. Sort of. If I only do one direction out of the three and compare the performance of the baseline and blocked kernels, the result is promising:
baseline time = 1.0 (normalized)
blocked time = 0.82 (normalized)
It's about 18% faster, which is about what I was expecting. So far so good.
When I do all 3 directions (with the direction loop unrolled) the picture is different:
baseline time = 1.0 (normalized)
blocked time (unrolled) = 1.19 (normalized)
The blocked code is now 19% slower, which is not great. Nsight Compute quickly identifies the culprit in this case: "stall no instruction".
It appears that as a result of unrolling the direction loop the number of SASS instructions has increased and I'm now getting instruction cache thrashing. The warp state stats for the blocked run show that the most significant stall reason is waiting for instructions. This is also confirmed by looking at the SASS, where I can see periodic pauses every ~48 instructions while new instructions are loaded. The blocked code has a SASS size of about 16k instructions while the baseline code is only about 12k instructions, so I'm assuming an instruction cache cutoff is being crossed between the two kernel sizes. I've already tried putting in artificial __syncthreads() to try to keep the warps closer together, but that just increased the barrier stalls and the overall run time. From what I see in the SASS, the instruction stalls occur during very long runs of hundreds of straight-line FFMA, FMUL, and FADD instructions. There is also a good deal of ILP within these runs and all operands are in registers, so they can be executed extremely quickly.
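(For reference, a rough per-kernel SASS instruction count can be obtained from the SASS dump, e.g.

cuobjdump -sass ./myapp > kernels.sass

and counting the instruction lines for the kernel of interest; myapp is a placeholder name.)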
The obvious solution is to not unroll the direction loop, which cuts the number of SASS instructions back to about 12k, and the "stall no instruction" warp state becomes much less significant.
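Concretely, "not unrolling" here just means forcing the loop to stay rolled, something along these lines (compute_direction is a stand-in for the large shared-memory body):

// keep the direction loop rolled so the large body exists only once in SASS
#pragma unroll 1
for (int dir = 0; dir < 3; ++dir) {
    compute_direction(dir, tile, result);
}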
The issue with this approach is that the compiler unnecessarily adds 336(!) bytes of stack space with 100 bytes of spill stores and 116 bytes of spill loads. We know the spills and stack are not necessary, since the unrolled code doesn't need any spills (the blocked kernel has access to 128 registers, but can use as few as 96 without spilling).
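For context on the 128/96 register figures: with 256 threads per block and two resident blocks per SM, the 65536-register file on the 4090 gives 65536 / (256 * 2) = 128 registers per thread. That budget is communicated through the launch bounds, roughly:

// same blocked kernel as in the sketch above, just showing the bounds:
// 256 threads/block, at least 2 blocks/SM -> at most 128 registers/thread
__global__ void __launch_bounds__(256, 2)
blocked_kernel(const float* __restrict__ in, float* __restrict__ out);

// alternatively the cap can be forced at compile time, e.g.
//   nvcc -maxrregcount=96 ...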
Needless to say, the non-unrolled variant is worse, with the following performance:
baseline time = 1.0 (normalized)
blocked time (no unroll) = 1.52 (normalized)
Nsight Compute confirms a massive increase in L1<->L2 traffic due to the STL and LDL instructions; the memory workload analysis shows near 100% utilization of the L1<->L2 pipeline.
The question is how to make the compiler call a large function without generating unnecessary and copious amounts of stack/spills?
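To be explicit about what I mean by "calling a large function": the per-direction body is factored into a __device__ function that is kept out of line, roughly like this (placeholder names and body, reusing the defines from the earlier sketch):

// out-of-line call: only one SASS copy of the large body...
__device__ __noinline__ void compute_direction(int dir,
                                               const float* tile,
                                               float result[BLOCK_FACTOR])
{
    // stand-in for the real, much larger shared-memory computation
    for (int k = 0; k < BLOCK_FACTOR; ++k)
        result[k] += tile[(threadIdx.x + dir + k) % TILE_SIZE_B];
}

// ...called from the rolled direction loop in the kernel:
//   for (int dir = 0; dir < 3; ++dir)
//       compute_direction(dir, tile, result);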
A separate experiment where 256 registers are available still results in 187 registers used and a stack size of 232 bytes (but no spills). There is literally enough register space left over, (256-187)*4 = 276 bytes, to contain the entire 232-byte stack.
Why is the stack size and register usage so different when inlining vs calling a large function?
This has been a highly frustrating optimization, since we know the blocked code is better and it is theoretically possible for the non-unrolled version to be faster than the baseline; the compiler just can't generate the code to do it.
Is there anywhere to go from here, or is this simply the end of the optimizations that can be done without writing PTX/SASS by hand?
Thanks,
Gaetan