Instruction cache misses + Local stack and spills

I’m trying to optimize a large CUDA kernel and I’m running into instruction cache misses. The kernel is already highly optimized, achieving about 70% SOL of compute and over 32% of peak FLOPS on an RTX 4090. A little background: the kernel has the following generic form:

Load data from global to shared memory
__syncthreads()
for coordinate direction = 1, 2, 3
    large computation using shared memory
end for
__syncthreads()
store data in global memory
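Concretely, a minimal CUDA sketch of this structure (the tile size, argument names, and per-direction computation are all placeholders, not the real kernel):

constexpr int TILE = 512;  // placeholder tile size

// Stand-in for the per-direction "large computation" on shared memory.
__device__ float compute_direction(const float* smem, int dir, int tid)
{
    return smem[tid] * (dir + 1);
}

__global__ void kernel(const float* __restrict__ in, float* __restrict__ out)
{
    __shared__ float smem[TILE];

    // Load data from global to shared memory.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = in[gid];
    __syncthreads();

    // Loop over the three coordinate directions.
    float result = 0.0f;
    for (int dir = 0; dir < 3; ++dir)
        result += compute_direction(smem, dir, threadIdx.x);
    __syncthreads();

    // Store the result in global memory.
    out[gid] = result;
}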

The issue I’m trying to address is that the working set for the large computation is very large and I’m getting register spilling, which lowers performance. The baseline code uses 512 threads, 64 registers, and two blocks per SM. This results in a stack frame of 64 bytes with 252 bytes of spill stores and 236 bytes of spill loads. This amount of spilling is not terrible for performance on a 4090 thanks to its large L2 cache, but it is much worse on GPUs with smaller L2 caches.

The idea is to halve the thread count and use thread-level blocking (or “ILP”) so that registers can be reused between the multiple blocked outputs, reducing the register pressure.
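As a hedged sketch of that idea (the names and math are made up, not the real kernel): each thread produces BLOCK outputs, so an operand loaded into a register once is reused for all of them:

constexpr int BLOCK = 2;  // outputs per thread; 512 threads become 256

__global__ void blocked_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               const float* __restrict__ coeff, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * BLOCK;

    float acc[BLOCK] = {0.0f, 0.0f};
    for (int k = 0; k < n; ++k) {
        float c = coeff[k];  // loaded once, reused for both blocked outputs
        #pragma unroll
        for (int b = 0; b < BLOCK; ++b)
            acc[b] += c * in[base + b + k];  // stand-in for the real math
    }

    #pragma unroll
    for (int b = 0; b < BLOCK; ++b)
        out[base + b] = acc[b];
}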

Sure enough, this works exactly as expected. Sort of. If I do only one of the three directions and compare the performance of the baseline and blocked kernels, the result is promising:

baseline time = 1.0 (normalized)
blocked time = 0.82 (normalized)

It’s about 18% faster, which is roughly what I was expecting. So far so good.

When I do all 3 directions (with the direction loop unrolled) the picture is different:

baseline time = 1.0 (normalized)
blocked time (unrolled) = 1.19 (normalized)

The blocked code is now 19% slower, which is not great. Nsight Compute quickly identifies the culprit in this case: “stall no instruction”.

It appears that as a result of unrolling the direction loop, the number of SASS instructions has increased and I’m now getting instruction cache thrashing. The warp state stats for the blocked run are as follows:

[warp state statistics chart from Nsight Compute]

So we can see the most significant stall reason is waiting for instructions. This is also confirmed by looking at the SASS, where we can see periodic pauses every ~48 instructions while the next instructions are fetched. The blocked code has a SASS size of about 16k instructions while the baseline code is only 12k instructions. I’m assuming there is an instruction cache size cutoff being crossed between the two kernel sizes which is responsible. I’ve already tried inserting artificial __syncthreads() to try to keep the warps closer together, but that just increased the barrier stalls and the overall run time. From what I see in the SASS, the instruction stalls occur during very long runs of hundreds of straight-line FFMA, FMUL, and FADD instructions. There is also a good deal of ILP within these runs and all operands are in registers, so they can execute extremely quickly.

The obvious solution is to not unroll the direction loop, which cuts the number of SASS instructions back to 12k, and the “stall no instruction” warp state becomes much less significant.

The issue with this approach is that the compiler unnecessarily adds 336(!) bytes of stack space with 100 bytes of spill stores and 116 bytes of spill loads. We know the stack and spills are not necessary, since the unrolled code doesn’t need any (the blocked kernel has access to 128 registers, but can use as few as 96 without spilling).
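For reference, these register budgets follow from the launch configuration; a sketch of pinning them down with __launch_bounds__ (the kernel signature is a placeholder):

// Baseline: 512 threads x 2 blocks/SM = 1024 resident threads;
//           65536 regs per SM / 1024 = 64 registers per thread.
// Blocked:  256 threads x 2 blocks/SM =  512 resident threads;
//           65536 / 512 = 128 registers per thread.
__global__ void __launch_bounds__(256, 2) blocked_kernel(/* ... */);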

Needless to say, this variant is worse, with the following performance:

baseline time = 1.0 (normalized)
blocked time (no unroll) = 1.52 (normalized)

Nsight Compute confirms a massive increase in L1<->L2 traffic due to the STL and LDL instructions. The memory workload analysis shows near 100% utilization of the L2<->L1 pipeline.

The question is: how do I make the compiler call a large function without generating unnecessary, copious amounts of stack/spills?

A separate experiment where 256 registers are available still results in 187 registers used and a stack size of 232 bytes (but no spills). There is literally enough register space left, (256 - 187) * 4 = 276 bytes, to contain the entire 232-byte stack.

Why is the stack size and register usage so different when inlining vs calling a large function?

This has been a highly frustrating optimization: we know the blocked code is better, and it is theoretically possible for the non-unrolled version to be faster than the baseline; the compiler just can’t generate the code to do it.

Is there anywhere to go from here, or is this simply the end of the optimizations that can be done without writing PTX/SASS by hand?

Thanks,
Gaetan

If you haven’t already seen it, this may offer something:

Thanks,

Yes, I’ve already read that thoroughly.
Gaetan

Can you restructure the code in the following way?

float bigregarray[64];
#pragma unroll 1 // deactivate unrolling of the direction loop
for (int dir = 1; dir <= 3; ++dir) {
    // direction-specific loading code; the loops over i inside each
    // case are unrolled
    switch (dir) {
    case 1: bigregarray[i] = sharedarray[i*2];     break;
    case 2: bigregarray[i] = sharedarray[i*3];     break;
    case 3: bigregarray[i] = sharedarray[i*i*4-2]; break;
    }
    // large computation on bigregarray
    result[1000 * dir + i] = sharedarray[i];
}

The idea is not to use any dynamic or thread-dependent indices, except in unrolled loops.
The direction for loop is not unrolled, but inside it there can be loops over i, which are unrolled.
To differentiate between the coordinates without making the indices dynamic, there is loading code specific to each coordinate (the switch…case); the actual computation code is independent of the coordinate.

So the loading code is as bulky as in the unrolled case, but the computation code is stored only once.

That is actually fairly close to what is done. There is a switch for each direction to load thread-local data, and then the large operations are done on it. The issue is all the unnecessary stack-frame pushes that result from not unrolling everything.

Thanks,
Gaetan

Use 256 registers per thread to see whether the spilling comes from not having enough registers or from missed dynamic indices. You mentioned that experiment; what variable is the 232-byte stack containing? Can you view the SASS code side-by-side with the C++? Have you made sure that variable is not using the direction index, directly or indirectly? Or any other non-unrolled loop variables? Have you used #pragma unroll 1 on the direction for loop?

Optionally: can you try making the direction loop iterator even less predictable by loading it from global memory in each iteration? (Without running the program, just to see whether that changes anything.)
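For example (purely a compile-time experiment; dirs_gmem is a hypothetical int* kernel argument):

#pragma unroll 1
for (int it = 0; it < 3; ++it) {
    int dir = dirs_gmem[it];  // opaque to the compiler: loaded from global
    // ... direction-specific loads and the large computation ...
}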

Thanks for the suggestion.

I think you might be onto something. I have a few arrays that look like this:

 constexpr int eye[9] = {1, 0, 0, 0, 1, 0, 0, 0, 1};

That array was being pushed to the local stack, which made absolutely no sense. However, I am indexing it with something like eye[3*dir + 0], which is fine when dir is a compile-time constant but not when it is dynamic. That might explain why the code was pushing constexpr values to the stack.
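For anyone following along, the failure mode looks roughly like this (a sketch; the real arrays and expressions differ):

constexpr int eye[9] = {1, 0, 0, 0, 1, 0, 0, 0, 1};

__device__ float fold_ok(float x)
{
    return x * eye[3*0 + 0];  // index known at compile time: constant-folds
}

__device__ float fold_fails(float x, int dir)
{
    // dir only known at run time: the compiler cannot keep a dynamically
    // indexed array in registers, so it materializes eye on the local
    // stack (the STL instructions in the SASS) and reads it back.
    return x * eye[3*dir + 0];
}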

I saw this instruction, which from what I can tell is storing the explicit zero register (from eye) on the local stack:
STL.64 [R1+0x8], RZ

I’ll see if removing the dynamic indexing into these arrays helps bring down the stack size.

Thanks,
Gaetan

It is possible to use dynamic indices for global memory, shared memory, and constant memory.

For registers you can use conditionals.

Instead of
eye[3*dir + 0]

you could directly use:
dir == 0.

Perhaps even
dir == 0 ? eye[3*0 + 0] : (dir == 1 ? eye[3*1 + 0] : eye[3*2 + 0])
would give good code.
With
__builtin_assume(dir == 0 || dir == 1 || dir == 2);
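Putting those together, a hypothetical helper along those lines (assuming the array really is the 3x3 identity, so eye[3*dir + col] is 1 exactly when dir == col):

__device__ __forceinline__ float eye_entry(int dir, int col)
{
    __builtin_assume(dir == 0 || dir == 1 || dir == 2);
    // Identity matrix: no dynamic indexing needed, just a compare,
    // which the compiler can keep entirely in registers/predicates.
    return (dir == col) ? 1.0f : 0.0f;
}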

Alternatively, exchanging the constexpr array for __constant__ memory could be enough.

Or, in this case, trying to find out if there are optimizations possible with the ones and zeroes, which are perhaps used as factors later on.

Or, besides loading the eye value in a switch…case at the beginning of each iteration, you can also use a rotation over all three coordinates in each iteration. That would shorten the program, as the code would be the same for all iterations.
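A rough sketch of that rotation idea (hypothetical; assumes the three coordinate slices fit in registers and the computation body is symmetric under x -> y -> z -> x):

__device__ float rotate_directions(float cx, float cy, float cz)
{
    float sum = 0.0f;
    #pragma unroll 1                  // keep the direction loop rolled
    for (int dir = 0; dir < 3; ++dir) {
        sum += cx * cx - cy * cz;     // stand-in for the large computation

        // rotate the coordinates so one code body serves all directions
        float t = cx; cx = cy; cy = cz; cz = t;
    }
    return sum;
}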

Hi Curefab

Thanks so much for the suggestion! It turns out dynamic indexing of the “eye” array and friends was causing the stack pushes. Once I tracked them all down and replaced them with dir==0 (and equivalents), the stack size dropped to 0 with essentially zero spills. Along with another optimization for more register reuse among the blocked outputs, the new code is about 25% faster than the baseline. Success!

Thanks for your help!
Gaetan
