Very long kernels resulting in unoptimized compilation

I’m working on a kernel that is built around keeping a lot of simulation data in registers and running a large amount of inline code that interacts directly with those registers (roughly 16k lines of PTX). I’m running into a problem once the generated code passes a certain complexity: beyond that point the compiler decides all those registers go on the stack, and all the gained efficiency goes away.

I believe the problem happens in an early stage of compilation, before impossible execution paths have been eliminated.

Is there some way to avoid this huge loss in efficiency?

This code is contrived and not what I’m working on… it’s meant to be a reproducible example.

const int test_size = 128;
__device__ float test_input[test_size];
__device__ float test_output;

__global__ void
test_compilation() {
    register float accum[test_size];

#pragma unroll
    for (int i = 0; i < test_size; i++) {
        accum[i] = test_input[i];
    }

    // main loop

/*
big difference between i<17357 and i<17358 (CUDA 11.0, sm_75 target)

i<17357 uses all registers:
1>    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 168 registers, 352 bytes cmem[0]

i<17358 uses lots of local memory:
1>    512 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 34 registers, 352 bytes cmem[0]
*/

#pragma unroll
    for (int i = 0; i < 17357; i++) {
        accum[i % test_size] += accum[(i * 2) % test_size];
    }


    float total = 0;
#pragma unroll
    for (int i = 0; i < test_size; i++)
        total += accum[i];
    test_output = total;
}

The CUDA compiler has two main components that transform code: the front end, NVVM, which is based on the widely used LLVM framework, and the back end, ptxas, which performs some general and all machine-specific optimizations. The interface between these two components is PTX, which is a virtual instruction set and a compiler intermediate format. PTX never interacts directly with the hardware registers. It only uses virtual registers in SSA (static single assignment) fashion, meaning the result of every operation is assigned to a new virtual register. ptxas is responsible for the allocation of physical registers, as this is specific to the GPU architecture.
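
As a rough illustration, even a trivial device function shows this SSA style when you look at its PTX (for example via nvcc --ptx or Compiler Explorer). The sketch below is mine, not from the original post; the PTX in the comment is representative output only, and the actual function name will appear mangled and register numbering may differ.

// Minimal sketch: a trivial device function whose PTX illustrates SSA-style
// virtual registers. The PTX shown in the comment is representative, not
// guaranteed output for any particular toolchain or target.
__device__ float fmadd(float a, float b, float c) {
    return a * b + c;
}
// Representative PTX body:
//   ld.param.f32  %f1, [fmadd_param_0];
//   ld.param.f32  %f2, [fmadd_param_1];
//   ld.param.f32  %f3, [fmadd_param_2];
//   fma.rn.f32    %f4, %f1, %f2, %f3;   // result lands in a fresh virtual register
//   st.param.f32  [func_retval0+0], %f4;
//   ret;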

It is difficult to form a good hypothesis based on the scant information provided, but I think it is possible that using tons of inline PTX code is part of the problem, in that it impedes many powerful high-level optimizations otherwise performed by NVVM. Generally speaking, inline PTX code should be used sparingly, for example to access functionality that is not efficiently expressible at the C++ level or via intrinsics.
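
As an example of the kind of narrow use that is appropriate (my own sketch, not code from the original post): wrapping a single instruction that has no C++ or intrinsic equivalent, such as reading the %globaltimer special register, while leaving all surrounding arithmetic in plain C++ so NVVM can optimize it.

// Hedged sketch of narrowly scoped inline PTX: read the %globaltimer special
// register, which has no dedicated intrinsic, and keep everything else in C++.
__device__ unsigned long long read_globaltimer() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}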

All thread-local variables are by default assigned to local memory. ptxas decides as part of optimizations which of these should be pulled into registers. Scalars and small arrays with compile-time resolvable addressing are the usual candidates. The compiler tries to achieve the highest possible performance and considers trade-offs, e.g. massive use of registers reduces occupancy and potentially lowers performance. So if there are nested loops, it may assign variables from innermost loops to registers while variables from outermost loops remain in local memory.
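
To make the “compile-time resolvable addressing” point concrete, here is a small illustrative sketch (not from the original post); whether the array is actually promoted is still up to ptxas and the target architecture.

__global__ void indexing_sketch(const float* in, float* out, int k) {
    float a[4];

    // Indices become compile-time constants after full unrolling, so ptxas can
    // typically keep a[] in registers.
#pragma unroll
    for (int i = 0; i < 4; i++)
        a[i] = in[threadIdx.x * 4 + i];

    // Index known only at run time: a[] generally has to be addressable in
    // local memory, and this access becomes a local-memory load.
    out[threadIdx.x] = a[k & 3];
}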

The CUDA compiler is mature at this point and usually makes good decisions about register allocation. An observation that some variables are assigned to registers and others to local memory is, by itself, largely irrelevant. What is relevant is whether the choices made by the compiler negatively impact performance, and by how much. No specific performance data was given in the original post.

It is possible that optimization quality suffers for very large code. I cannot provide a crisp definition of “very large”, but in general this would likely apply to single kernels consisting of several tens of thousands of lines of PTX, e.g. 40 KLOC. Programmers expect reasonable execution times from a compiler, and since the number of possible arrangements (instruction selection and ordering, register allocation) can grow rapidly with increasing code size, some optimization phases may apply shortcuts if the compiler’s resource usage (time, memory) increases too much. This could result in less thoroughly optimized machine code.
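
One generic way to keep the emitted PTX of a long loop out of that range is to cap the unroll factor instead of fully unrolling. This is only a sketch of the general technique, not a fix for the kernel as posted; there, full unrolling is precisely what makes the accum[] indices compile-time constants.

// Generic sketch: unroll by a fixed factor so code size grows with the unroll
// factor rather than with the loop trip count.
__global__ void capped_unroll(const float* in, float* out, int n) {
    float s = 0.0f;
#pragma unroll 8
    for (int i = 0; i < n; i++)
        s += in[i];
    out[threadIdx.x] = s;
}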

It is also possible that your code is affected by a particular shortcoming of the compiler (inefficiency or bug) that has been addressed in the latest version of the compiler. Compiler engineers are primarily interested in issues observable with the latest shipping toolchain, CUDA 12.1 at present. So if possible, I would suggest trying that first.

I would further suggest using the CUDA profiler to identify the bottlenecks in the code and to observe whether and how bottlenecks and important performance statistics fluctuate as source code changes are made. This should give a better idea of which parts of a very large kernel are involved in particular performance regressions.
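
If running the profiler on every iteration is inconvenient, a simple cudaEvent-based timing harness can at least track how the kernel’s run time moves as the source changes, although it will not show where the time goes. The sketch below is mine, under my own assumptions: it assumes the test_compilation kernel from the original post is in the same file, uses an illustrative launch configuration, and omits error checking.

#include <cstdio>

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    test_compilation<<<1, 1>>>();   // launch configuration is illustrative only
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}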


Compiler Explorer shows no stack frame usage with CUDA 12 and 20000 loop iterations.
