Very long kernels resulting in unoptimized compilation

I’m working on a kernel that is built around keeping a lot of simulation data in registers and running a large amount of inline code that interacts directly with those registers (roughly 16k lines of PTX). I’m running into a problem once the generated code passes a certain complexity: beyond that point the compiler decides all those registers go on the stack, and all the gained efficiency goes away.

I believe the problem happens in an early stage of compilation, before impossible execution paths have been eliminated.

Is there some way to avoid this huge loss in efficiency?

This code is contrived and not what I’m working on… it’s meant to be a reproducible example.

const int test_size = 128;
__device__ float test_input[test_size];
__device__ float test_output;

__global__ void
test_compilation() {
    register float accum[test_size];

#pragma unroll
    for (int i = 0; i < test_size; i++) {
        accum[i] = test_input[i];
    }

    // main loop

/*
big difference between i<17357 and i<17358 (CUDA 11.0, sm_75 target)

i<17357 uses all registers:
1>    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 168 registers, 352 bytes cmem[0]

i<17358 uses lots of local memory:
1>    512 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 34 registers, 352 bytes cmem[0]
*/

#pragma unroll
    for (int i = 0; i < 17357; i++) {
        accum[i % test_size] += accum[(i * 2) % test_size];
    }


    float total = 0;
#pragma unroll
    for (int i = 0; i < test_size; i++)
        total += accum[i];
    test_output = total;
}

The CUDA compiler has two main components that transform code: the front end, NVVM, which is based on the widely used LLVM framework, and the back end, ptxas, which performs some general and all machine-specific optimizations. The interface between these two components is PTX, which is a virtual instruction set and a compiler intermediate format. PTX never interacts directly with the hardware registers. It only uses virtual registers in SSA (static single assignment) fashion, meaning the result of every operation is assigned to a new virtual register. ptxas is responsible for the allocation of physical registers, as this is specific to the GPU architecture.
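
As a rough illustration, even a trivial device function shows this SSA style when you look at its PTX (for example via nvcc --ptx or Compiler Explorer). The sketch below is mine, not from the original post; the PTX in the comment is representative output only, and the actual function name will appear mangled and register numbering may differ.

// Minimal sketch: a trivial device function whose PTX illustrates SSA-style
// virtual registers. The PTX shown in the comment is representative, not
// guaranteed output for any particular toolchain or target.
__device__ float fmadd(float a, float b, float c) {
    return a * b + c;
}
// Representative PTX body:
//   ld.param.f32  %f1, [fmadd_param_0];
//   ld.param.f32  %f2, [fmadd_param_1];
//   ld.param.f32  %f3, [fmadd_param_2];
//   fma.rn.f32    %f4, %f1, %f2, %f3;   // result lands in a fresh virtual register
//   st.param.f32  [func_retval0+0], %f4;
//   ret;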

It is difficult to form a good hypothesis based on the scant information provided, but I think it is possible that using tons of inline PTX code is part of the problem, in that it impedes many powerful high-level optimizations otherwise performed by NVVM. Generally speaking, inline PTX code should be used sparingly, for example to access functionality that is not efficiently expressible at the C++ level or via intrinsics.
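
As an example of the kind of narrow use that is appropriate (my own sketch, not code from the original post): wrapping a single instruction that has no C++ or intrinsic equivalent, such as reading the %globaltimer special register, while leaving all surrounding arithmetic in plain C++ so NVVM can optimize it.

// Hedged sketch of narrowly scoped inline PTX: read the %globaltimer special
// register, which has no dedicated intrinsic, and keep everything else in C++.
__device__ unsigned long long read_globaltimer() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}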

All thread-local variables are by default assigned to local memory. ptxas decides as part of optimizations which of these should be pulled into registers. Scalars and small arrays with compile-time resolvable addressing are the usual candidates. The compiler tries to achieve the highest possible performance and considers trade-offs, e.g. massive use of registers reduces occupancy and potentially lowers performance. So if there are nested loops, it may assign variables from innermost loops to registers while variables from outermost loops remain in local memory.
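
To make the “compile-time resolvable addressing” point concrete, here is a small illustrative sketch (not from the original post); whether the array is actually promoted is still up to ptxas and the target architecture.

__global__ void indexing_sketch(const float* in, float* out, int k) {
    float a[4];

    // Indices become compile-time constants after full unrolling, so ptxas can
    // typically keep a[] in registers.
#pragma unroll
    for (int i = 0; i < 4; i++)
        a[i] = in[threadIdx.x * 4 + i];

    // Index known only at run time: a[] generally has to be addressable in
    // local memory, and this access becomes a local-memory load.
    out[threadIdx.x] = a[k & 3];
}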

The CUDA compiler is mature at this point and usually makes good decisions about register allocation. An observation that some variables are assigned to registers and others to local memory is, by itself, largely irrelevant. What is relevant is whether the choices made by the compiler negatively impact performance, and by how much. No specific performance data was given in the original post.

It is possible that optimization quality suffers for very large code. I cannot provide a crisp definition of “very large”, but in general this would likely apply to single kernels consisting of several tens of thousands of lines of PTX, e.g. 40 KLOC. Programmers expect reasonable execution times from a compiler, and since the number of possible arrangements (instruction selection and ordering, register allocation) can grow rapidly with increasing code size, some optimization phases may apply shortcuts if the compiler’s resource usage (time, memory) increases too much. This could result in less thoroughly optimized machine code.
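
One generic way to keep the emitted PTX of a long loop out of that range is to cap the unroll factor instead of fully unrolling. This is only a sketch of the general technique, not a fix for the kernel as posted; there, full unrolling is precisely what makes the accum[] indices compile-time constants.

// Generic sketch: unroll by a fixed factor so code size grows with the unroll
// factor rather than with the loop trip count.
__global__ void capped_unroll(const float* in, float* out, int n) {
    float s = 0.0f;
#pragma unroll 8
    for (int i = 0; i < n; i++)
        s += in[i];
    out[threadIdx.x] = s;
}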

It is also possible that your code is affected by a particular shortcoming of the compiler (inefficiency or bug) that has been addressed in the latest version of the compiler. Compiler engineers are primarily interested in issues observable with the latest shipping toolchain, CUDA 12.1 at present. So if possible, I would suggest trying that first.

I would further suggest using the CUDA profiler to identify the bottlenecks in the code and to observe whether and how bottlenecks and important performance statistics fluctuate as source code changes are made. This should give a better idea of which parts of a very large kernel are involved in particular performance regressions.
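
If running the profiler on every iteration is inconvenient, a simple cudaEvent-based timing harness can at least track how the kernel’s run time moves as the source changes, although it will not show where the time goes. The sketch below is mine, under my own assumptions: it assumes the test_compilation kernel from the original post is in the same file, uses an illustrative launch configuration, and omits error checking.

#include <cstdio>

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    test_compilation<<<1, 1>>>();   // launch configuration is illustrative only
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}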


Compiler Explorer shows no stack frame usage with CUDA 12 and 20000 loop iterations.
