A strange phenomenon in register allocation. How to reduce register pressure?

Over the past few days I have tried many ways to reduce register pressure, but with no progress. Moreover, I find the register allocation of the nvcc compiler completely incomprehensible, in that changing just two lines of my code results in significantly higher register consumption. So I have decided to put all the details here, and I really need help fighting register pressure in my case.

I provide a minimal reproducible example below. The structure of the code is somewhat similar to matrix multiplication (e.g., A * B + C): the program first reads C, then repeatedly loads a tile of the inputs (A and B) of size 16 into shared memory, computes a partial result, and accumulates it into the final answer. In the code I create a template kernel that is then specialized into two versions. The only difference between the two versions lies in how the address offset of the input pointer is computed. Everything else is identical except for two lines: line 72 and line 131. However, these two lines are not inside the main loop; they are executed only at initialization and at the end. So we should expect the performance of the two kernels to be similar. However, after compilation, kernel 1 uses 64 registers while kernel 2 uses 96! I cannot understand at all why the difference in these two lines results in such a huge gap in register allocation. In my opinion, the register bottleneck should be inside the main loop, where the heavy computation happens.

Consequently, kernel 1 is 50% faster than kernel 2, even though the computation in the two kernels is exactly the same and the only difference is in how the input is fetched. I have tried a lot to reduce the register pressure for kernel 2, but with no progress. Using --maxrregcount or launch bounds does not help, even though they do reduce the register count. After many days of struggling I finally have to come here for help. I would really appreciate it if someone could help me make the kernel faster. Thanks a lot!

#include <type_traits>

const int BLOCK_SIZE = 16;
#define CONST_PTR const float* __restrict__
#define PTR float* __restrict__

struct FCInput {
    __device__ int get_offset(int G, int C, int HW, int b, int g, int c, int hw, bool& success) {
        success = true;
        return ((b * G + g) * C + c) * HW + hw;
    }
};

template <int K, int P, bool near_padding>
struct ConvInput {
    int H, W, S, WO;
    __device__ int get_offset(int G, int C, int HW, int b, int g, int c, int hw, bool& success) {
        int offset_bc = ((b * G + g) * C + c) / (K * K);
        int h = hw / WO, w = hw - h * WO;
        int offset_h = h * S + c / K % K - P, offset_w = w * S + c % K - P;
        if (P > 0 && near_padding) {
            offset_w = max(min(offset_w, W - 1), 0);
            offset_h = max(min(offset_h, H - 1), 0);
        }
        success = near_padding || (offset_w >= 0 && offset_w < W && offset_h >= 0 && offset_h < H);
        return (offset_bc * H + offset_h) * W + offset_w;
    }
};

template <typename Input>
__device__ __forceinline__ bool is_conv(Input i_conf) { return !std::is_empty<Input>::value; }

__device__ __forceinline__
float update_backward(float x, float w, float b, float o, float g) {
    return g * exp2f(w * x + b - o);
}

template <int GROUP_CI, int GROUP_B, bool has_hw, bool check_co, typename Input> __global__
void logsumexp_backward_input_kernel(CONST_PTR grad_output, CONST_PTR input, CONST_PTR weight, CONST_PTR bias,
                                     CONST_PTR output, int B, int CO_div_G, int CI_div_G, int HW, int G, PTR grad_input,
                                     Input i_conf) {
    if (!has_hw) HW = 1;

    int b_hw = blockIdx.x * (BLOCK_SIZE * GROUP_B) + (has_hw ? threadIdx.x : threadIdx.y);
    int b[GROUP_B], hw[GROUP_B];
    #pragma unroll
    for (int i = 0; i < GROUP_B; i++) {
        int b_hw_i = b_hw + i * BLOCK_SIZE;
        if (has_hw) { b[i] = b_hw_i / HW; hw[i] = b_hw_i % HW; }
        else { b[i] = b_hw_i; hw[i] = 0; }
    }

    __shared__ float blockO[GROUP_B][BLOCK_SIZE][BLOCK_SIZE]; // CO * B if has_hw else B * CO
    __shared__ float blockG[GROUP_B][BLOCK_SIZE][BLOCK_SIZE]; // CO * B if has_hw else B * CO
    __shared__ float blockW[GROUP_CI][BLOCK_SIZE][BLOCK_SIZE]; // CO * CI
    __shared__ float blockB[GROUP_CI][BLOCK_SIZE][BLOCK_SIZE]; // CO * CI

    float res[GROUP_B][GROUP_CI], x[GROUP_B][GROUP_CI];
    #pragma unroll
    for (int i = 0; i < GROUP_B; i++) {
        #pragma unroll
        for (int j = 0; j < GROUP_CI; j++)
            res[i][j] = 0;
    }
    #pragma unroll
    for (int i = 0; i < GROUP_B; i++) {
        int write_ci = blockIdx.y * (BLOCK_SIZE * GROUP_CI) + (has_hw ? threadIdx.y : threadIdx.x);
        #pragma unroll
        for (int j = 0; j < GROUP_CI; j++) {
            int channel = write_ci + j * BLOCK_SIZE;
            bool success;
            int offset = i_conf.get_offset(G, CI_div_G, HW, b[i], blockIdx.z, channel, hw[i], success);
            x[i][j] = success && b[i] < B && channel < CI_div_G ? input[offset] : 0;
        }
    }

    for (int k = 0; k < CO_div_G; k += BLOCK_SIZE) {
        #pragma unroll
        for (int i = 0; i < GROUP_B; i++) {
            int channel = k + (has_hw ? threadIdx.y : threadIdx.x);
            if (b[i] < B) {
                int output_offset = ((b[i] * G + blockIdx.z) * CO_div_G + channel) * HW + hw[i];
                float value_o = check_co && channel >= CO_div_G ? 1e10f : output[output_offset];
                float value_g = check_co && channel >= CO_div_G ? 0 : grad_output[output_offset];
                blockO[i][threadIdx.y][threadIdx.x] = value_o;
                blockG[i][threadIdx.y][threadIdx.x] = value_g;
            }
        }
        int read_w_ci = blockIdx.y * (BLOCK_SIZE * GROUP_CI) + threadIdx.x;
        #pragma unroll
        for (int i = 0; i < GROUP_CI; i++) {
            int in_channel = read_w_ci + i * BLOCK_SIZE;
            int out_channel = k + threadIdx.y;
            if (in_channel < CI_div_G) {
                int w_offset = (blockIdx.z * CO_div_G + out_channel) * CI_div_G + in_channel;
                if (check_co) {
                    blockW[i][threadIdx.y][threadIdx.x] = out_channel < CO_div_G ? weight[w_offset] : 0;
                    blockB[i][threadIdx.y][threadIdx.x] = out_channel < CO_div_G ? bias[w_offset] : 0;
                }
                else {
                    blockW[i][threadIdx.y][threadIdx.x] = weight[w_offset];
                    blockB[i][threadIdx.y][threadIdx.x] = bias[w_offset];
                }
            }
        }
        __syncthreads();
        #pragma unroll
        for (int t = 0; t < BLOCK_SIZE; t++) {
            #pragma unroll
            for (int i = 0; i < GROUP_B; i++) {
                #pragma unroll
                for (int j = 0; j < GROUP_CI; j++) {
                    float g = has_hw ? blockG[i][t][threadIdx.x] : blockG[i][threadIdx.y][t];
                    float w = blockW[j][t][has_hw ? threadIdx.y : threadIdx.x];
                    float bias = blockB[j][t][has_hw ? threadIdx.y : threadIdx.x];
                    float o = has_hw ? blockO[i][t][threadIdx.x] : blockO[i][threadIdx.y][t];
                    res[i][j] += update_backward(x[i][j], w, bias, o, g) * w;
                }
            }
        }
        __syncthreads();
    }
    #pragma unroll
    for (int i = 0; i < GROUP_B; i++) {
        if (b[i] < B) {
            int write_ci = blockIdx.y * (BLOCK_SIZE * GROUP_CI) + (has_hw ? threadIdx.y : threadIdx.x);
            #pragma unroll
            for (int j = 0; j < GROUP_CI; j++) {
                int channel = write_ci + j * BLOCK_SIZE;
                bool success;
                int offset = i_conf.get_offset(G, CI_div_G, HW, b[i], blockIdx.z, channel, hw[i], success);
                if (success && channel < CI_div_G) {
                    if (!is_conv(i_conf)) grad_input[offset] = res[i][j]; // Fully connected
                    else atomicAdd(&grad_input[offset], res[i][j]);
                }
            }
        }
    }
}

// A simple testing program, which tests the two instantiations.
int main() {
    int B = 512, CI = 128, CO = 128, H = 16, W = 16;
    float *grad_output, *input, *input_unfold, *weight, *bias, *output;
    cudaMallocManaged(&grad_output, B * CO * H * W * sizeof(float));
    cudaMallocManaged(&output, B * CO * H * W * sizeof(float));
    cudaMallocManaged(&input, B * CI * H * W * sizeof(float));
    cudaMallocManaged(&input_unfold, B * CI * 3 * 3 * H * W * sizeof(float));
    cudaMallocManaged(&weight, CO * CI * 3 * 3 * sizeof(float));
    cudaMallocManaged(&bias, CO * CI * 3 * 3 * sizeof(float));
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((B * H * W - 1) / (BLOCK_SIZE * 4) + 1, (CI * 3 * 3 - 1) / (BLOCK_SIZE * 4) + 1, 1);
    logsumexp_backward_input_kernel<4, 4, true, false><<<dimGrid, dimBlock>>>(
        grad_output, input_unfold, weight, bias, output, B, CO, CI * 3 * 3, H * W, 1, input_unfold, FCInput{});
    logsumexp_backward_input_kernel<4, 4, true, false><<<dimGrid, dimBlock>>>(
        grad_output, input, weight, bias, output, B, CO, CI * 3 * 3, H * W, 1, input, ConvInput<3, 1, false>{H, W, 1, W});
}

What is your hardware and CUDA version? On a machine with a Titan Xp and CUDA 11.5, the second kernel, the one with more registers, is the faster of the two. (I added timing to your main.)

// A simple testing program, which tests the two instantiations.
#include <iostream>

int main() {
    float time;
    cudaEvent_t eventA, eventB;
    cudaEventCreate(&eventA);
    cudaEventCreate(&eventB);

    int B = 512, CI = 128, CO = 128, H = 16, W = 16;
    float *grad_output, *input, *input_unfold, *weight, *bias, *output;
    cudaMallocManaged(&grad_output, B * CO * H * W * sizeof(float));
    cudaMallocManaged(&output, B * CO * H * W * sizeof(float));
    cudaMallocManaged(&input, B * CI * H * W * sizeof(float));
    cudaMallocManaged(&input_unfold, B * CI * 3 * 3 * H * W * sizeof(float));
    cudaMallocManaged(&weight, CO * CI * 3 * 3 * sizeof(float));
    cudaMallocManaged(&bias, CO * CI * 3 * 3 * sizeof(float));
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((B * H * W - 1) / (BLOCK_SIZE * 4) + 1, (CI * 3 * 3 - 1) / (BLOCK_SIZE * 4) + 1, 1);

    cudaEventRecord(eventA);
    logsumexp_backward_input_kernel<4, 4, true, false><<<dimGrid, dimBlock>>>(
        grad_output, input_unfold, weight, bias, output, B, CO, CI * 3 * 3, H * W, 1, input_unfold, FCInput{});
    cudaEventRecord(eventB);
    cudaEventSynchronize(eventB);
    cudaEventElapsedTime(&time, eventA, eventB);
    std::cerr << "time normal input: " << time << "\n";

    cudaEventRecord(eventA);
    logsumexp_backward_input_kernel<4, 4, true, false><<<dimGrid, dimBlock>>>(
        grad_output, input, weight, bias, output, B, CO, CI * 3 * 3, H * W, 1, input, ConvInput<3, 1, false>{H, W, 1, W});
    cudaEventRecord(eventB);
    cudaEventSynchronize(eventB);
    cudaEventElapsedTime(&time, eventA, eventB);
    std::cerr << "time ConvInput: " << time << "\n";
}
nvcc -std=c++17 -arch=sm_61 -Xptxas "-v" -O3 main.cu -o main
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z31logsumexp_backward_input_kernelILi4ELi4ELb1ELb0E9ConvInputILi3ELi1ELb0EEEvPKfS3_S3_S3_S3_iiiiiPfT3_' for 'sm_61'
ptxas info    : Function properties for _Z31logsumexp_backward_input_kernelILi4ELi4ELb1ELb0E9ConvInputILi3ELi1ELb0EEEvPKfS3_S3_S3_S3_iiiiiPfT3_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 118 registers, 16384 bytes smem, 408 bytes cmem[0]
ptxas info    : Compiling entry function '_Z31logsumexp_backward_input_kernelILi4ELi4ELb1ELb0E7FCInputEvPKfS2_S2_S2_S2_iiiiiPfT3_' for 'sm_61'
ptxas info    : Function properties for _Z31logsumexp_backward_input_kernelILi4ELi4ELb1ELb0E7FCInputEvPKfS2_S2_S2_S2_iiiiiPfT3_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 76 registers, 16384 bytes smem, 393 bytes cmem[0]

time normal input: 103.969
time ConvInput: 59.6164

That’s quite strange. My GPU is an RTX 3090 (Ampere) with CUDA 11.1. I use Nsight Compute to profile, and it shows that the elapsed cycles of kernel 1 are 21,862,185 while the elapsed cycles of kernel 2 are 33,637,987.
The compilation command:
nvcc main.cu

I am not sure whether cudaMallocManaged also needs a synchronize before the timing (which could explain why the first kernel takes more time).
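If page migration is the concern, one way to take it out of the first timed launch would be to prefetch the managed allocations to the GPU (or run an untimed warm-up launch) before the first cudaEventRecord. A minimal sketch, assuming device 0:

    int dev = 0;
    // migrate the managed buffers to the GPU so the first timed kernel
    // does not pay the demand-paging cost
    cudaMemPrefetchAsync(grad_output, B * CO * H * W * sizeof(float), dev);
    cudaMemPrefetchAsync(output, B * CO * H * W * sizeof(float), dev);
    cudaMemPrefetchAsync(input, B * CI * H * W * sizeof(float), dev);
    cudaMemPrefetchAsync(input_unfold, B * CI * 3 * 3 * H * W * sizeof(float), dev);
    cudaMemPrefetchAsync(weight, CO * CI * 3 * 3 * sizeof(float), dev);
    cudaMemPrefetchAsync(bias, CO * CI * 3 * 3 * sizeof(float), dev);
    cudaDeviceSynchronize();  // make sure migration has finished before timing starts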

For what it is worth, lines 72 and 131 in the code I copied from the forum seem identical by visual inspection. You might want to call our attention to the relevant difference.

Which one is kernel 1 and which one is kernel 2? On my machine with a Quadro RTX 4000, using the code as posted, the kernel with FCInput is 15% faster than the kernel with ConvInput. The ConvInput kernel is the one with the higher register usage.

The immediate cause of the register pressure is all the unrolling that is going on, and this effect is much more pronounced in the ConvInput kernel on my platform. Conventional wisdom says (independent of GPUs) that unrolling outermost loops can be counterproductive.

What I would try here is first moving to a fully rolled version, by changing all the pragmas to #pragma unroll 1. This results in code with very low register pressure. Then introduce unrolling one loop at a time, generally working from innermost to outermost loops. That should let you find the best execution time based on the optimal mix of unrolling and register pressure for your platform (it may well be a sub-optimal mix for a different platform).
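To make that concrete, here is roughly what the starting point would look like for the main loop nest of the posted kernel (a sketch, not a tested change); from there, flip individual #pragma unroll 1 back to #pragma unroll, innermost level first, re-measuring after each change:

        #pragma unroll 1    // start fully rolled
        for (int t = 0; t < BLOCK_SIZE; t++) {
            #pragma unroll 1    // start fully rolled
            for (int i = 0; i < GROUP_B; i++) {
                #pragma unroll    // re-enable unrolling here first, then work outward if it helps
                for (int j = 0; j < GROUP_CI; j++) {
                    float g = has_hw ? blockG[i][t][threadIdx.x] : blockG[i][threadIdx.y][t];
                    float w = blockW[j][t][has_hw ? threadIdx.y : threadIdx.x];
                    float bias = blockB[j][t][has_hw ? threadIdx.y : threadIdx.x];
                    float o = has_hw ? blockO[i][t][threadIdx.x] : blockO[i][threadIdx.y][t];
                    res[i][j] += update_backward(x[i][j], w, bias, o, g) * w;
                }
            }
        }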

When everything is unrolled, as is the case with the posted code, the resulting machine code is so voluminous it would be a punishing task for someone trying to identify the connection between the source code changes and what is happening at the SASS level.

When I use cudaMalloc instead of cudaMallocManaged, the normal input now runs faster.

cudaMallocManaged:
time normal input: 103.969
time ConvInput: 59.6164

cudaMalloc:
time normal input: 38.405
time ConvInput: 64.5285

Maybe the difference between the kernels is caused by different memory access patterns, not register counts.
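For reference, the substitution is just the allocation call (a sketch of what I mean; the buffers remain uninitialized either way, since the program only measures timing):

    cudaMalloc(&grad_output, B * CO * H * W * sizeof(float));   // was cudaMallocManaged
    cudaMalloc(&output, B * CO * H * W * sizeof(float));
    cudaMalloc(&input, B * CI * H * W * sizeof(float));
    cudaMalloc(&input_unfold, B * CI * 3 * 3 * H * W * sizeof(float));
    cudaMalloc(&weight, CO * CI * 3 * 3 * sizeof(float));
    cudaMalloc(&bias, CO * CI * 3 * 3 * sizeof(float));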

Different memory access patterns, coupled with different PCIe capabilities and system memory throughput when migrating pages between host and device, and with potential differences in the page migration mechanism between driver versions, might make cross-platform performance comparisons difficult.

Is the use of cudaMallocManaged necessary, or was this just chosen for convenience?

For what it is worth, lines 72 and 131 in the code I copied from the forum seem identical by visual inspection. You might want to call our attention to the relevant difference.

Yes, they are identical. What I mean is that lines 72 and 131 are the only difference between the FCInput kernel and the ConvInput kernel. I refer to the FCInput one as kernel 1 and the ConvInput one as kernel 2.

The immediate cause of the register pressure is all the unrolling that is going on, and this effect is much more pronounced in the ConvInput kernel on my platform. Conventional wisdom says (independent of GPUs) that unrolling outermost loops can be counterproductive.

I only unroll the “small” loops. For example, GROUP_B is 4 and GROUP_CI is 4, so I unroll the loops over those. I do not unroll the outermost loop (line 77).

I had tried removing the unrolling, but I remember that the performance dropped. In particular, for the core part of the program (lines 107 - 120), removing the unrolling seems to be harmful.

When everything is unrolled, as is the case with the posted code, the resulting machine code is so voluminous it would be a punishing task for someone trying to identify the connection between the source code changes and what is happening at the SASS level.

Indeed, the machine code is now very hard to read. In my experience, all loops except for lines 107 - 120 can use #pragma unroll 1 without sacrificing efficiency. Perhaps it would be easier to analyze the machine code after setting #pragma unroll 1 for these unimportant loops.

That is just for debugging the kernel. Either cudaMalloc or cudaMallocManaged is OK. But I think the memory allocation process should not affect the kernel performance when profiling the kernel with Nsight Compute. Please correct me if I have misunderstood.

Sorry, you lost me there. They are identical yet the only difference?

I meant the “more outer” loops. Full unrolling is usually a good idea for the deepest loop levels only and can become counterproductive if one goes beyond that (for example, on CPUs this might cause poor ICache usage).

I am not advocating removing all unrolling. I am advocating the judicious use of unrolling by unrolling only those loops where that results in a performance benefit. Just try them one by one, profiling after each change. You could experiment with partial unrolling by supplying a specific unroll factor, but in my experience the compiler usually does a better job of doing that by itself.

Here is one experiment I tried. I removed all pragmas. That resulted in nearly identical execution times for the two kernels. I then checked which loops showed a performance difference when #pragma unroll was added. The three-deep loop nest starting at line 107 is the only one where I see a difference (applying the pragma to all three levels). Interestingly, the time for the FCInput kernel decreased, while the time for the ConvInput kernel increased.

Sorry, I may not have expressed myself clearly. I mean that for the two kernels, FCInput and ConvInput, the difference is only in the last function parameter Input i_conf, where Input is a template parameter that is either FCInput or ConvInput. This parameter is only used in lines 72 and 131. So the difference between the FCInput kernel and the ConvInput kernel lies only in lines 72 and 131.

The strangest thing for me is that the ConvInput kernel uses 32 more registers than the FCInput kernel, yet the two kernels are identical except for lines 72 and 131. Since the loop unrolling is also the same for the two kernels, I do not know why the ConvInput kernel needs so many more registers.

I have tried removing all the #pragma unroll directives from my program; however, it seems that both kernels then use more registers. The FCInput kernel now uses 72 registers while the ConvInput kernel uses 108. Before removing #pragma unroll, the FCInput kernel used 64 registers and the ConvInput kernel used 96. I am not sure whether this depends on the GPU architecture. I use an RTX 3090 (Ampere) with CUDA 11.1.

With no unroll pragmas present, the compiler decides (based on heuristics) which loops to unroll.

Register allocation might be tuned according to which sm_ level you compile the code for.

I don’t find that surprising given that handling ConvInput is significantly more complex than handling FCInput, per the source code.

As @cbuchner1 says, without any pragmas, the compiler chooses whether to use unrolling, and if so, whether to partially or fully unroll loops. And these choices can differ by target architecture and compiler version. Various posts in these forums lead me to believe that the latest CUDA compiler may be overly aggressive when it comes to loop unrolling.

In any event the compiler makes optimization decisions based on a collection of heuristics, and its objective is not low register pressure but high performance. As long as using more registers results in increased performance, that is fine. However, the CUDA compiler does not support profile-directed optimizations yet, so its heuristics have to estimate the performance effects of the trade-offs it makes, and estimates can be wrong at times.

In most cases, the compiler makes near-optimal trade-offs when applying optimizations. So what I am advocating is to let the compiler do its thing (= baseline), and then build on top of that by introducing targeted source code changes (as few as possible) if that can be shown to improve performance. Introduce one change at a time, with proper profiling performed after each change.

I don’t find that surprising given that handling ConvInput is significantly more complex than handling FCInput, per the source code.

However, in my opinion the complexity of calculating the indices for ConvInput is independent of the core calculation, and the register bottleneck should lie in the main loop. The registers should be reused. So I do not really understand why ConvInput needs 32 more registers.

In other words, even if we used local memory to calculate the indices for ConvInput, the performance should not change, since lines 72 and 131 are only executed 4 times, which costs a negligible amount of computation. So currently the compiler makes an extremely sub-optimal choice by using 32 more registers, making the kernel 50% slower.

However, the CUDA compiler does not support profile-directed optimizations yet, so its heuristics have to estimate the performance effects of the trade-offs it makes, and estimates can be wrong at times.

I know it might be difficult for compilers to choose the best register usage. So I wonder whether there are ways to bypass the problem and guide/instruct the compiler to make the right choice. (I have tried limiting the register count and using launch bounds, but they do not work.) Thank you!
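For reference, these are the two mechanisms I mean (a sketch with a made-up kernel name, not my actual code; the exact numbers I experimented with varied):

// Option 1: per-kernel launch bounds. With blocks of BLOCK_SIZE * BLOCK_SIZE = 256 threads,
// asking for at least 4 resident blocks per SM forces the compiler to keep per-thread
// register usage low enough for that occupancy.
__global__ void __launch_bounds__(256, 4) some_kernel(const float* in, float* out) { /* stand-in body */ }

// Option 2: cap registers for the whole compilation unit on the command line:
//     nvcc -maxrregcount=64 main.cu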

Speculating about this is not going to help. To decide what is reasonable requires a fairly extensive analysis of the code transformations actually taking place in the context of an understanding of how compilers work internally. While optimization bugs do occur, most of the time everything works as designed, even if the end result may not please the programmer. Heuristics cannot be 100% accurate.

Things that can happen in the presence of lots of unrolling: (1) The compiler discovers many more opportunities for CSE (common subexpression elimination), and assigns these common subexpressions to additional registers. (2) The compiler discovers lots of new opportunities to schedule loads early; the data from early loads is held in additional registers. Whether either of these two common scenarios applies here, I cannot say, because I am not going to spend hours analyzing SASS to confirm or refute such hypotheses. The thing to keep in mind is that relatively small changes at source level can lead to relatively significant changes at machine code level, if such a small change allows the compiler to discover additional opportunities for code transformation.
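As a toy illustration of scenario (2), with made-up helper functions rather than anything from the posted kernel: once a loop is fully unrolled, the compiler is free to hoist all the loads to the top and keep each loaded value in its own register until the corresponding FMA consumes it, which hides latency at the cost of extra registers.

__device__ float dot4_rolled(const float* a, const float* b) {
    float acc = 0.0f;
    #pragma unroll 1              // loop stays rolled: roughly one pair of loads in flight per iteration
    for (int j = 0; j < 4; j++)
        acc += a[j] * b[j];
    return acc;
}

__device__ float dot4_unrolled(const float* a, const float* b) {
    float acc = 0.0f;
    #pragma unroll                // fully unrolled: all eight loads may be issued up front,
    for (int j = 0; j < 4; j++)   // with their results held in extra registers until consumed
        acc += a[j] * b[j];
    return acc;
}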

I am getting the impression that there is too much focus on register usage here, given that register use is, in general, only weakly correlated with performance. And presumably high performance is the goal, regardless of how it is achieved “under the hood”. What I would suggest is starting with the compiler defaults, then using the CUDA profiler to guide targeted interventions.

Keep any source-level interventions to a minimum (e.g. avoid adding #pragma unroll to a dozen loops), as they tend to be brittle, easily becoming ineffective or even counterproductive with the next CUDA version or a new GPU architecture. Been there, done that, got the t-shirt. In that case (constant rewriting of CUDA math library code) I did not have much of a choice, as both GPU architectures and CUDA toolchain were still evolving rapidly in the early days of CUDA and required lots of handholding. These days both are more mature and change at a slower rate.