Register usage behaviour

I’m writing a 1D convolution kernel. This is my initial code:

__global__ void conv1d(const float *A, const float *B, float *C, size_t N, size_t K) {
    const int tid = threadIdx.x;
    float res[COARSE_FACTOR] = {0};  // one accumulator per coarsened output
    const int index = tid + blockIdx.x * BLOCK_SIZE * COARSE_FACTOR;
    const int COARSE_BLOCK_SIZE = BLOCK_SIZE * COARSE_FACTOR;
    // Double-width tiles: the second half holds the next tile of A so the
    // sliding window can cross the tile boundary.
    __shared__ float sA[COARSE_BLOCK_SIZE * 2], sB[COARSE_BLOCK_SIZE * 2];

    const int offset = -((int)K / 2);  // center the filter on each output
    int a_index, b_index;

    // Iterate over the filter in tiles of COARSE_BLOCK_SIZE elements.
    for (int i = 0; i < (K + COARSE_BLOCK_SIZE - 1) / COARSE_BLOCK_SIZE; i++) {
        b_index = i * COARSE_BLOCK_SIZE + tid;
        a_index = index + offset + i * COARSE_BLOCK_SIZE;

        if (i == 0) {
            // First tile: load both halves of sA, zero-padding out-of-range input.
            for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
                int base_index = j + tid;
                int pos0 = a_index + j;
                sA[base_index] = (pos0 >= 0 && pos0 < N) ? A[pos0] : 0;
            }
            for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
                int base_index = j + tid + COARSE_BLOCK_SIZE;
                int pos0 = a_index + j + COARSE_BLOCK_SIZE;
                sA[base_index] = (pos0 >= 0 && pos0 < N) ? A[pos0] : 0;
            }
        } else {
            // Later tiles: shift the second half of sA into the first half
            // and stream the next tile of A into the second half.
            for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
                int base_index = tid + j;
                sA[base_index] = sA[base_index + COARSE_BLOCK_SIZE];
                int pos = a_index + COARSE_BLOCK_SIZE + j;
                sA[base_index + COARSE_BLOCK_SIZE] = (pos < N) ? A[pos] : 0;
            }
        }

        // Load the current tile of the filter, zero-padding past K.
        for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
            int pos = b_index + j;
            sB[tid + j] = (pos < K) ? B[pos] : 0;
        }

        __syncthreads();

        // Each thread accumulates COARSE_FACTOR outputs spaced BLOCK_SIZE apart.
        for (int k = 0; k < COARSE_FACTOR; k++) {
            for (int j = 0; j < COARSE_BLOCK_SIZE; j++) {
                res[k] += sA[j + k * BLOCK_SIZE + tid] * sB[j];
            }
        }

        __syncthreads();
    }

    // Write back the strided results.
    for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
        if (index + j < N)
            C[index + j] = res[j / BLOCK_SIZE];
    }
}
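For context, BLOCK_SIZE and COARSE_FACTOR are compile-time macros, and each block produces BLOCK_SIZE * COARSE_FACTOR outputs, so the launch looks roughly like this (the macro values and device pointer names here are illustrative, not the ones I benchmarked):

#define BLOCK_SIZE    128
#define COARSE_FACTOR 8

// One block covers BLOCK_SIZE * COARSE_FACTOR consecutive outputs.
const int tile = BLOCK_SIZE * COARSE_FACTOR;
conv1d<<<(N + tile - 1) / tile, BLOCK_SIZE>>>(dA, dB, dC, N, K);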

The code uses thread coarsening, where each thread calculates COARSE_FACTOR results with stride BLOCK_SIZE. I then modified the code so that each thread calculates COARSE_FACTOR consecutive results:

// New code for calculating results
for (int k = 0; k < COARSE_FACTOR; k++) {
    for (int j = 0; j < COARSE_BLOCK_SIZE; j++) {
        res[k] += sA[j + k + COARSE_FACTOR * tid] * sB[j];
    }
}
// New code for saving calculated results
for (int j = 0; j < COARSE_FACTOR; j++) {
    int index = COARSE_BLOCK_SIZE * blockIdx.x + tid * COARSE_FACTOR + j;
    if (index < N)
        C[index] = res[j];
}

To my surprise, the modified code ran around 50% faster on my device (RTX 3060).

Profiling the two versions above with ncu gives the following:
Initial code:

Registers Per Thread              register/thread       242
Block Limit Registers                       block         8
Block Limit Shared Mem                      block        12
Theoretical Active Warps per SM              warp         8
Theoretical Occupancy                           %     16.67
Achieved Occupancy                              %     16.14
Achieved Active Warps Per SM                 warp      7.75

And the modified code:

Registers Per Thread              register/thread        40
Block Limit Registers                       block        48
Block Limit Shared Mem                      block        20
Theoretical Active Warps per SM              warp        16
Theoretical Occupancy                           %     33.33
Achieved Occupancy                              %     31.22
Achieved Active Warps Per SM                 warp     14.99

The modified code somehow reduces the register usage to 40 regs/thread, which also leads to increased occupancy. I also notice that if I increase COARSE_FACTOR in the modified code past a certain threshold (10 on my device, which I think is dictated by the maximum shared memory per SM), the register usage jumps to 255 regs/thread.
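(As a sanity check on the occupancy numbers, assuming on my part a one-warp block of 32 threads on sm_86, which has 65,536 registers, at most 16 resident blocks and 48 resident warps per SM: 242 registers/thread is rounded up to 248 by allocation granularity, so one block needs 248 × 32 = 7,936 registers, only 8 blocks fit, and 8/48 ≈ 16.67% theoretical occupancy; at 40 registers/thread registers no longer limit residency, the 16-blocks-per-SM cap takes over, and 16/48 ≈ 33.33%.)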

Can someone explain this behavior, and point me to where I can read more about it? Thank you.

As it seems COARSE_FACTOR is known at compile time, did you try enforcing loop unrolling (or the opposite, blocking it) and seeing the difference?

For a low COARSE_FACTOR the compiler is probably able to use a separate register for each element of res[COARSE_FACTOR]; with larger arrays it will probably spill into local memory.
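A minimal sketch of what I mean, using the standard CUDA unroll pragmas on the first version’s loop nest (which loop to pin is my guess):

__syncthreads();
#pragma unroll 1              // keep the outer loop rolled...
for (int k = 0; k < COARSE_FACTOR; k++) {
    #pragma unroll            // ...while the inner loop may still unroll
    for (int j = 0; j < COARSE_BLOCK_SIZE; j++) {
        res[k] += sA[j + k * BLOCK_SIZE + tid] * sB[j];
    }
}
__syncthreads();

When the loops over k are fully unrolled, every res[k] index is a compile-time constant and each element can be kept in its own register; once the array must be indexed dynamically, or too many values are live at once, it spills to local memory.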

Multiplications can also be expensive: (k * BLOCK_SIZE) must be calculated in each iteration, COARSE_FACTOR * COARSE_BLOCK_SIZE times in total, while in the second version of the code (COARSE_FACTOR * tid) can be calculated only once.
The same goes for the other loop: (j / BLOCK_SIZE) is expensive, even if BLOCK_SIZE is fixed and known at compile time.
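A sketch of the manual strength reduction I have in mind (the bookkeeping variable r is mine):

// Second version: hoist the loop-invariant product, computed once per thread.
const int row = COARSE_FACTOR * tid;
for (int k = 0; k < COARSE_FACTOR; k++)
    for (int j = 0; j < COARSE_BLOCK_SIZE; j++)
        res[k] += sA[j + k + row] * sB[j];

// First version's store loop: a running counter r replaces j / BLOCK_SIZE.
for (int j = 0, r = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE, r++)
    if (index + j < N)
        C[index + j] = res[r];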

By the way, do you see any difference if you replace BLOCK_SIZE in the index calculation with blockDim.x?

Hi, thank you for your suggestions.

I tried limiting and blocking loop unrolling in both versions. The performance degraded in proportion to the limitation.

In the first version, completely blocking unrolling reduces register usage to 40. Register usage in this version also seems to be proportional to COARSE_FACTOR.

But COARSE_FACTOR here is relatively small; I don’t understand why a small increase (8 → 11) could push the register count from 40 to 255 and then spill into local memory.

I think for the first version, the k * BLOCK_SIZE expression is calculated only COARSE_FACTOR times, because it does not change in the inner loop.

And because COARSE_FACTOR and BLOCK_SIZE are both powers of 2, the multiplications and divisions should be translated into bit shifts. I tried manually replacing them with bit shifts and there was no difference in performance.

I guess the difference between the two versions is mainly due to the increased occupancy.

Replacing BLOCK_SIZE with blockDim.x degraded performance by around 10–15% on my device.

The relevant part of the code appears to be this:

        __syncthreads();
//#pragma unroll 1
        for (int k = 0; k < COARSE_FACTOR; k++) {
            for (int j = 0; j < COARSE_BLOCK_SIZE; j++) {
                res[k] += sA[j + k * BLOCK_SIZE + tid] * sB[j];
            }
        }
        __syncthreads();

This part is also easy to identify in the generated code because it is framed by the synchronization primitives: bar.sync at the PTX level and BAR.SYNC.DEFER_BLOCKING at the SASS level. In addition, one can specify -lineinfo on the nvcc command line to match source lines with generated code.
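For example (source file name hypothetical):

nvcc -arch=sm_86 -lineinfo -Xptxas -v -c conv1d.cu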

I don’t know what their actual values are, but when I looked at this using CUDA 12.5 I used

#define COARSE_FACTOR     (8)
#define BLOCK_SIZE        (128)

For an sm_86 target, using -Xptxas -v, I then see these stats:

 8800 bytes stack frame, 10036 bytes spill stores, 23096 bytes spill loads
 Used 255 registers, 8800 bytes cumulative stack size

The compiler completely blows the register budget and spills heavily, leading to poor performance. The proximate cause seems to be the unroller inside ptxas going haywire, unrolling this loop nest of depth 2 completely. Obviously this results in a massive linear block of code, but why this block also needs so many registers is not immediately clear from looking at SASS. From a cursory look it seems to split the dot product into multiple partial ones? There seem to be mechanical transformations other than pure unrolling at work.
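To illustrate the kind of transformation I mean (a hypothetical reconstruction, not the actual SASS): each dot product may be split into several independent partial sums, which shortens the floating-point dependency chain but keeps one extra value live in a register per partial sum, and with the outer loop fully unrolled this multiplies across all COARSE_FACTOR outputs.

// Hypothetical sketch: one dot product split into four partial sums.
__device__ float dot4(const float *sA, const float *sB, int base, int n) {
    float p0 = 0.f, p1 = 0.f, p2 = 0.f, p3 = 0.f;
    for (int j = 0; j < n; j += 4) {  // n assumed to be a multiple of 4
        p0 += sA[base + j + 0] * sB[j + 0];
        p1 += sA[base + j + 1] * sB[j + 1];
        p2 += sA[base + j + 2] * sB[j + 2];
        p3 += sA[base + j + 3] * sB[j + 3];
    }
    return (p0 + p1) + (p2 + p3);
}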

The PTX intermediate representation looks more reasonable: It unrolls the inner loop partially, by a factor of 16, and the outer loop completely, presumably since COARSE_FACTOR is “small” and compile-time constant. I am not sure unrolling the outer loop here makes much sense but it might provide epsilon benefit.

By preventing the outer loop from being unrolled (see commented-out #pragma) the generated code looks much more reasonable, as does the register usage:

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 40 registers

The inner loop is still getting unrolled aggressively, so performance should be good.

If this problem persists with the latest compiler (CUDA 12.8.1) I would recommend filing a bug with NVIDIA. The unroller heuristics in ptxas need some re-work to avoid tanking performance by introducing massive spilling.


I was using nvcc 12.6, and blocking the outer loop from unrolling did indeed reduce the register usage, but the performance was still worse than the second version.

After upgrading to the latest compiler version, both codes now perform virtually the same without manually interfering with the compiler’s unroll decisions. Thank you so much!

@njuffa Seems like I messed up the benchmark in the last reply, since I was still using the old driver. The new compiler (release 12.8, V12.8.93) does not resolve the issue. I will file a report with NVIDIA.

Thanks for filing a ticket. To link the ticket: it corresponds to NVBUG ID 5217221. It is under review; we will interact with the requester and bring the conclusion back here.