Registers, spilling and local memory - Unexpected behaviour?

Good morning everyone,

So recently I have been debugging a code of mine in which I try to optimize the GPU usage by allocating information into registers. I pay attention to the NVCC output so that spilling is 0 and there is not any allocation to local memory whatsoever. However, when looking at Nvidia Nsight Compute, it is saying me that there are loads of information that are being transfered to local memory instead of using registers. As the NVCC tells me that there is no spilling/local memory allocation, why is this happening? I created small algorithm that replicates this issue, in which the NVCC tells you that there has been 0 spills, with no reference to local memory, but then the Nsight Compute shows you that there’s been spilling to local memory.

The algorithm is below:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>


__global__ void addKernel(float*a, float*b)
{
    register float TData[8*36];
    int warpID = (threadIdx.x >> 5);
    for (int i = 0; i < (8 * 36); i++)
    {
        TData[i] = a[i+(8 * 36* warpID)];
    }
    for (int j = 0; j < (8 * 36); j++) {
        TData[j] = TData[j] * 1.24;
    }
    for (int z = 0; z < (8 * 36); z++)
    {
        b[z + (8 * 36* warpID)] = TData[z];
    }
}

int main()
{
    const int arraySize = 8*36*4;
    float a[arraySize] = { 1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,
    1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36 ,
    1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36 ,
    1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,1, 2, 3, 4, 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36 };
    float*d_a, *d_b;
    float b[arraySize];
    cudaMalloc(&d_a, arraySize * sizeof(float));
    cudaMalloc(&d_b, arraySize * sizeof(float));
    cudaMemcpy(d_a, a, arraySize * sizeof(float), cudaMemcpyHostToDevice);
    addKernel << <1, 128,0,0 >> > (d_a, d_b);
    cudaMemcpy(b, d_b, arraySize * sizeof(float), cudaMemcpyDeviceToHost);
    return 0;
}

The GPU is the RTX2060 and it’s running on CUDA 10.2. If you need any more information I will be pleased to provide it.

Thanks for any help you can bring!

The register keyword is merely a suggestion to the compiler

The TData array is too big to be kept in registers (definitely requires more than 255 regs per thread). The compiler places this in local memory. It’s not a register spill in the traditional sense.

If the array was small enough to fit into registers, the following will apply:

Since registers are not indexable in hardware instructions, the compiler must be able to determine the indices at compile time. This means it is required to fully unroll the for loops with the loop index i, j, z. Use a #pragma unroll statement with an explicitly given trip count that is matching the number of iterations. Only then the array has a chance to live in registers.

Thanks for answering cbuchner1 :)

I know the register keyword is just a hint for the compiler, and indeed, 255 regs per thread is the maximum, and 368 is a little bit over that value. However, I’ve just tried with 308, which is 240 registers per thread and it still happens. And, 240*128 is 30720, which is below as far as I know the threshold of registers per SM in Turing.

Nonetheless, I have tried the #pragma unroll statement you mentioned and now it effectively allocates the registers needed, and there is no usage at all of the local memory! Thank you very much for the tip!!

the drawback is that 240+ registers per thread won’t allow you to launch large blocks anymore.

And if you restrict the number of registers per thread to say 128 or 64 (kernel specific with launch_bounds, or cu module specific with the maxrregcount compiler option) you would see spilling to local memory again.

Check if shared memory might be an alternative to store some of your array data.