Contradictory register count report when calling a non-inlined function

The SASS live-register count that Nsight Compute reports when profiling a kernel shows a massive (incorrect?) spike in register usage, more than 100 registers, on reaching a CALL.ABS.NOINC instruction (as shown in the attached image). The code below reproduces this behavior when profiled in Nsight Compute. Any device function decorated with __noinline__ shows the same phenomenon, and nested __noinline__ calls have a multiplicative effect, with the reported register count climbing up to the cap of 255. This contradicts what nvcc reports when compiling the same code (nvcc -arch=sm_80 -Xptxas="-v" kernel.cu), which states that the kernel uses 17 registers. Is the massive spike actually occurring (and if so, can it be removed), or is this a bug in Nsight Compute?

#include <stdint.h>
#include <cuda_runtime.h>

extern "C" {
    __global__ void kernel(float* out) {
        uint32_t n = threadIdx.x + blockIdx.x*blockDim.x;
        out[n] = atan2f(static_cast<float>(n), 2.0f);
    }
}

int main(int argc, char const* argv[]) {

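    float* d_ary;
    // Allocate room for 32 floats (one per thread), not 32 bytes.
    cudaMalloc(&d_ary, 32 * sizeof(float));

    kernel<<<1,32>>>(d_ary);

    float ary[32];

    // Blocking copy of all 32 floats back to the host.
    cudaMemcpy(ary, d_ary, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_ary);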

    return 0;
}
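
The nested __noinline__ case mentioned above can be sketched along the following lines. This is only an illustration, not the code that was actually profiled: the names inner, outer, nested_kernel, and kernel_num_regs are hypothetical. It also queries the per-thread register count recorded for the compiled kernel through cudaFuncGetAttributes, as a runtime cross-check against the ptxas -v output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: __noinline__ asks the compiler to emit a real call
// rather than inlining the function body into the caller.
__device__ __noinline__ float inner(float x) {
    return atan2f(x, 2.0f);
}

// Hypothetical helper: a second __noinline__ level that calls the first.
__device__ __noinline__ float outer(float x) {
    return inner(x) + inner(x + 1.0f);
}

__global__ void nested_kernel(float* out) {
    unsigned int n = threadIdx.x + blockIdx.x * blockDim.x;
    out[n] = outer(static_cast<float>(n));
}

// Returns the registers per thread recorded for nested_kernel, or -1 on error.
int kernel_num_regs() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, nested_kernel) != cudaSuccess) {
        return -1;
    }
    return attr.numRegs;
}

int main() {
    const int n = 32;
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        return 1;
    }

    nested_kernel<<<1, n>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();  // wait for the kernel to finish

    printf("registers per thread (runtime query): %d\n", kernel_num_regs());

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Built with the same nvcc command as above and profiled in Nsight Compute, this should allow the live-register count shown at each CALL instruction to be compared with the printed per-thread register count.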

[Image: Nsight Compute screenshot of the live-register spike at the CALL.ABS.NOINC instruction]

Thanks for submitting this. We are aware of this bug and have it filed in our system. It should be fixed in a future version.

Thanks for the reply, @jmarusarz! I’ll keep an eye out for the fix.

Has this been fixed? If so, in which version? Thanks.

Sorry. The issue hasn’t been fixed yet.