The SASS live-register count that Nsight Compute reports when profiling a kernel shows a massive (possibly incorrect) spike of more than 100 registers at a CALL.ABS.NOINC instruction, as demonstrated in the attached image. Profiling the code below in Nsight Compute reproduces this behavior. Any device function decorated with __noinline__ shows the same phenomenon, and nested __noinline__ calls have a multiplicative effect, with the register count climbing up to the cap of 255. This contradicts the output of nvcc when compiling the same code (nvcc -arch=sm_80 -Xptxas="-v" kernel.cu), which reports that the kernel uses 17 registers. Is the spike actually occurring (and if so, can it be removed), or is this a bug in Nsight Compute?
#include <stdint.h>
#include <cuda_runtime.h>

extern "C" {
__global__ void kernel(float* out) {
    // Global thread index
    uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    out[n] = atan2f(static_cast<float>(n), 2.0f);
}
}

int main(int argc, char const* argv[]) {
    float* d_ary;
    // Sizes are given in bytes: room for 32 floats, not 32 bytes
    cudaMalloc(&d_ary, 32 * sizeof(float));
    kernel<<<1, 32>>>(d_ary);
    float ary[32];
    cudaMemcpy(ary, d_ary, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_ary);
    return 0;
}
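
For the __noinline__ case described above, a minimal sketch along the following lines should show the same pattern. The names inner_helper, outer_helper, noinline_kernel and query_kernel_registers are hypothetical and exist only for illustration. The sketch assumes the compiler honors __noinline__ and emits real calls. It also queries the per-thread register count through cudaFuncGetAttributes, which gives a third number to compare against both the ptxas -v output and the live-register column in Nsight Compute. It does not settle which of the figures is correct.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: __noinline__ asks the compiler to emit a real call.
__device__ __noinline__ float inner_helper(float x) {
    return x * x + 1.0f;
}

// Hypothetical helper that nests a second __noinline__ call.
__device__ __noinline__ float outer_helper(float x) {
    return inner_helper(x) + 2.0f;
}

__global__ void noinline_kernel(float* out) {
    uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    out[n] = outer_helper(static_cast<float>(n));
}

// Returns the per-thread register count reported by the CUDA runtime,
// or -1 if the query fails (for example, when no device is available).
int query_kernel_registers() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, noinline_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}

int main() {
    constexpr int kThreads = 32;
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, kThreads * sizeof(float)) != cudaSuccess) {
        return 1;
    }

    noinline_kernel<<<1, kThreads>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();

    // Compare this value with ptxas -v and with the Nsight Compute source view.
    printf("registers per thread: %d\n", query_kernel_registers());

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Compiling this with the same command line as above (nvcc -arch=sm_80 -Xptxas="-v") and profiling it should make it possible to see whether the live-register count jumps at each nested call while the allocated register count stays unchanged.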