Too high register usage for a simple problem

I haven’t been doing CUDA for a while, and I'm now coming back to it. I was very surprised to see such inefficient resource usage for my simple routine. Consider this code:

__global__ void kernel(double* sd, const double* rk, const double* rw, int N, int ny, int slice_stride)
{
  __shared__ double rw_row[32];

  double sum = 0.0;
  for (int i = 32; i < N; i += 32) {
    if (threadIdx.x < 32)
      rw_row[threadIdx.x] = rw[blockIdx.x * N + i + threadIdx.x];

    for (int m = 0; m < 32; ++m)
      sum += rw_row[m] * rk[(i+m)*slice_stride + ny*blockIdx.x + threadIdx.x];
  }
  sd[blockIdx.x*N + threadIdx.x] = sum;
}

This is compiled with -O3 --fmad=true for sm_35. The kernel ends up using 44(!) registers for what amounts to three pointers, two offsets, and one accumulator. The disassembly reveals that just to do the += with FMAD it uses 22 registers (11 doubles). Setting --maxrregcount to anything below 44 causes spills to local memory (stack). What is going on here? 44 registers is roughly 75% of the available registers, for the simplest kernel ever?
Please shed some light on this.


The compiler is likely unrolling the inner loop (over m), which creates a lot more code and also needs more registers to hold the intermediate loads and partial results in flight.

You can suppress unrolling by placing “#pragma unroll 1” immediately before the loop. Unrolling is usually beneficial, but it can of course be tuned when it becomes a problem.
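Concretely, the pragma goes right before the loop whose unrolling you want to suppress. A sketch based on the kernel from the question (same signature and indexing, only the pragma added):

```cuda
__global__ void kernel(double* sd, const double* rk, const double* rw,
                       int N, int ny, int slice_stride)
{
  __shared__ double rw_row[32];

  double sum = 0.0;
  for (int i = 32; i < N; i += 32) {
    if (threadIdx.x < 32)
      rw_row[threadIdx.x] = rw[blockIdx.x * N + i + threadIdx.x];

    // Force the compiler to emit this loop as a real loop instead of
    // unrolling it 32x, so it does not keep dozens of rk loads and
    // double-precision partial products live in registers at once.
    #pragma unroll 1
    for (int m = 0; m < 32; ++m)
      sum += rw_row[m] * rk[(i+m)*slice_stride + ny*blockIdx.x + threadIdx.x];
  }
  sd[blockIdx.x*N + threadIdx.x] = sum;
}
```

The trade-off is that unrolling exists to expose instruction-level parallelism and hide memory latency, so capping it may lower throughput even as it lowers register usage; it is worth checking both the register count (nvcc --ptxas-options=-v) and the actual runtime before and after.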