Too high register usage for a simple problem

sergeyn · May 3, 2014, 10:27am

Hello,
I haven’t been doing cuda for a while, and now coming back to it. I was very surprised seeing super inefficient resource usage to get my simple routine running. Consider this code:

__global__ void kernel(double* sd, const double* rk, const double* rw, int N, int ny, int slice_stride)
{
  __shared__ double rw_row[32];

  double sum = 0.0;
  
  for (int i = 32; i < N; i += 32)
  {
    if (threadIdx.x < 32)
      rw_row[threadIdx.x] = rw[blockIdx.x * N + i + threadIdx.x];

    __syncthreads();

    for (int m = 0; m < 32; ++m)
    {
      sum += rw_row[m] * rk[(i+m)*slice_stride + ny*blockIdx.x + threadIdx.x];
    }
  }
  sd[blockIdx.x*N + threadIdx.x] = sum;
}

compiled with -O3 --fmad=true for sm_35. This code ends up using 44(!!!) registers for like 3 pointers, 2 offsets and one accumulation variable. Disassembly reveals that just to do += with FMAD it uses 22 (or 11double) registers. trying to use --maxrregcount to anything smaller 44 ends up using stack. What is going on here ? 44 regs is like 75% of available registers for the simplest kernel ever ???
Please shed some light on this.

Thanks!

SPWorley · May 4, 2014, 12:41am

The compiler is likely unrolling the loop at line 14, creating a lot more code which also needs more registers for the intermediate computation.

You can suppress unrolling by using “#pragma unroll 1” just before the loop. Unrolling is usually beneficial but of course it can be tuned if it’s a problem.

Topic		Replies	Views
Why is this loop using so many registers? CUDA Programming and Performance	7	981	March 3, 2023
Register usage problem CUDA Programming and Performance	7	2896	March 27, 2009
unexpected loop unrolling CUDA Programming and Performance	2	6722	May 3, 2010
reducing the number of used registers CUDA Programming and Performance	8	6306	September 22, 2009
register count explodes with CUDA 1.1 CUDA Programming and Performance	2	7295	December 12, 2007
Number of registers per thread massively increase on loops CUDA Programming and Performance	7	4200	May 1, 2010
Register allocator overload CUDA Programming and Performance	2	3210	February 10, 2009
Reducing register usage CUDA Programming and Performance	1	1121	October 3, 2009
reduce the no of register per thread used CUDA Programming and Performance	2	2920	October 15, 2009
Incomprehendible register usage, once again CUDA Programming and Performance	3	1962	February 5, 2009

Too high register usage for a simple problem

Related topics