register count explodes with CUDA 1.1


I have a kernel that worked well with CUDA 1.0. According to the cubin, it used 12 registers. When I compile the same kernel with CUDA 1.1, it uses 29 registers. I therefore have to specify the --maxrregcount option, because I want to run 336 threads per block. However, with this option the kernel is very slow. I guess it now spills registers to local memory.
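The register pressure described above can be checked with simple arithmetic. The following is a minimal sketch, assuming a device of compute capability 1.0 or 1.1 with 8192 registers per multiprocessor (the post does not name the GPU) and ignoring register allocation granularity. The helper names are hypothetical and only serve the illustration.

```cuda
// Host-only sketch; compiles with nvcc or any C++ compiler.
// ASSUMPTION: 8192 32-bit registers per multiprocessor
// (compute capability 1.0/1.1). The post does not state the device.
const int kRegsPerMultiprocessor = 8192;

// Registers one block needs, ignoring allocation granularity.
int registersPerBlock(int regsPerThread, int threadsPerBlock) {
    return regsPerThread * threadsPerBlock;
}

// True if a single block of this size fits in the register file.
bool fitsInRegisterFile(int regsPerThread, int threadsPerBlock) {
    return registersPerBlock(regsPerThread, threadsPerBlock)
           <= kRegsPerMultiprocessor;
}

// Largest per-thread register count that still lets one block fit.
int maxRegsPerThread(int threadsPerBlock) {
    return kRegsPerMultiprocessor / threadsPerBlock;
}
```

Under these assumptions, 12 x 336 = 4032 registers fit, 29 x 336 = 9744 do not, and the largest per-thread count that fits is 24. That would explain why the kernel cannot launch at 29 registers without a --maxrregcount cap.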

I have a hint: I keep a few arrays in constant memory and index them with compile-time constants (I did this so that I don’t have to pass too many parameters as kernel arguments). As I said, this worked great with CUDA 1.0, but I think the kernel now uses a register for each of those parameters. That’s the only reason I can think of why my kernel would require that many registers.

I played with compiler options (-Ox) but they don’t seem to have any effect whatsoever.
This is really annoying - any help would be greatly appreciated.

Thanks in advance,

Hi, could your register count have increased because of loop unrolling? Maybe a loop is being unrolled and its global memory fetches are issued in advance, which would definitely increase the register count. You could write

#pragma unroll 1

before each suspected for-loop to see if this was the reason.
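To make the suggestion concrete, here is a minimal sketch of where the directive goes. The kernel `sumRows` and the host helper `runSumRows` are hypothetical and not taken from the original poster's code. The pragma applies only to the loop that immediately follows it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread sums one row of n floats.
__global__ void sumRows(const float* in, float* out, int rows, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    // Ask the compiler not to unroll the loop that follows,
    // so that loads are not hoisted ahead across iterations.
    #pragma unroll 1
    for (int i = 0; i < n; ++i) {
        acc += in[row * n + i];
    }
    out[row] = acc;
}

// Hypothetical host helper: copies data in, launches, copies results back.
std::vector<float> runSumRows(const std::vector<float>& in, int rows, int n) {
    float* dIn = 0;
    float* dOut = 0;
    std::vector<float> out(rows);
    cudaMalloc((void**)&dIn, in.size() * sizeof(float));
    cudaMalloc((void**)&dOut, rows * sizeof(float));
    cudaMemcpy(dIn, in.data(), in.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    sumRows<<<(rows + 63) / 64, 64>>>(dIn, dOut, rows, n);
    cudaMemcpy(out.data(), dOut, rows * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Comparing the register count reported for the kernel with and without the pragma shows whether unrolling is the cause.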

UPDATE: I found the offending line:

/* note: */

#define k_incrho 1

__constant__ float f[10];

/* many registers: */

xx = fmaf(x, f[k_incrho], txx);

/* few registers: */

xx = txx + (x*f[k_incrho]);

I actually changed this line myself between my CUDA 1.0 and CUDA 1.1 builds, so I cannot say whether the problem came with CUDA 1.1 itself.

Am I doing something wrong with fmaf, or how should fmaf be used?
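One way to isolate the effect is to put the two variants into separate kernels and compare their register counts. The sketch below assumes the same `f` and `k_incrho` as above. The kernels `withFmaf` and `withMulAdd` and the helper `runVariant` are hypothetical. A possible explanation, which this thread does not confirm, is that devices of compute capability 1.x have no single-precision fused multiply-add instruction, so fmaf would have to be emulated in software there.

```cuda
#include <cuda_runtime.h>

#define k_incrho 1

__constant__ float f[10];

// Variant that calls fmaf explicitly (the "many registers" line).
__global__ void withFmaf(float x, float txx, float* xx) {
    *xx = fmaf(x, f[k_incrho], txx);
}

// Variant with a plain multiply and add (the "few registers" line).
__global__ void withMulAdd(float x, float txx, float* xx) {
    *xx = txx + (x * f[k_incrho]);
}

// Hypothetical host helper: sets f[k_incrho] = coeff, runs one variant
// with a single thread, and returns the result.
float runVariant(bool useFmaf, float x, float coeff, float txx) {
    float hostF[10] = {0.0f};
    hostF[k_incrho] = coeff;
    cudaMemcpyToSymbol(f, hostF, sizeof(hostF));

    float* dXx = 0;
    cudaMalloc((void**)&dXx, sizeof(float));
    if (useFmaf) {
        withFmaf<<<1, 1>>>(x, txx, dXx);
    } else {
        withMulAdd<<<1, 1>>>(x, txx, dXx);
    }
    float result = 0.0f;
    cudaMemcpy(&result, dXx, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dXx);
    return result;
}
```

Compiling with `nvcc --ptxas-options=-v` prints the register usage of each kernel separately, which avoids reading the cubin by hand. On newer toolchains and hardware both variants may compile to the same instruction, so any difference seen here is specific to the setup described in this thread.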

Indeed, this is a possibility. The cubin of the kernel compiled with CUDA 1.1 is about 8 times larger than the one compiled with CUDA 1.0. However, the #pragma unroll 1 directive didn’t help. I will investigate further…

But are there compiler flags for CUDA 1.1 that force nvcc to reproduce the CUDA 1.0 output?