I have a kernel that worked pretty well with CUDA 1.0. According to the cubin, it used 12 registers. When I compile the same kernel with CUDA 1.1, it uses 29 registers. I therefore have to specify the --maxrregcount option, because I need to run 336 threads per block. With this option, however, the kernel is very slow. I guess the values that no longer fit in registers are now spilled to local memory.
I have a hint: I use a few arrays in constant memory and index them with constants (I did this so that I don’t have to pass too many parameters as kernel arguments). As I said, it worked great with CUDA 1.0, but I think what happens now is that the kernel uses a register for each of those parameters. That’s the only reason I can think of why my kernel would require that many registers.
I played with compiler options (-Ox) but they don’t seem to have any effect whatsoever.
This is really annoying - any help would be greatly appreciated.
Hi, could it be that your register count increased because of loop unrolling? Maybe a loop is being unrolled and its global memory fetches are issued in advance, which would definitely increase the register count. You could write
#pragma unroll 1
before each suspected for-loop to see if this was the reason.
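To make the suggestion concrete, here is a minimal sketch of where the pragma goes. The kernel `strided_sum` and the host helper `run_strided_sum` are made up for illustration and are not from the original poster's code; the sketch also assumes a current CUDA toolkit with C++11 support rather than CUDA 1.1.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread sums n elements of `in`, spaced `stride` apart.
__global__ void strided_sum(const float* in, float* out, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    // Unroll factor 1 asks the compiler not to unroll this loop, so it
    // should not hoist several global loads (and their registers) ahead.
    #pragma unroll 1
    for (int i = 0; i < n; ++i) {
        acc += in[tid + i * stride];
    }
    out[tid] = acc;
}

// Hypothetical host helper: runs the kernel with one block of `threads` threads.
// host_in must hold at least threads * n elements.
std::vector<float> run_strided_sum(const std::vector<float>& host_in,
                                   int threads, int n)
{
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, host_in.size() * sizeof(float));
    cudaMalloc(&d_out, threads * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), host_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    strided_sum<<<1, threads>>>(d_in, d_out, n, threads);

    std::vector<float> host_out(threads);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(host_out.data(), d_out, threads * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

Compiling once with the pragma and once without it, and comparing the register counts reported for the kernel, should show whether unrolling is responsible.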
/* note: */
#define k_incrho 1
__constant__ float f[10];
/* many registers: */
xx = fmaf(x, f[k_incrho], txx);
/* few registers: */
xx = txx + (x*f[k_incrho]);
I actually changed this line between my CUDA 1.0 and CUDA 1.1 builds, so I cannot say whether the problem came with CUDA 1.1 itself.
Am I doing something wrong with fmaf, or how should fmaf be used?
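One way to narrow this down is to put the two variants into separate kernels and compare the register counts the compiler reports for each. The following is only a sketch: the kernel names `with_fmaf` and `with_mul_add` and the host helper `run_variant` are hypothetical, and it assumes a current CUDA toolkit, where `nvcc --ptxas-options=-v` prints per-kernel register usage.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <vector>

#define k_incrho 1
__constant__ float f[10];

// Variant 1: explicit fused multiply-add call.
__global__ void with_fmaf(const float* x, const float* txx, float* xx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) xx[i] = fmaf(x[i], f[k_incrho], txx[i]);
}

// Variant 2: plain multiply and add; the compiler may still fuse them.
__global__ void with_mul_add(const float* x, const float* txx, float* xx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) xx[i] = txx[i] + (x[i] * f[k_incrho]);
}

// Hypothetical host helper: sets f[k_incrho] = coeff and runs one variant.
// x and txx must have the same length.
std::vector<float> run_variant(bool use_fmaf, float coeff,
                               const std::vector<float>& x,
                               const std::vector<float>& txx)
{
    const int n = static_cast<int>(x.size());
    const size_t bytes = n * sizeof(float);

    float host_f[10] = {0.0f};
    host_f[k_incrho] = coeff;
    cudaMemcpyToSymbol(f, host_f, sizeof(host_f));

    float *d_x = nullptr, *d_txx = nullptr, *d_xx = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_txx, bytes);
    cudaMalloc(&d_xx, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_txx, txx.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    if (use_fmaf) with_fmaf<<<blocks, threads>>>(d_x, d_txx, d_xx, n);
    else          with_mul_add<<<blocks, threads>>>(d_x, d_txx, d_xx, n);

    std::vector<float> xx(n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(xx.data(), d_xx, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_txx);
    cudaFree(d_xx);
    return xx;
}
```

Both kernels compute the same value up to rounding, so any difference in the reported register counts comes from how the compiler handles the two expressions, not from the surrounding code.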
Loop unrolling is indeed a possibility: the cubin of the kernel compiled with CUDA 1.1 is about 8 times bigger than the one compiled with CUDA 1.0. However, the #pragma unroll 1 directive didn’t help. I will investigate further…
But are there compiler flags for CUDA 1.1 that force nvcc to reproduce the CUDA 1.0 output?