I have a kernel that worked well with CUDA 1.0. According to the cubin, it used 12 registers. When I compile the same kernel with CUDA 1.1, it uses 29 registers. Because I want to run 336 threads per block, I now have to specify the --maxrregcount option. With that option set, however, the kernel is very slow. I guess the values that no longer fit in registers are being spilled to local memory.
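To make the register pressure concrete, here is a rough sketch of the arithmetic. It assumes a device with 8192 registers per multiprocessor (the figure for compute capability 1.0/1.1 hardware; the actual device is not stated, so treat this as an assumption). The variable names are made up for illustration, and real allocation granularity may lower the usable limit further.

```cuda
#include <cstdio>

int main()
{
    // Assumption: 8192 registers per multiprocessor (compute capability 1.0/1.1).
    const int regsPerMultiprocessor = 8192;

    // Numbers taken from the description above.
    const int threadsPerBlock = 336;
    const int regsWithCuda10  = 12;
    const int regsWithCuda11  = 29;

    // Registers one block needs under each compiler version.
    printf("CUDA 1.0 build: %d registers per block\n",
           threadsPerBlock * regsWithCuda10);
    printf("CUDA 1.1 build: %d registers per block\n",
           threadsPerBlock * regsWithCuda11);

    // Rough upper bound on registers per thread for one block to fit at all.
    printf("Upper bound on registers per thread: %d\n",
           regsPerMultiprocessor / threadsPerBlock);
    return 0;
}
```

Under that assumption, 336 threads at 29 registers each exceed the register file, which would explain why the block only launches once the register count is capped.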
I have one clue: I keep a few arrays in constant memory and also access them via constants (I did this so that I don’t have to pass too many parameters as kernel arguments). As I said, this worked great with CUDA 1.0, but I think the CUDA 1.1 compiler now uses a register for each of those parameters. That’s the only reason I can think of why my kernel would need that many registers.
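For anyone trying to reproduce this, the pattern described is roughly the following. This is only a minimal sketch, not the real kernel: every name (`c_table`, `c_count`, `c_scale`, `scaleKernel`) is hypothetical, and it assumes that "constants" means scalar parameters kept in constant memory alongside the arrays.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical data: a lookup table and two scalar parameters, all kept in
// constant memory instead of being passed as kernel arguments.
__constant__ float c_table[64];
__constant__ int   c_count;
__constant__ float c_scale;

__global__ void scaleKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Parameters are read from constant memory at the point of use.
    if (i < c_count)
        out[i] = c_scale * c_table[i % 64];
}

int main()
{
    const int count = 336;   // one block of 336 threads, as in the question
    float scale = 2.0f;
    float table[64];
    for (int i = 0; i < 64; ++i)
        table[i] = (float)i;

    // Copy the parameters into constant memory before the launch.
    cudaMemcpyToSymbol(c_table, table, sizeof(table));
    cudaMemcpyToSymbol(c_count, &count, sizeof(count));
    cudaMemcpyToSymbol(c_scale, &scale, sizeof(scale));

    float *d_out = 0;
    cudaMalloc((void **)&d_out, count * sizeof(float));
    scaleKernel<<<1, count>>>(d_out);

    float h_out[count];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[3] = %f\n", h_out[3]);

    cudaFree(d_out);
    return 0;
}
```

On toolkits that support it, building with `nvcc --ptxas-options=-v` reports registers and local memory per kernel, so a build with and without `--maxrregcount` can be compared directly to see whether the cap is causing spills.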
I played with compiler options (-Ox) but they don’t seem to have any effect whatsoever.
This is really annoying - any help would be greatly appreciated.
Thanks in advance,