Handling resources

I have a fairly complex kernel that I would like to run. The problem is that I get the message “too many resources…” after having written only about a quarter of it. Compiling with the “--ptxas-options=-v” option shows that I am using 18 registers and ~12k of shared memory.

In my case the register count seems to be the troublesome part. I can’t go smaller than (512, 1, 1) in block size, so that gives me a limit of 16 registers per thread (16 x 512 = 8192, which is the limit, right?).
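The arithmetic above can be written down as a small host-side sketch. The function names are made up for illustration, and the model is deliberately simple: it assumes the register file is just divided evenly among the threads of one block and ignores any allocation granularity the hardware may apply.

```cuda
// Hypothetical helper: largest per-thread register count that still lets
// one block of the given size fit into the register file.
inline int maxRegsPerThread(int regsPerMultiprocessor, int threadsPerBlock)
{
    return regsPerMultiprocessor / threadsPerBlock;
}

// Hypothetical helper: registers one block needs in total.
inline int regsNeededPerBlock(int regsPerThread, int threadsPerBlock)
{
    return regsPerThread * threadsPerBlock;
}

// True if a single block fits into the register file under this simple model.
inline bool blockFits(int regsPerThread, int threadsPerBlock, int regsPerMultiprocessor)
{
    return regsNeededPerBlock(regsPerThread, threadsPerBlock) <= regsPerMultiprocessor;
}
```

With the numbers from this post, maxRegsPerThread(8192, 512) gives 16, while the 18 registers reported by ptxas need 18 x 512 = 9216 registers per block, so blockFits(18, 512, 8192) is false.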

So what can I do or keep in mind while coding to keep the register count low? I use some texture lookups that I guess I could move to global memory, but I would like to keep them if possible. Tips and tricks, anyone?

(Using CUDA 2.0 and an 8800 Ultra)

You should really go lower in the number of threads per block. If 75% of your kernel is not written yet, you will probably not be able to keep your register count at 16 or lower.

You might try to tell nvcc --maxrregcount=16, but that will make your kernel spill to (slow) local memory. If you really have no option to go lower than 512 threads, though, that might be your best solution.
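As a rough sketch of what capping registers looks like in practice: the compile line in the comment applies the limit to the whole file, while the __launch_bounds__ qualifier limits a single kernel. Note that __launch_bounds__ comes from toolkits newer than the CUDA 2.0 mentioned above, as far as I know, and that scaleKernel, BLOCK_SIZE and kernel.cu are placeholder names, not anything from the original kernel.

```cuda
// Whole-file cap, with verbose ptxas output so that register and local
// memory usage per kernel is printed (kernel.cu is a placeholder name):
//   nvcc --ptxas-options=-v --maxrregcount=16 kernel.cu

#define BLOCK_SIZE 512

// Per-kernel alternative: __launch_bounds__ promises the compiler that this
// kernel is never launched with more than BLOCK_SIZE threads per block, so
// it can limit register use for this kernel only.
__global__ void __launch_bounds__(BLOCK_SIZE)
scaleKernel(const float *in, float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;   // trivial stand-in for the real work
}
```

Either way, it is worth checking the ptxas output afterwards: a non-zero local memory figure means the compiler had to spill.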

Keeping register count low:

  1. Use shared memory. This is not always very successful, but you have 4k left, so 2 floats per thread could be stored there. Preferably store values that need to be around from the beginning to the end of your kernel.
  2. Recalculate values. I had a kernel where I would calculate and use an index at the beginning and use it again at the end. Calculating the index again at the end freed the register that would otherwise have held it, so it was available for other variables in the meantime.
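Both tips could look roughly like this in a kernel. This is only a sketch: lowRegisterKernel, flatIndex and the dummy loop in the middle are invented for illustration, and the compiler is free to keep values in registers anyway, so the effect has to be checked with --ptxas-options=-v.

```cuda
#define BLOCK_SIZE 512

// Hypothetical index helper, cheap enough to evaluate twice.
__host__ __device__ inline int flatIndex(int block, int blockSize, int thread)
{
    return block * blockSize + thread;
}

__global__ void lowRegisterKernel(const float *in, float *out, int n)
{
    // Tip 1: one shared-memory slot per thread for a long-lived value.
    // 512 threads x 4 bytes = 2 KB, half of the 4k that is still free.
    __shared__ float parked[BLOCK_SIZE];

    {
        // Beginning of the kernel: compute the index, use it, drop it.
        int i = flatIndex(blockIdx.x, blockDim.x, threadIdx.x);
        parked[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    }

    // Stand-in for the long, register-hungry middle part of the kernel.
    float acc = 0.0f;
    for (int k = 1; k <= 4; ++k)
        acc += k * 0.5f;

    // Tip 2: recompute the index at the end instead of keeping it alive.
    int i = flatIndex(blockIdx.x, blockDim.x, threadIdx.x);
    if (i < n)
        out[i] = parked[threadIdx.x] + acc;
}
```

Each thread only reads back its own slot of parked, so no __syncthreads() is needed here. Launch it with BLOCK_SIZE threads per block so that the shared array matches the block size.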