I have a fairly complex kernel that I would like to run. The problem is that I get the message “too many resources…” after having written only about a quarter of it. Compiling with the “--ptxas-options=-v” option reports that I am using 18 registers and ~12 KB of shared memory.
In my case it seems like the register count is the troublesome part. I can’t go smaller than (512, 1, 1) in block size, so that gives me a limit of 16 registers per thread (16 x 512 = 8192, which is the limit, right?).
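The arithmetic above can be sketched in a few lines of plain C. This is only a rough check under stated assumptions: the 8192-register limit is the figure quoted in this post, the helper names `max_regs_per_thread` and `block_fits` are made up for illustration and are not part of the CUDA API, and the hardware may round register allocation up to some coarser granularity that this sketch ignores.

```c
/* Hypothetical helpers, not part of the CUDA API. They only restate the
   arithmetic from the post: registers per thread times threads per block
   must not exceed the register limit (assumed here to be 8192). */

/* Largest per-thread register count that still fits in the budget. */
int max_regs_per_thread(int reg_limit, int threads_per_block)
{
    return reg_limit / threads_per_block;
}

/* Returns 1 if a block with this register usage fits, 0 otherwise. */
int block_fits(int reg_limit, int threads_per_block, int regs_per_thread)
{
    return regs_per_thread * threads_per_block <= reg_limit;
}
```

With the numbers from this post, `max_regs_per_thread(8192, 512)` gives 16, and `block_fits(8192, 512, 18)` gives 0, because 18 x 512 = 9216 exceeds 8192.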
So what can I do, or keep in mind, while coding to keep the register count low? I use some texture lookups that I guess I could move to global memory, but I would like to keep them if possible. Tips and tricks, anyone?
(Using CUDA 2.0 and an 8800 Ultra)