CUDA places arrays in registers only in some limited cases; otherwise it spills them to local memory, or I have to place them in shared memory myself. One of the conditions that forces a spill is indexing the array with a non-constant expression. Is this limitation common to all compute capabilities? Is there any way to store an array in register storage and have different threads access different elements of their respective arrays (even if it is specific to some subset of CUDA hardware)? I am happy to write such code in PTX, but there is no documented syntax that lets me "access the register whose index is found in another register". Is there a means of simulating this in a more long-hand fashion while still using only register storage?
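To make the situation concrete, here is a minimal sketch (kernel and variable names are illustrative, not from the original post) of the pattern in question — a small per-thread array indexed by a runtime value:

```cuda
// Illustrative kernel: `buf` is a small automatic array that the compiler
// could in principle keep in registers, but because `i` is only known at
// runtime, the compiler spills `buf` to local memory instead.
__global__ void dynamic_index(const int *in, int *out)
{
    int buf[8];
    for (int k = 0; k < 8; ++k)                 // constant indices: fine
        buf[k] = in[threadIdx.x * 8 + k];

    int i = in[threadIdx.x] & 7;                // runtime index: forces a spill
    out[threadIdx.x] = buf[i];
}
```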
I meant to say that I need to update the array too… so using texture or constant memory isn’t appropriate :-)
There is no way to do this on any compute capability; register numbers are always hard-coded in the assembly.
If your threads always access the same element of the array, then you can just use an automatic variable (a register) for this. Otherwise you have to go through shared memory.
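The shared-memory alternative mentioned above can be sketched as follows — a hypothetical example (the names, the 256-thread block size, and the 8-element slice are assumptions for illustration), where each thread gets its own slice of a shared array so that runtime indexing stays in on-chip storage:

```cuda
#define ELEMS_PER_THREAD 8
#define THREADS_PER_BLOCK 256

// Each thread owns a contiguous slice of the shared array, so it can
// index its own elements with a runtime value without any spill to
// local memory.
__global__ void shared_slice(const int *in, int *out)
{
    __shared__ int buf[THREADS_PER_BLOCK * ELEMS_PER_THREAD];
    int *mine = &buf[threadIdx.x * ELEMS_PER_THREAD];

    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS_PER_THREAD;
    for (int k = 0; k < ELEMS_PER_THREAD; ++k)
        mine[k] = in[base + k];

    int i = in[threadIdx.x] & (ELEMS_PER_THREAD - 1);  // runtime index, no spill
    out[blockIdx.x * blockDim.x + threadIdx.x] = mine[i];
}
```

Note that with this contiguous-slice layout, neighbouring threads hit the same shared-memory banks; an interleaved layout (element k of thread t at `buf[k * blockDim.x + t]`) avoids those bank conflicts at the cost of slightly uglier indexing.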
Thanks for the quick reply.
I had read the PTX manual and seen how registers are explicitly specified in each instruction. I surmised that the register target is encoded as part of the instruction, but wondered if there was a way to modify that target… the larger register file compared to shared memory made it an attractive target. Thanks for the help — it will stop me chasing down a blind alley and get me looking at other solutions.
Don’t underestimate the speed of local memory on Fermi devices. The L1 cache can make those accesses faster than you expect even when doing runtime variable indexing of local arrays.
On earlier devices, yes, local memory was nearly always to be avoided because of its low speed, but Fermi hides much of that pain simply through the L1 cache.
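If you go the local-memory route, it is worth checking how much actually gets spilled. One way (a sketch — `kernel.cu` is a placeholder filename) is to ask ptxas for a verbose resource report at compile time:

```shell
# Compile for Fermi and have ptxas report per-kernel resource usage,
# including register count and local-memory (lmem) bytes.
nvcc -arch=sm_20 -Xptxas -v -c kernel.cu
```

ptxas then prints lines reporting the registers and lmem bytes used per kernel, so you can see at a glance whether an array stayed in registers or was spilled.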