Hi all,
Is there any way to use registers so that they are visible to every thread in a warp/block, just like shared memory?
One more assumption: I already know the size needed by each warp/block in advance. For example, I need enough registers to hold 64 floats (64 x sizeof(float) bytes).
The main reason I ask is that random access from a warp to shared memory turns out to be very slow in my application, which requires an uncoalesced access pattern.
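To make the access pattern concrete, here is a minimal sketch of the kind of kernel I mean. It is not my actual code: the kernel name `randomSharedRead`, the index array `idx`, the helper `runCheck`, and the block size of 64 threads are all made up for illustration. Each warp owns a 64-float slice of shared memory, and every thread reads one element at a data-dependent index.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize      = 32;
constexpr int kFloatsPerWarp = 64;  // the 64 floats per warp from my example
constexpr int kThreads       = 64;  // assumed block size: 2 warps per block
constexpr int kWarpsPerBlock = kThreads / kWarpSize;

// Hypothetical kernel: each warp fills its own 64-float slice of shared
// memory, then every thread reads one element at a data-dependent index.
__global__ void randomSharedRead(const float *in, const int *idx, float *out)
{
    __shared__ float buf[kWarpsPerBlock * kFloatsPerWarp];

    const int lane  = threadIdx.x % kWarpSize;
    const int warp  = threadIdx.x / kWarpSize;
    const int gwarp = blockIdx.x * kWarpsPerBlock + warp;
    float *slice    = buf + warp * kFloatsPerWarp;

    // Each lane loads two of the warp's 64 floats.
    slice[lane]             = in[gwarp * kFloatsPerWarp + lane];
    slice[lane + kWarpSize] = in[gwarp * kFloatsPerWarp + lane + kWarpSize];
    __syncthreads();

    // The read I am worried about: the index is only known at run time.
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = slice[idx[tid]];
}

// Host-side check: returns the number of mismatching results.
int runCheck()
{
    const int blocks = 2;
    const int n      = blocks * kThreads;
    const int nIn    = (n / kWarpSize) * kFloatsPerWarp;

    std::vector<float> hIn(nIn), hOut(n);
    std::vector<int>   hIdx(n);
    for (int i = 0; i < nIn; ++i) hIn[i] = 0.5f * i;
    // Scattered indices in [0, 64); stands in for the application's pattern.
    for (int i = 0; i < n; ++i) hIdx[i] = (i * 37 + 11) % kFloatsPerWarp;

    float *dIn = nullptr, *dOut = nullptr;
    int   *dIdx = nullptr;
    cudaMalloc(&dIn,  nIn * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMemcpy(dIn,  hIn.data(),  nIn * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int),     cudaMemcpyHostToDevice);

    randomSharedRead<<<blocks, kThreads>>>(dIn, dIdx, dOut);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dIdx);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        const float expected = hIn[(i / kWarpSize) * kFloatsPerWarp + hIdx[i]];
        if (hOut[i] != expected) ++bad;
    }
    return bad;
}

int main()
{
    const int bad = runCheck();
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

What I would like is for the per-warp slice to live in registers instead of the `__shared__` array, while still letting any thread in the warp read any of the 64 values.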
Thanks