Is there any way to use registers so that they are visible to
all threads in a warp/block, just like
shared memory?
One more assumption is that I already know the size needed for each warp/block; for example, I need
The main reason for asking is that I found random access from a warp to shared memory to be very slow with the uncoalesced access pattern my application requires.