CUDA Block-level Shared Registers

Hi all,

Is there any way to use registers so that they are visible to all threads in a warp or block, the way shared memory is?

One more assumption: I already know the size needed per warp/block. For example, I need enough registers to hold 64 floats (64 * sizeof(float) bytes).

The main reason I ask is that random access from a warp to shared memory turned out to be very slow when the application requires an uncoalesced access pattern.