For a simulation, I first call a setup kernel to pre-generate the starting RNG states (one per thread), then launch my primary kernel, which loads its state into a local curandStatePhilox4_32_10_t variable and generates uniform random numbers from it during the simulation.
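For context, here is a minimal sketch of the kind of setup kernel I mean (identifiers like rngStates and seed are placeholders, not my exact code):

```cuda
#include <curand_kernel.h>

// Setup kernel: initialize one Philox state per thread and store it to
// global memory. rngStates must be cudaMalloc'd before launch.
__global__ void setupStates(curandStatePhilox4_32_10_t *rngStates,
                            unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Same seed, distinct subsequence per thread -> independent streams.
    curand_init(seed, tid, 0, &rngStates[tid]);
}
```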
In the process of trying to reduce register usage, I instead allocated one curandStatePhilox4_32_10_t per thread in shared memory (64 threads per block) and used that value for generation, roughly like this:
__shared__ curandStatePhilox4_32_10_t localState[THREADS_SMALL + 1];
localState[threadIdx.x] = rngStates[tid];
... curand_uniform(&localState[threadIdx.x]) ...
This change did not reduce my register usage per thread for that kernel, but it seems to be a bit faster than the more typical method, which looks like this:
curandStatePhilox4_32_10_t localState = rngStates[tid];
... curand_uniform(&localState) ...
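One detail worth noting for either variant: if the simulation kernel is launched more than once, the advanced state has to be written back to global memory at the end, otherwise the next launch replays the same sequence. A sketch of the register-local version with that write-back (the output buffer and loop count are hypothetical placeholders):

```cuda
#include <curand_kernel.h>

// Simulation kernel using a register/local copy of the RNG state.
__global__ void simulate(curandStatePhilox4_32_10_t *rngStates,
                         float *out, int drawsPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandStatePhilox4_32_10_t localState = rngStates[tid]; // load once

    float acc = 0.0f;
    for (int i = 0; i < drawsPerThread; ++i)
        acc += curand_uniform(&localState);  // uniform draw in (0, 1]

    out[tid] = acc;
    rngStates[tid] = localState;  // write back the advanced state
}
```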
So my naive questions are:
If I moved my RNG state from register space to shared memory, why did the compiler output showing register usage per thread not change? Should it have changed at all? Could more complicated compiler considerations be at play?
Other than potential bank conflicts from using shared memory, is there any reason to choose one approach over the other, given that each thread in a block may generate a different quantity of random numbers from its unique state? Each thread can generate anywhere from 20 to 1e6 random numbers during a simulation.