I use a variant of the Combined Tausworthe Generator in which the states are partially shared across a warp. This gives a fairly high-quality RNG (better than an LCG, anyway) while still taking only 32 words of shared memory per block. If you’re worried about different warps in the block occasionally colliding and ending up with the same random numbers, you’ll need to either use atomics or else just keep 1 word of state per thread.
Here’s the code:
__constant__ unsigned int shift1[4] = {6, 2, 13, 3};
__constant__ unsigned int shift2[4] = {13, 27, 21, 12};
__constant__ unsigned int shift3[4] = {18, 2, 7, 13};
__constant__ unsigned int offset[4] = {4294967294u, 4294967288u, 4294967280u, 4294967168u};
__shared__ unsigned int randStates[32];
__device__ unsigned int TausStep(unsigned int &z, int S1, int S2, int S3, unsigned int M)
{
    // One Tausworthe (LFSR) step: updates z in place and returns the new value.
    unsigned int b = (((z << S1) ^ z) >> S2);
    return z = (((z & M) << S3) ^ b);
}
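__device__ unsigned int randInt()
{
    // Each lane advances one state word; threadIdx.x & 3 picks its shift/mask parameter set.
    TausStep(randStates[threadIdx.x & 31], shift1[threadIdx.x & 3], shift2[threadIdx.x & 3],
             shift3[threadIdx.x & 3], offset[threadIdx.x & 3]);
    // Combine four neighbouring states, one per parameter set. This read assumes the whole warp
    // runs in lockstep; on Volta and later, a __syncwarp() before it is needed to guarantee that.
    return (randStates[(threadIdx.x) & 31] ^ randStates[(threadIdx.x + 1) & 31] ^
            randStates[(threadIdx.x + 2) & 31] ^ randStates[(threadIdx.x + 3) & 31]);
}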
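The code above doesn't show how randStates gets filled in or how randInt() is called, so here is a minimal sketch of one way to do it. Everything beyond the generator itself is my own assumption and not part of the original code: the names seedRandStates, fillRandom and sampleRandInts are hypothetical, the seed mixer (the MurmurHash3 32-bit finalizer) is an arbitrary choice, and I launch one warp per block to sidestep the cross-warp collisions mentioned earlier. I've also added __syncwarp() calls (CUDA 9 or later), which assume all 32 lanes of the warp call randInt() together. Each state word is forced to at least 128 so that the bits kept by the largest mask are never all zero.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Generator from the post, repeated so this file compiles on its own.
__constant__ unsigned int shift1[4] = {6, 2, 13, 3};
__constant__ unsigned int shift2[4] = {13, 27, 21, 12};
__constant__ unsigned int shift3[4] = {18, 2, 7, 13};
__constant__ unsigned int offset[4] = {4294967294u, 4294967288u, 4294967280u, 4294967168u};
__shared__ unsigned int randStates[32];

__device__ unsigned int TausStep(unsigned int &z, int S1, int S2, int S3, unsigned int M)
{
    unsigned int b = (((z << S1) ^ z) >> S2);
    return z = (((z & M) << S3) ^ b);
}

__device__ unsigned int randInt()
{
    unsigned int lane = threadIdx.x & 31;
    unsigned int k = threadIdx.x & 3;
    TausStep(randStates[lane], shift1[k], shift2[k], shift3[k], offset[k]);
    __syncwarp();  // assumption: added so neighbours' updates are visible before reading
    unsigned int r = randStates[lane] ^ randStates[(lane + 1) & 31] ^
                     randStates[(lane + 2) & 31] ^ randStates[(lane + 3) & 31];
    __syncwarp();  // assumption: keep the next call's write from overtaking these reads
    return r;
}

// Hypothetical helper: fill the 32 shared state words from a seed.
__device__ void seedRandStates(unsigned int seed)
{
    if (threadIdx.x < 32) {
        // Arbitrary integer mixer (MurmurHash3 finalizer); any decent hash would do.
        unsigned int h = seed + blockIdx.x * 32u + threadIdx.x;
        h ^= h >> 16; h *= 0x85ebca6bu;
        h ^= h >> 13; h *= 0xc2b2ae35u;
        h ^= h >> 16;
        if (h < 128u) h += 128u;  // keep the bits selected by every mask non-zero
        randStates[threadIdx.x] = h;
    }
    __syncthreads();  // reached by every thread in the block
}

// Hypothetical demo kernel: each thread writes perThread random words.
__global__ void fillRandom(unsigned int *out, int perThread, unsigned int seed)
{
    seedRandStates(seed);
    unsigned int base = (blockIdx.x * blockDim.x + threadIdx.x) * perThread;
    for (int i = 0; i < perThread; ++i)
        out[base + i] = randInt();
}

// Hypothetical host helper: returns an empty vector if any CUDA call fails.
std::vector<unsigned int> sampleRandInts(int blocks, int perThread, unsigned int seed)
{
    const int threads = 32;  // one warp per block, so no cross-warp collisions
    size_t n = (size_t)blocks * threads * perThread;
    std::vector<unsigned int> host(n);
    unsigned int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(unsigned int)) != cudaSuccess) return {};
    fillRandom<<<blocks, threads>>>(dev, perThread, seed);
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) host.clear();
    return host;
}
```

Calling sampleRandInts(2, 8, 1234u) from main() should give back 2 * 32 * 8 words. With more than one warp per block the warps would race on the same 32 state words, which is the collision case mentioned at the top.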
The biggest problem with this generator is that the compiler wants to assign it a rather large number of registers, presumably to hold all the shift coefficients. Ideally, it would read them through the constant memory cache instead…
Would it be possible to make the shift table constant volatile?