CURAND (device) seems to give correlated outputs among threads: how to avoid?

Hello all,

I’m using CURAND (4.0) on the device to populate a grid with agents, using a variation of the example from the CURAND manual. I have a 1024x1024 grid, launch 1024 blocks of 1024 threads, and add 2M agents by iterating twice. When I count the occupancy of the grid squares, I get more than 36 (my overflow level) on some squares, different ones on each run. If I instead use the Steve Park-Dave Geyer C (rngs) library routines on the host to populate my grid, I get no overflows as counted by the same routine. I’m using a different seed for each thread (1237*(blockIdx.x*blockDim.x+threadIdx.x)%LONG_MAX), and I’ve also tried varying the sequence number with the thread id, but that doesn’t change the problem. I need to generate random numbers on the device throughout my simulation, but I figure I should make sure I’m using CURAND correctly before proceeding further.
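For concreteness, here is a minimal sketch of a state-setup kernel along the lines described above; the kernel name and launch configuration are my reconstruction, not the actual code. Seeding each XORWOW state with a different but arithmetically related seed like this does not guarantee statistically independent streams, which is the suspected source of the correlation:

```cuda
#include <curand_kernel.h>
#include <climits>

// Hypothetical reconstruction of the setup described in the post:
// each thread derives its own seed from its global thread id.
// Related seeds like this do NOT guarantee independent sequences.
__global__ void setup_states_per_thread_seed(curandState *states)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long seed = (1237ULL * id) % LONG_MAX;  // per-thread seed (problematic)
    // seed varies per thread; sequence number and offset are both 0
    curand_init(seed, 0, 0, &states[id]);
}
```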

I’d appreciate some guidance from our resident gurus. :)

Thanks,

CRF

I haven’t double-checked for correlation yet, but setting up several million CURAND sequences seemed like overkill to me. I typically need a grid of random numbers corresponding to 1-4 million pixels. My approach seems sloppy, and wastes some memory, but perhaps not as much as retaining 4 million random sequences would.

I generate some 4096 sequences, then spawn a kernel to generate my 2048x2048 grid of random numbers, with each thread storing 1024 numbers from its sequence into the grid. After that, I spawn my main processing kernel with 4M total threads, where each thread grabs one of these random numbers.

I wish I could just get one random number in each of my main 4M threads, but that does not seem to be the best way to work with CURAND, AFAIK.
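A sketch of that workaround, assuming 4096 pre-initialized XORWOW states and a 2048x2048 float grid as described; the kernel name and grid layout are illustrative:

```cuda
#include <curand_kernel.h>

// Sketch of the workaround described above (sizes taken from the post):
// 4096 generator states, each thread writing 1024 values into the grid,
// for 4096 * 1024 = 4M numbers total (one per pixel of a 2048x2048 grid).
// Launch as e.g. fill_grid<<<4, 1024>>>(states, grid).
__global__ void fill_grid(curandState *states, float *grid)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 4095
    curandState local = states[id];                  // copy state to registers
    for (int i = 0; i < 1024; ++i)
        grid[id * 1024 + i] = curand_uniform(&local);
    states[id] = local;                              // save state for next call
}
```

The main processing kernel then just indexes into `grid` by its own global thread id.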

This sounds like your threads are not in fact starting with unique RNG states. In general, you should not assume that different seeds produce independent sequences from a random number generator, although you can often get away with it. (I have no idea how XORWOW behaves in this case.) The correct approach is to use the same seed for all your threads and instead pick the sequence number to be (blockDim.x*blockIdx.x + threadIdx.x). That starts each thread at a different point in the same sequence, separated by 2^67 steps, so the subsequences won’t overlap.
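A minimal sketch of this initialization pattern, plus a draw kernel matched to the original poster's 1024x1024 grid; kernel names and the occupancy-counting detail are my assumptions, not code from the thread:

```cuda
#include <curand_kernel.h>

// Recommended initialization: one shared seed, with the per-thread
// sequence number selecting non-overlapping subsequences
// (spaced 2^67 samples apart for XORWOW).
__global__ void setup_states_shared_seed(curandState *states,
                                         unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // same seed everywhere; sequence = thread id; offset = 0
    curand_init(seed, id, 0, &states[id]);
}

// Each thread then draws from its own state, e.g. to pick a cell
// in a 1024x1024 grid and bump an occupancy counter.
__global__ void place_agents(curandState *states, unsigned int *counts)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[id];                  // copy state to registers
    unsigned int cell = curand(&local) % (1024u * 1024u);
    atomicAdd(&counts[cell], 1u);                    // count occupancy safely
    states[id] = local;                              // save state back
}
```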

If you still see problems after that, then either the threads don’t actually have independent RNG states, or somehow on the GPU you are translating the output of the RNG into a cell index incorrectly.

Thanks, siebert, I tried this advice. On the first run it seemed to help, but on repeated runs I start to see overflows, and they persist in later runs. I don’t think it’s my way of turning random numbers into grid locations, or I’d see it with the other RNG as well. I checked that all CUDA memory is freed. Not sure what else to try…

I’m not too familiar with the Park-Geyer RNG, but if the comments in the implementation I found with Google are to be believed, the XORWOW algorithm is probably of higher quality:

http://www.cs.ucr.edu/~ciardo/teaching/CS177/source/rng.c

The period of a single stream of this generator is only 2^31 - 1, which is remarkably short for an RNG. (My code drew 2^41 random numbers from a single sequence just last night!) Period length is not the only measure of RNG quality, but you want something much larger than your planned usage. Since you only need 2 million numbers to populate your grid, you’re probably fine with either RNG, but roughly a thousand populations of 2 million draws each ((2^31 - 1)/2,000,000 ≈ 1073) would exhaust a single Park-Geyer sequence for a given seed. Presumably having 256 streams to draw from buys you some extra headroom beyond that…

RNG construction is a fascinating and tricky business, much like cryptography. :)