on generating random numbers in parallel

If I wanted to use a rand() function in a CUDA kernel, would I have to store a seed for every single thread that runs? Or is it mathematically OK to save memory by storing one seed per block and using (seed[blockId] + threadId) as each thread’s seed (with thread 0 updating the block’s seed each time)?

If you just use an ordinary linear congruential rand() function, which works by multiplying the seed by a constant, then rand(seed+threadIdx) will be strongly correlated with rand(seed). Whether that matters depends on your requirements.
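
To make the correlation concrete, here is a minimal sketch in plain C++ (host code only, no CUDA). The names lcg_next, first_output and neighbour_gap are hypothetical, and the multiplier and increment are the well-known "quick and dirty" constants from Numerical Recipes, chosen purely for illustration; the rand() in the question may use different ones.

```cpp
#include <cstdint>

// Hypothetical minimal linear congruential generator:
//   next = a * state + c  (mod 2^32, via unsigned wraparound)
// Constants are the Numerical Recipes "quick and dirty" values,
// used here only as an example.
inline std::uint32_t lcg_next(std::uint32_t state) {
    return 1664525u * state + 1013904223u;
}

// First value produced by thread `tid` when its seed is block_seed + tid.
inline std::uint32_t first_output(std::uint32_t block_seed, std::uint32_t tid) {
    return lcg_next(block_seed + tid);
}

// Difference between the first outputs of two neighbouring threads.
// Algebraically: a*(s+t+1) + c - (a*(s+t) + c) = a (mod 2^32),
// so the gap is the multiplier itself, whatever the seed or thread index.
inline std::uint32_t neighbour_gap(std::uint32_t block_seed, std::uint32_t tid) {
    return first_output(block_seed, tid + 1u) - first_output(block_seed, tid);
}
```

With these constants, neighbour_gap returns 1664525 for every seed and every thread index, so the per-thread streams are shifted copies of one another rather than independent sequences.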

There are some very good threads on generating random numbers in these forums. Some algorithms are complex, but others are very simple and fast.

Not sure if you’re implementing a random number generator from scratch or not, but in case you haven’t noticed, there’s a MersenneTwister example in the SDK, which its documentation describes as “one of the best available”.

A problem I have with both CURAND and the Mersenne Twister is that they want to set up a state for each thread. What if I have MILLIONS of threads? One would think that is common. There must be a more efficient way to share state vectors than one per thread. Perhaps some sort of synchronization, so that a state is attached to a core and all threads on that core share it?

Given the overhead of state-locking, I’m not sure this would actually be more efficient than just having a separate state for each thread. The Mersenne Twister RNG would be a bad choice in this case, as its state vector is an array of 624 32-bit integers. The XORWOW algorithm provided by CURAND has 6 integers + 16 more bytes for Box-Muller state information. If you have millions of threads, that’s only tens of megabytes of random number state storage. The sequence number argument of curand_init() lets you initialize all of these state vectors from a single seed value so that the sequences are statistically independent. There is some overhead to initialization, so you will want to reuse the same curandState_t structs across several kernel calls, if possible.
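
As a rough check on those storage figures, here is a small C++ sketch of the arithmetic. The per-state sizes are assumptions taken from the numbers quoted above (624 32-bit words for Mersenne Twister, 6 words plus 16 bytes for XORWOW), not from sizeof() on the real structs, which may be larger because of padding. The names are hypothetical.

```cpp
#include <cstdint>

// Per-thread state sizes, assumed from the figures quoted in this post.
constexpr std::uint64_t kMersenneTwisterStateBytes = 624u * sizeof(std::uint32_t);   // 2496 bytes
constexpr std::uint64_t kXorwowStateBytes = 6u * sizeof(std::uint32_t) + 16u;        // 40 bytes

// Total bytes of generator state when every thread owns its own state.
constexpr std::uint64_t total_state_bytes(std::uint64_t threads,
                                          std::uint64_t bytes_per_state) {
    return threads * bytes_per_state;
}

// One million threads:
//   XORWOW:           1,000,000 * 40   =    40,000,000 bytes (about 40 MB)
//   Mersenne Twister: 1,000,000 * 2496 = 2,496,000,000 bytes (about 2.5 GB)
static_assert(total_state_bytes(1000000u, kXorwowStateBytes) == 40000000u,
              "XORWOW estimate");
static_assert(total_state_bytes(1000000u, kMersenneTwisterStateBytes) == 2496000000u,
              "Mersenne Twister estimate");
```

Under these assumptions, per-thread XORWOW state for a million threads fits in tens of megabytes, while per-thread Mersenne Twister state would need gigabytes.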

(Incidentally, you might want to start a new thread on this topic rather than posting variations on your question in many old threads…)