CURAND: Independence of RND numbers. Are Host API generated RND numbers independent?

Hello,

Regarding the CURAND Host API, it is not clear how the numbers are generated on the device. It is OK if NVIDIA keeps that secret; that is just common business. But:

A common problem when generating random numbers in parallel is that the more parallel generators there are, the more likely their sequences are to overlap. That is why I don’t use the Device API.

The Host API is still a black box to me, and I don’t want the source to be published, but my question is:

Are the random numbers from the Host API independent?

It’s not a secret, don’t worry!

In the CUDA Toolkit 4.0 CURAND Guide, page 6, there is a formula that shows how pseudorandom results are arranged for the ordering CURAND_ORDERING_PSEUDO_DEFAULT.

“The result at offset n in global memory is from position (n mod 4096) * 2^67 + floor(n / 4096) in the XORWOW sequence.”

What’s happening is that the library first allocates space for the state of 4096 threads. Then it precomputes the starting state for each of the 4096 threads. Every thread starts from a common state that is computed from the seed, and that state is then advanced by 2^67 steps times the thread number. So thread 0 is advanced 0 steps, thread 1 is advanced 2^67 steps, thread 2 is advanced 2*2^67 steps, … To generate 4096 output values, each thread uses its state to get a single output value, then advances one step.
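To make the arrangement concrete, here is a small Python sketch of the formula quoted from the guide. It only does the index arithmetic and does not call CURAND; the names NUM_THREADS, STRIDE and xorwow_position are made up for this illustration.

```python
NUM_THREADS = 4096   # number of per-thread states, as stated in the guide
STRIDE = 2 ** 67     # spacing between the threads' starting positions

def xorwow_position(n: int) -> int:
    """Position in the XORWOW sequence of the result stored at output offset n."""
    thread = n % NUM_THREADS   # which of the 4096 threads produced this value
    step = n // NUM_THREADS    # how many values that thread had produced before it
    return thread * STRIDE + step

# Offsets 0..4095 are the first value of each thread; offset 4096 wraps
# back to thread 0, which by then has advanced one step.
for n in (0, 1, 2, 4095, 4096, 4097):
    print(n, xorwow_position(n))
```

So neighbouring outputs in memory come from positions 2^67 apart, and the output that sits 4096 offsets later comes from the very next position in the same thread's subsequence.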

Each thread can generate 2^67 values before it starts to overlap with the sequence from any other thread.
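As a rough check of that limit, the sketch below applies the same offset formula with tiny, made-up parameters (4 threads, stride 8) so the positions can be enumerated by brute force. The function names are hypothetical, and the accounting assumes the full XORWOW sequence does not wrap around within the range considered.

```python
def positions(total_outputs: int, num_threads: int = 4096, stride: int = 2 ** 67):
    """Sequence positions used by the first total_outputs results."""
    return [(n % num_threads) * stride + n // num_threads
            for n in range(total_outputs)]

def max_distinct_outputs(num_threads: int = 4096, stride: int = 2 ** 67) -> int:
    """Number of results that can be generated before any position is reused."""
    return num_threads * stride

# Toy version: 4 threads spaced 8 positions apart.
toy = positions(32, num_threads=4, stride=8)
print(len(set(toy)) == len(toy))   # all 32 positions are distinct

toy = positions(33, num_threads=4, stride=8)
print(len(set(toy)) == len(toy))   # the 33rd result reuses a position

print(max_distinct_outputs() == 2 ** 79)
```

With the real parameters that works out to 4096 * 2^67 = 2^79 results in total before thread 0 runs into the start of thread 1's subsequence.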

If you choose the CURAND_ORDERING_PSEUDO_SEEDED ordering, then the states are set up slightly differently. Each of the 4096 threads gets an initial state based on the seed and on the thread number. This is much faster than advancing by steps of 2^67, but it doesn’t give you a guarantee that the subsequences won’t overlap.

Hopefully this answers your question, let me know if I misinterpreted your question or if you want more details about anything.

One nice thing about the lack of a device-side linker is that the source code for all device functions has to be written out in the CUDA headers. You can see the implementation of the XORWOW algorithm in curand_kernel.h. (There are several generators in that file, so you have to read the comments to make sure you are looking at the right one.)

Thank you both.

Looks like I misunderstood the CURAND guide on that page; thanks for making that clear, Nathan! It was the exact answer to my question.

And yes, seibert, I’ll have a look at curand_kernel.h.