It’s not a secret, don’t worry!

In the CUDA Toolkit 4.0 CURAND Guide, page 6, there is a formula that shows how pseudorandom results are arranged for the ordering CURAND_ORDERING_PSEUDO_DEFAULT.

“The result at offset n in global memory is from position (n mod 4096) * 2^67 + floor(n / 4096) in the XORWOW sequence.”

What’s happening is that the library first allocates space for the state of 4096 threads. Then it precomputes the starting state for each of the 4096 threads. All the threads start from a common state that is computed from the seed, then advanced by 2^67 steps times the thread number. So thread 0 is advanced 0 steps, thread 1 is advanced 2^67 steps, thread 2 is advanced 2*2^67 steps, and so on. To generate 4096 output values, each thread uses its state to get a single output value, then advances one step. The next 4096 values are produced the same way, so the result at offset n comes from thread (n mod 4096) after it has advanced floor(n / 4096) steps.
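To make the index arithmetic concrete, here is a minimal sketch of the formula from the guide. It only computes positions, it does not generate any random numbers, and the names NUM_THREADS, STRIDE and locate are my own for illustration, not part of CURAND.

```python
NUM_THREADS = 4096   # number of thread states, per the guide's formula
STRIDE = 2 ** 67     # spacing between the threads' starting positions


def locate(n):
    """Return (thread, step, position) for the result at offset n."""
    thread = n % NUM_THREADS            # which thread produced this result
    step = n // NUM_THREADS             # how many steps that thread had taken
    position = thread * STRIDE + step   # position in the XORWOW sequence
    return thread, step, position


for n in (0, 1, 4095, 4096, 4097):
    thread, step, position = locate(n)
    print(f"offset {n}: thread {thread}, step {step}, position {position}")
```

Offsets 0 and 4096 both come from thread 0, one step apart, while offsets 0 and 1 sit next to each other in memory but are 2^67 positions apart in the sequence.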

Each thread can generate 2^67 values before it starts to overlap with the sequence from any other thread.
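As a rough sanity check on that limit, the sketch below works out how many results you could request in total before thread 0 runs into thread 1's starting position. The constants come from the formula in the guide; the names MAX_RESULTS and overlaps are hypothetical helpers of mine, not anything in the library.

```python
NUM_THREADS = 4096   # number of thread states, per the guide's formula
STRIDE = 2 ** 67     # steps between the starting positions of adjacent threads

# Largest number of results for which no thread takes more than STRIDE values.
MAX_RESULTS = NUM_THREADS * STRIDE   # equals 2**79


def overlaps(num_results):
    """True if thread 0 would have to produce more than STRIDE values."""
    # Thread 0 always produces the most values: ceil(num_results / NUM_THREADS).
    values_from_thread_0 = -(-num_results // NUM_THREADS)
    return values_from_thread_0 > STRIDE


print(MAX_RESULTS == 2 ** 79)
print(overlaps(MAX_RESULTS), overlaps(MAX_RESULTS + 1))
```

So with the default ordering you would need to generate more than 2^79 results from one generator before any overlap, which is far more than fits in any buffer.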

If you choose the CURAND_ORDERING_PSEUDO_SEEDED ordering, then the states are set up slightly differently. Each of the 4096 threads gets an initial state based on the seed and on the thread number. This is much faster than advancing by steps of 2^67, but it doesn’t give you a guarantee that the subsequences won’t overlap.

Hopefully this answers your question. Let me know if I misinterpreted it or if you want more details about anything.