How does a GPU enforce determinism with random numbers?

I’m using CUDA with torch, and I’m wondering how determinism is achieved in a parallel setting. I know that we initialize a random number generator with a seed, but isn’t the scheduling of warps (and so on) supposed to be non-deterministic? In that case, when multiple warps try to generate a random number, they would not necessarily get the same one from run to run.

You may want to study an actual CUDA random number generator code sample as well as the documentation for cuRAND.

Typically, each thread that intends to use the random number generation facility will receive its own “state”. To make a long story short, the state determines where in a particular sequence that thread, when it requests a value, will get its next “random” value.

As is usually the case, pseudo- and quasi-random number generators don’t produce truly “random” or non-deterministic numbers. They return a specific number from a specific point in a very long numerical sequence. That sequence usually has “desirable” statistical properties and a long period, i.e. a great many values can be requested from it before the pattern starts to repeat.

The “state” carried by each thread determines which number it gets from the sequence. Hopefully it is obvious that such a mechanism is parallelizable. By careful manipulation of the “state” supplied to each thread at the RNG initialization point, you can cause each thread to get:

  1. The exact same number as another thread (i.e. same sequence, same starting point).
  2. A different number from the same sequence (i.e. same sequence, different starting point).
  3. A number from a different sequence than another thread.
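The three cases above can be sketched in plain Python. This is only a toy model, not cuRAND's actual algorithm: the `ThreadState` class, its `sequence` and `offset` fields, and the `mix64` scrambling function are all made up for illustration. The point is that each "thread" owns a small state, and the value it gets depends only on that state.

```python
MASK64 = (1 << 64) - 1


def mix64(x):
    """SplitMix64-style scrambler: a fixed, deterministic 64-bit mixing function."""
    x &= MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    x ^= x >> 31
    return x


class ThreadState:
    """Hypothetical per-thread RNG state: which sequence, and where in it."""

    def __init__(self, sequence, offset):
        self.sequence = sequence
        self.offset = offset

    def next_value(self):
        # The value is a pure function of (sequence, offset).
        value = mix64(mix64(self.sequence) + self.offset)
        # Requesting a value advances the state to the next position.
        self.offset += 1
        return value


a = ThreadState(sequence=7, offset=0)
a_first, a_second = a.next_value(), a.next_value()

# Case 1: same sequence, same starting point -> same number as thread a.
b = ThreadState(sequence=7, offset=0)
assert b.next_value() == a_first

# Case 2: same sequence, different starting point -> a's *second* number.
c = ThreadState(sequence=7, offset=1)
assert c.next_value() == a_second

# Case 3: different sequence -> a different number at the same position.
d = ThreadState(sequence=8, offset=0)
assert d.next_value() != a_first
```

In a real kernel the initial state would typically be derived from something like the seed and the thread index, so every run hands each thread the same starting state.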

Typically, each time a thread requests a value, its state is updated during the process. This update then effectively identifies the next value, whenever it will be requested.

The process is fully deterministic and controllable. It does not depend on thread scheduling, only on the initial state and the actual sequence of requests that a thread makes.
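To make the scheduling point concrete, here is a small simulation, again only a sketch under assumptions. The "threads" are just list slots, the `simulate` helper and its seeding rule (`tid + 1`) are hypothetical, and the generator is Marsaglia's 32-bit xorshift rather than anything cuRAND uses. Each thread advances its own state, and the order in which threads run is shuffled differently per call.

```python
import random


def xorshift32(state):
    """One step of Marsaglia's 32-bit xorshift generator (state must be nonzero)."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state


def simulate(num_threads, draws_per_thread, schedule_seed):
    """Run all threads under a shuffled schedule and return each thread's draws."""
    # Per-thread initial state; deriving it from the thread id is an assumption.
    states = [tid + 1 for tid in range(num_threads)]
    outputs = [[] for _ in range(num_threads)]

    # Build an execution order in which threads are interleaved arbitrarily.
    schedule = [tid for tid in range(num_threads) for _ in range(draws_per_thread)]
    random.Random(schedule_seed).shuffle(schedule)

    for tid in schedule:
        states[tid] = xorshift32(states[tid])  # only this thread's state changes
        outputs[tid].append(states[tid])
    return outputs


# Two very different "schedules" give every thread the same numbers.
assert simulate(4, 5, schedule_seed=0) == simulate(4, 5, schedule_seed=12345)
```

Because no thread ever reads or writes another thread's state, the interleaving cannot change what any thread sees.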

P.S. If you are prepared to accept a minimal PRNG, the state need not be very big. Park-Miller needs only 32 bits.
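For reference, a minimal sketch of the Park-Miller "minimal standard" generator in Python, using the original multiplier 16807 and modulus 2^31 - 1. The function names are my own. The whole state is one integer in the range 1 to 2^31 - 2, so it fits in a 32-bit word.

```python
MODULUS = 2**31 - 1   # Mersenne prime used by Park-Miller
MULTIPLIER = 16807    # original "minimal standard" multiplier


def park_miller_step(state):
    """Advance the state once; the new state is also the output value."""
    return (MULTIPLIER * state) % MODULUS


def park_miller_after(seed, steps):
    """Return the state reached after `steps` updates starting from `seed`."""
    state = seed
    for _ in range(steps):
        state = park_miller_step(state)
    return state


# Published check value: starting from 1, the 10000th value is 1043618065.
assert park_miller_after(1, 10000) == 1043618065
```

On a GPU the multiplication needs a 64-bit intermediate (or Schrage's trick) to avoid overflow, but the stored per-thread state is still a single 32-bit value.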


Thanks, I understand it now.