Parallelizing Linear Congruential RNG and Reproducibility of Output

All,

I’m interested in parallelizing an LCG-based (linear congruential generator) random number generator.

I’m not interested in examining other types of RNG at this time, such as the Mersenne Twister.

It’s readily apparent to me that an LCG is quite easily parallelizable by generating subsequences of the LCG sequence on multiple processors.
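(By subsequences I mean block splitting: worker p jumps ahead p * block_len steps and then generates its block locally. Roughly what I have in mind, as an untested sketch; the MINSTD-style constants are placeholders, not a recommendation:)

    #include <stdint.h>
    #include <stdio.h>

    static const uint64_t A = 48271;      /* example multiplier          */
    static const uint64_t C = 0;          /* example increment           */
    static const uint64_t M = 2147483647; /* example modulus, 2^31 - 1   */

    /* One step of the LCG: x_{n+1} = (A*x_n + C) mod M */
    static uint64_t lcg_next(uint64_t x)
    {
        return (A * x + C) % M;
    }

    /* Jump ahead k steps in O(log k) time.  Composing k copies of the
     * affine map x -> A*x + C gives another affine map x -> ak*x + ck,
     * built up here by repeated squaring of the map's coefficients. */
    static uint64_t lcg_skip(uint64_t x, uint64_t k)
    {
        uint64_t a = A, c = C;    /* coefficients of the 2^i-step map   */
        uint64_t ak = 1, ck = 0;  /* coefficients of the k-step map     */
        while (k > 0) {
            if (k & 1) {          /* fold this power into (ak, ck)      */
                ak = (a * ak) % M;
                ck = (a * ck + c) % M;
            }
            c = ((a + 1) * c) % M;  /* square the map: (a,c) -> (a^2, (a+1)c) */
            a = (a * a) % M;
            k >>= 1;
        }
        return (ak * x + ck) % M;
    }

    int main(void)
    {
        uint64_t x = 12345, y = 12345;
        for (int i = 0; i < 1000; i++)  /* walk 1000 steps sequentially */
            x = lcg_next(x);
        y = lcg_skip(y, 1000);          /* reach the same point in one jump */
        printf("%llu %llu\n", (unsigned long long)x, (unsigned long long)y);
        return 0;
    }

Each worker would seed with lcg_skip(seed, p * block_len) and then iterate lcg_next locally.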

What is not apparent to me is how to implement such a parallelization so that the results are reproducible.

Consider the following hypothetical approach:

  1. Generate 100 billion random values
  2. CUDA determines the number of GPUs available to run subsequences, and each GPU generates 100 billion / (# of GPUs) values.
    NOTE: There would not necessarily be an equal number of values generated on each GPU, because 100 billion / (# of GPUs) may not be an integer (see the partitioning sketch just after this list).
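
To spell out the NOTE, the usual fix is to give the first N mod P workers one extra value, along these lines (hypothetical sketch; N and P stand for the total value count and worker count):

    #include <stdint.h>

    /* Hypothetical sketch: split N values over P workers so that every
     * value is generated exactly once even when N / P is not an integer.
     * The first (N % P) workers each take one extra value. */
    void partition(uint64_t N, int P)
    {
        uint64_t base  = N / P;
        uint64_t extra = N % P;
        for (int p = 0; p < P; p++) {
            uint64_t len   = base + ((uint64_t)p < extra ? 1 : 0);
            uint64_t start = (uint64_t)p * base
                           + ((uint64_t)p < extra ? (uint64_t)p : extra);
            /* worker p would generate the values [start, start + len) */
        }
    }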

If one applied the same random value generation approach on another computer with fewer GPUs, or if the number of available GPUs changed, then the number of values generated per GPU would differ.
I don’t think one would be guaranteed to have reproducible results under such a scenario.

Question:
Does anyone know of an approach to parallelizing LCG RNGs whose results stay reproducible when the number of available GPUs changes?

Barring that, is it possible to tell CUDA how many GPUs to use for a particular algorithm, i.e. to specify the number of subsequences to run for the LCG? While not optimal, I could at least get a reproducible LCG RNG by fixing the number of GPUs one HAS to use.

Dave

p.s. I’m very new to the whole CUDA programming topic and mostly new to parallel processing so please bear with me :)

Yes. Read this paper by van Meel et al.: http://arxiv.org/PS_cache/arxiv/pdf/0709/0709.3225v1.pdf There is a section on random numbers. (Note that they benchmark against the C stdlib rand(), not an optimized CPU implementation.)

Or download their source from:

http://www-old.amolf.nl/~vanmeel/mdgpu/index.html

I really cannot recommend an LCG for any serious application, though, not even for the one they report in that paper. I hope you are just doing this as an exercise.

You just need to run multiple host threads and have each host thread call cudaSetDevice() to run on a different device.
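
An untested sketch of that pattern follows. For the reproducibility question: fix the number of logical subsequences in the algorithm and round-robin them over whatever devices exist; the generated values then don’t depend on the GPU count. NUM_STREAMS, the kernel body, and the seeding/output handling are placeholders:

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    /* Fixed, hardware-independent number of logical LCG subsequences. */
    const int NUM_STREAMS = 1024;

    __global__ void generate_stream(int stream_id)
    {
        /* placeholder: jump the LCG ahead to this stream's fixed offset
           and write this stream's block of values */
    }

    static void worker(int device, int num_devices)
    {
        cudaSetDevice(device);  /* bind this host thread to one GPU */
        /* round-robin the fixed logical streams over the GPUs present */
        for (int s = device; s < NUM_STREAMS; s += num_devices)
            generate_stream<<<1, 1>>>(s);
        cudaDeviceSynchronize();
    }

    int main()
    {
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);
        std::vector<std::thread> threads;
        for (int d = 0; d < num_devices; d++)
            threads.emplace_back(worker, d, num_devices);
        for (std::thread &t : threads)
            t.join();
        return 0;
    }

Because each logical stream starts at a fixed jump-ahead offset, it produces the same values whether it runs on GPU 0 of one machine or GPU 3 of another.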