Making XORWOW parallel with curand_init: concerns about statistical properties (CURAND library)

Hi there,

I just switched to CUDA 4.0 and tried some of the new features, such as the CURAND library. I am especially interested in using the pseudo-RNG. It uses Marsaglia's XORWOW generator (http://www.jstatsoft.org/v08/i14/paper) with a simple parallelization scheme.

Let s be the seed and f the function that produces the next sequence element. According to the slides, for each thread the RNG is initialized with

thread 0: f^0(s)
thread 1: f^(2**67)(s)
thread 2: f^(2**(2*67))(s)

This is taken from http://www.nvidia.com/content/GTC-2010/pdfs/2216_GTC2010.pdf

The XORWOW generator has a period of 2**192 - 1. With thousands of threads, that period is exceeded by far: already thread 3 would start at position 2**(3*67) = 2**201, and 2**201 mod (2**192 - 1) = 2**9 = 512, so it would sit only 512 steps past thread 0. In the worst case you might have two threads producing perfectly correlated pseudo-random numbers. NVIDIA did some testing on the CURAND XORWOW, but I am not sure how. Of course, a single thread will have a pretty good RNG, passing most of the tests, but that says nothing about the parallelized RNG.

Three questions arise:
(1) Why did they use 67 in the exponent? Does that guarantee the desired statistical properties of the parallel XORWOW?
(2) Was the parallel XORWOW really tested properly, or was it just a single-thread test?
(3) Am I all wrong?

Thank you for your comments.

I think the slides are mistaken. The actual CURAND manual:

http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/docs/CURAND_Library.pdf

says that the RNG state after curand_init will be the state after (2**67) * sequence + offset steps. In that case, there are 2**125 distinct subsequences available, and therefore no chance of overlap in any imaginable usage.
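
To make this concrete, here is a minimal sketch of how the device API maps onto that formula (the kernel and variable names below are mine, not from the manual): each thread passes its global index as the sequence argument of curand_init, so its state lands (2**67) * tid steps into the stream.

#include <curand_kernel.h>

// Minimal sketch: each thread hands its global index to curand_init as the
// subsequence argument, so per the manual its state is placed
// (2**67) * tid + 0 steps into the XORWOW stream.
__global__ void setup_and_draw(unsigned long long seed, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    curandState state;
    curand_init(seed, (unsigned long long)tid, 0ULL, &state); // seed, subsequence, offset

    // Each thread now owns a disjoint subsequence of 2**67 elements and can
    // draw up to 2**67 values before reaching the next thread's stream.
    out[tid] = curand_uniform(&state);
}

int main(void)
{
    const int n = 4096;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    setup_and_draw<<<n / 256, 256>>>(1234ULL, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}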

I am going to assume that NVIDIA tested their RNG in parallel, but I have no personal knowledge of that. :)

Thanks for your interest in CURAND. Yes, sorry to say there is an error in the slide deck, and I can see your concern; if we were doing precisely what you (and the slide deck) suggest, it would certainly be a problem! For the initialization you had:

thread 0: f^0(s)
thread 1: f^(2**67)(s)
thread 2: f^(2**(2*67))(s)

And in fact this would result in wrapping around the (2**192)-1 sequence in a small number of threads. The actual initialization, for 4096 threads, is:

thread 0: f^0(s)
thread 1: f^(1*(2**67))(s)
thread 2: f^(2*(2**67))(s)
thread 3: f^(3*(2**67))(s)
...
thread 4095: f^(4095*(2**67))(s)

As you can see, with 4096 threads we are only (2**12)*(2**67) = 2**79 elements into the sequence, so we are not really pushing the limits of the 2**192 - 1 period. The value 2**67 was chosen to make it highly unlikely that one thread would over-run the next thread's sequence. As it happens, the statistical results with this value, and with our re-ordering of the sequence, are as good as with a single sequence.

In our testing, we use the TestU01 Crush test suites from the University of Montreal to test the statistical properties of XORWOW. In that testing, we use a thread count of 4096, as in the initialization sequence above. The most rigorous of the tests is Big Crush, which takes about 5 hours to run on an Intel Core i7 with a C2070 GPU. Typical results from Big Crush with random seeds are published in the CURAND manual that ships with the 4.0 release. We do pretty well, though there are occasional marginal results on a couple of the sub-tests.
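
For anyone who wants to try this kind of check themselves, here is a rough sketch of how interleaved parallel output can be fed to TestU01 (this is illustrative, not our actual harness; the buffer size, kernel names, and round-robin interleaving are choices of this example): the per-thread streams are merged into one stream and handed to Big Crush through TestU01's extern-generator interface.

#include <curand_kernel.h>
extern "C" {
#include "unif01.h"
#include "bbattery.h"
}

#define NTHREADS 4096
#define BATCH    (1 << 20)          /* samples per device refill */

static curandState *d_states;
static unsigned int *d_buf;
static unsigned int h_buf[BATCH];
static size_t pos = BATCH;          /* force an initial refill */

__global__ void init_states(curandState *s, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, tid, 0ULL, &s[tid]);
}

/* Each of the 4096 threads writes its draws round-robin into the buffer,
   so the host sees the interleaved parallel stream. */
__global__ void fill(curandState *s, unsigned int *buf, int per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = s[tid];
    for (int i = 0; i < per_thread; ++i)
        buf[i * NTHREADS + tid] = curand(&local);
    s[tid] = local;
}

/* TestU01 pulls one 32-bit value at a time; refill from the GPU as needed. */
static unsigned int next_bits(void)
{
    if (pos == BATCH) {
        fill<<<NTHREADS / 256, 256>>>(d_states, d_buf, BATCH / NTHREADS);
        cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);
        pos = 0;
    }
    return h_buf[pos++];
}

int main(void)
{
    cudaMalloc(&d_states, NTHREADS * sizeof(curandState));
    cudaMalloc(&d_buf, BATCH * sizeof(unsigned int));
    init_states<<<NTHREADS / 256, 256>>>(d_states, 1234ULL);

    unif01_Gen *gen = unif01_CreateExternGenBits((char *)"parallel XORWOW", next_bits);
    bbattery_BigCrush(gen);
    unif01_DeleteExternGenBits(gen);
    return 0;
}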

Regards
Paul Sidenblad
CUDA Math Library Team

I screwed up that slide, sorry about that. Paul is right, it's really f^(n*(2**67))(s) for thread n.

Cool. That resolves all doubts. Thank you very much for clearing that up.

Peter