Quick benchmark comparison of different parallel random number generators

Not an expert in random number generation but since I am using them so much these days and they are the main bottleneck in my simulations I though to do some naive basic tests.

In order to justify my choice of using the curand Philox method of generation of uniform float random number for a simulation I compared 4 different methods;

  1. curand default XORWOW
  2. curand Philox32
  3. curand MRG32k3a
  4. A custom method used by the open source X-ray simulation application MCPGU which uses a combination of two multiplicative linear congruential generators (MLCGs). Nickname RANECU.

source paper:

https://inte.upc.edu/downloads/cloneasy-1/badal2006computphyscommun175p440.pdf

The results were broken down into

  1. time to set up states
  2. time to generated “n” per state 32 bit float uniform random numbers
  3. memory required for state arrays

Philox vs custom MLCG 1,000,000 states generating 1009 number per state (saving a small subset to device memory)

num to do=1000000, num_reps=1009, total_number random number generated=1009000000

bytes used for curand states= 64000000

bytes used for custom states= 8000000

time curand setup=0.001000 

time custom setup=0.009000 

 time curand for generation of 1009000000 random uniform numbers= 0.021000 

 time custom for generation of 1009000000 random uniform numbers= 0.016000 

Total time for random number generation curand= 0.022000 

Total time for random number generation custom= 0.025000

Winner Philox

Philox vs custom MLCG 1,000,000 states generating 11009 number per state (saving a small subset to device memory)

num to do=1000000, num_reps=11009, total_number random number generated=11009000000

bytes used for curand states= 64000000

bytes used for custom states= 8000000

time curand setup=0.001000 

time custom setup=0.009000 

 time curand for generation of 11009000000 random uniform numbers= 0.224000 

 time custom for generation of 11009000000 random uniform numbers= 0.159000 

Total time for random number generation curand= 0.225000 

Total time for random number generation custom= 0.168000

Winner custom MLCGs(RANECU)

Philox vs custom 10,000,000 states generating 1009 numbers per state;

num to do=10000000, num_reps=1009, total_number random number generated=10090000000

bytes used for curand states= 640000000

bytes used for custom states= 80000000

time curand setup=0.011000 

time custom setup=0.093000 

 time curand for generation of 10090000000 random uniform numbers= 0.212000 

 time custom for generation of 10090000000 random uniform numbers= 0.178000 

Total time for random number generation curand= 0.223000 

Total time for random number generation custom= 0.271000

Winner curand Philox

The default curand XORWOW vs custom 1,000,000 states 11009 number generated per state;

num to do=1000000, num_reps=11009, total_number random number generated=11009000000

bytes used for curand states= 48000000

bytes used for custom states= 8000000

time curand setup=3.062000 

time custom setup=0.008000 

 time curand for generation of 11009000000 random uniform numbers= 0.039000 

 time custom for generation of 11009000000 random uniform numbers= 0.149000 

Total time for random number generation curand= 3.101000 

Total time for random number generation custom= 0.157000

Winner custom RANECU

curand MRG32k3a vs custom for 1,000,000 generationg 11009 numbers per state;

num to do=1000000, num_reps=11009, total_number random number generated=11009000000

bytes used for curand states= 72000000

bytes used for custom states= 8000000

time curand setup=0.348000 

time custom setup=0.008000 

 time curand for generation of 11009000000 random uniform numbers= 1.977000 

 time custom for generation of 11009000000 random uniform numbers= 0.147000 

Total time for random number generation curand= 2.325000 

Total time for random number generation custom= 0.155000

Winner custom RANECU

And finally curand XORWOW vs custom 10,000,000 states generating 1009 random numbers per state;

num to do=10000000, num_reps=1009, total_number random number generated=10090000000

bytes used for curand states= 480000000

bytes used for custom states= 80000000

time curand setup=62.973000 

time custom setup=0.088000 

 time curand for generation of 10090000000 random uniform numbers= 0.036000 

 time custom for generation of 10090000000 random uniform numbers= 0.137000 

Total time for random number generation curand= 63.009000 

Total time for random number generation custom= 0.225000

Winner (by a HUUUGE margin) custom RANECU

So (as Njuffa pointed out in an older post) the curand default XORWOW is the fastest for the actual generation of random numbers from a state, but the initialization time for setting up the state is by far the slowest approach, and does not seem feasible for more than 1,000,000 states.

The custom approach RANECU can be one of the fastest methods, with a longer setup than Philox, but a quicker generation method.

The curand MRG32k3a is the slowest for generation, but has a shorter initialization time than XORWOW.

Also the custom RANECU has the lowest memory requirements which may matter for some simulations.

Overall it seems to be a toss-up between RANECU and curand Philox.

When I look at the historgrams of a sample of outputs for all the generation methods the histograms look very similar. Not a robust test and I will dig into this more.

Anyone else have any additional insight into this topic or can recommend another CUDA random number generation library or approach which is competitive with Philox?

I would rather not “roll-my-own” random number generator unless there is a significant performance difference over Philox without a loss in pseudo-randomness.

Test platform : Windows 7 x64, CUDA 7.5, GTX Titan X

thrust has random number generation capability.

I have not benchmarked it though, and really know little about it.

https://thrust.github.io/doc/group__random.html

https://github.com/thrust/thrust/blob/master/examples/monte_carlo.cu

https://github.com/thrust/thrust/blob/master/examples/monte_carlo_disjoint_sequences.cu

As far as I recall, MRG32k3a is in essence also a combination of two MLCGs, so it is not clear to me why it is so much slower. However, one aspect missing from the above comparison is obviously the quality of each of the generators involved. Generators that produce number streams that are not very random can be made very fast!

Philox seems to have an interesting combination of widely desirable traits: relative fast random number generation, fast state management, small state. Since optimization is always an iterative process that often requires five or six optimization passes before results are achieved that can be consider close to optimal, I wonder whether NVIDIA could give an additional performance boost to this particular generator. I have not studied Philox in detail, so I am not sure what would be involved. In practical terms, CUDA users who find Philox useful in their work might want to file an RFE with NVIDIA requesting performance improvements to it.

Take a look at this paper, they had some interesting numbers:

There is a CUDA implementation of Park-Miller’s minimum PRNG at
http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/gp-code/random-numbers/cuda_park-miller.tar.gz

Comments as always welcome
Bill

what about using PRNGs described here: http://xorshift.di.unimi.it/#quality ?