Quick benchmark comparison of different parallel random number generators

CudaaduC · December 9, 2015, 10:21pm

I'm not an expert in random number generation, but since I am using RNGs so heavily these days and they are the main bottleneck in my simulations, I thought I would run some naive basic tests.

To justify my choice of the curand Philox generator for uniform float random numbers in a simulation, I compared 4 different methods:

curand default XORWOW
curand Philox32
curand MRG32k3a
A custom method used by the open source X-ray simulation application MC-GPU, which combines two multiplicative linear congruential generators (MLCGs). Nickname RANECU.

source paper:

https://inte.upc.edu/downloads/cloneasy-1/badal2006computphyscommun175p440.pdf
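For readers who have not seen RANECU, here is a minimal host-side C++ sketch of the generator using L'Ecuyer's published constants: two MLCGs, each advanced with Schrage's factorization so the arithmetic stays within 32-bit signed integers. The struct and function names are illustrative only, and this is not the kernel code that was benchmarked; a device version would carry the same arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Two 32-bit seeds, i.e. 8 bytes per state.
struct RanecuState {
    std::int32_t s1;  // must be in [1, 2147483562]
    std::int32_t s2;  // must be in [1, 2147483398]
};

// Advance both MLCGs one step and return a uniform double in (0, 1).
// Names are hypothetical; constants are L'Ecuyer's combined-MLCG values.
inline double ranecu_next(RanecuState& st) {
    assert(st.s1 >= 1 && st.s1 <= 2147483562);
    assert(st.s2 >= 1 && st.s2 <= 2147483398);

    // MLCG 1: s1 = 40014 * s1 mod 2147483563 (Schrage: q = 53668, r = 12211)
    std::int32_t k = st.s1 / 53668;
    st.s1 = 40014 * (st.s1 - k * 53668) - k * 12211;
    if (st.s1 < 0) st.s1 += 2147483563;

    // MLCG 2: s2 = 40692 * s2 mod 2147483399 (Schrage: q = 52774, r = 3791)
    k = st.s2 / 52774;
    st.s2 = 40692 * (st.s2 - k * 52774) - k * 3791;
    if (st.s2 < 0) st.s2 += 2147483399;

    // Combine the two streams and scale into (0, 1).
    std::int32_t z = st.s1 - st.s2;
    if (z < 1) z += 2147483562;
    return z / 2147483563.0;
}
```

The result is returned as a double on purpose: narrowing the largest outputs to a 32-bit float can round them to exactly 1.0f, which matters if the simulation takes a logarithm of (1 - u).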

The results were broken down into

time to set up states
time to generate “n” 32-bit float uniform random numbers per state
memory required for state arrays

Philox vs custom MLCG, 1,000,000 states generating 1009 numbers per state (saving a small subset to device memory)

num to do=1000000, num_reps=1009, total_number random number generated=1009000000

bytes used for curand states= 64000000

bytes used for custom states= 8000000

time curand setup=0.001000 

time custom setup=0.009000 

 time curand for generation of 1009000000 random uniform numbers= 0.021000 

 time custom for generation of 1009000000 random uniform numbers= 0.016000 

Total time for random number generation curand= 0.022000 

Total time for random number generation custom= 0.025000

Winner Philox

Philox vs custom MLCG, 1,000,000 states generating 11009 numbers per state (saving a small subset to device memory)

num to do=1000000, num_reps=11009, total_number random number generated=11009000000

bytes used for curand states= 64000000

bytes used for custom states= 8000000

time curand setup=0.001000 

time custom setup=0.009000 

 time curand for generation of 11009000000 random uniform numbers= 0.224000 

 time custom for generation of 11009000000 random uniform numbers= 0.159000 

Total time for random number generation curand= 0.225000 

Total time for random number generation custom= 0.168000

Winner custom MLCGs (RANECU)

Philox vs custom, 10,000,000 states generating 1009 numbers per state:

num to do=10000000, num_reps=1009, total_number random number generated=10090000000

bytes used for curand states= 640000000

bytes used for custom states= 80000000

time curand setup=0.011000 

time custom setup=0.093000 

 time curand for generation of 10090000000 random uniform numbers= 0.212000 

 time custom for generation of 10090000000 random uniform numbers= 0.178000 

Total time for random number generation curand= 0.223000 

Total time for random number generation custom= 0.271000

Winner curand Philox

The default curand XORWOW vs custom, 1,000,000 states generating 11009 numbers per state:

num to do=1000000, num_reps=11009, total_number random number generated=11009000000

bytes used for curand states= 48000000

bytes used for custom states= 8000000

time curand setup=3.062000 

time custom setup=0.008000 

 time curand for generation of 11009000000 random uniform numbers= 0.039000 

 time custom for generation of 11009000000 random uniform numbers= 0.149000 

Total time for random number generation curand= 3.101000 

Total time for random number generation custom= 0.157000

Winner custom RANECU

curand MRG32k3a vs custom, 1,000,000 states generating 11009 numbers per state:

num to do=1000000, num_reps=11009, total_number random number generated=11009000000

bytes used for curand states= 72000000

bytes used for custom states= 8000000

time curand setup=0.348000 

time custom setup=0.008000 

 time curand for generation of 11009000000 random uniform numbers= 1.977000 

 time custom for generation of 11009000000 random uniform numbers= 0.147000 

Total time for random number generation curand= 2.325000 

Total time for random number generation custom= 0.155000

Winner custom RANECU

And finally curand XORWOW vs custom, 10,000,000 states generating 1009 random numbers per state:

num to do=10000000, num_reps=1009, total_number random number generated=10090000000

bytes used for curand states= 480000000

bytes used for custom states= 80000000

time curand setup=62.973000 

time custom setup=0.088000 

 time curand for generation of 10090000000 random uniform numbers= 0.036000 

 time custom for generation of 10090000000 random uniform numbers= 0.137000 

Total time for random number generation curand= 63.009000 

Total time for random number generation custom= 0.225000

Winner (by a HUUUGE margin) custom RANECU

So (as Njuffa pointed out in an older post) the curand default XORWOW is the fastest for the actual generation of random numbers from a state, but the initialization time for setting up the state is by far the slowest approach, and does not seem feasible for more than 1,000,000 states.

The custom RANECU approach can be one of the fastest overall: its setup is slower than Philox, but its generation is quicker.
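To make the setup-versus-generation trade-off concrete, here is a small C++ sketch that estimates how many numbers per state are needed before RANECU's faster generation pays back its slower setup. It assumes generation time scales linearly with the number of draws per state, which the two Philox runs above roughly support; the struct and function names are illustrative, and the inputs are the figures reported for the 1,000,000-state, 11009-rep run.

```cpp
#include <cassert>

// Reported timings for one run (same time unit as the benchmark output).
struct Timing {
    double setup;       // time to set up the states
    double generation;  // time to generate all numbers
    double reps;        // numbers generated per state in that run
};

// Estimated total time for a different number of draws per state,
// assuming generation cost is linear in reps.
inline double total_time(const Timing& t, double reps) {
    return t.setup + reps * (t.generation / t.reps);
}

// Draws per state at which generator b (slower setup, faster generation)
// catches up with generator a.
inline double break_even_reps(const Timing& a, const Timing& b) {
    const double per_rep_a = a.generation / a.reps;
    const double per_rep_b = b.generation / b.reps;
    assert(per_rep_a != per_rep_b);
    return (b.setup - a.setup) / (per_rep_a - per_rep_b);
}

// Figures from the 1,000,000-state, 11009-rep Philox vs RANECU run.
const Timing philox{0.001, 0.224, 11009.0};
const Timing ranecu{0.009, 0.159, 11009.0};
```

With these inputs, break_even_reps(philox, ranecu) comes out at roughly 1355 draws per state, which is consistent with Philox winning at 1009 and RANECU winning at 11009.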

The curand MRG32k3a is the slowest for generation, but has a shorter initialization time than XORWOW.

Also the custom RANECU has the lowest memory requirements which may matter for some simulations.
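The per-state sizes follow directly from the byte totals printed above. A short C++ sketch of that arithmetic (function names are illustrative) gives 64 bytes per Philox state, 48 per XORWOW state, 72 per MRG32k3a state and 8 per RANECU state:

```cpp
#include <cassert>
#include <cstdint>

// Bytes per state = reported total bytes / number of states.
inline std::uint64_t bytes_per_state(std::uint64_t total_bytes,
                                     std::uint64_t num_states) {
    assert(num_states > 0);
    return total_bytes / num_states;
}

// Footprint in MiB for a given state count and per-state size.
inline double footprint_mib(std::uint64_t num_states, std::uint64_t per_state) {
    return static_cast<double>(num_states * per_state) / (1024.0 * 1024.0);
}
```

For example, footprint_mib(10000000, 64) is about 610 MiB for Philox against about 76 MiB for RANECU at the same state count.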

Overall it seems to be a toss-up between RANECU and curand Philox.

When I look at histograms of a sample of outputs from each generation method, they look very similar. This is not a robust test, and I will dig into it more.

Does anyone else have additional insight into this topic, or can anyone recommend another CUDA random number generation library or approach which is competitive with Philox?
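One step beyond eyeballing histograms is to put a number on them. The C++ sketch below computes a Pearson chi-square statistic for samples in [0, 1) against equal-width bins; the function name is illustrative, and it assumes the samples have been copied back to the host. It is still a weak check: a generator can pass it and fail badly on correlation tests, so a battery such as TestU01 would be the more serious follow-up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pearson chi-square statistic of `samples` (each expected in [0, 1))
// against a uniform distribution over `bins` equal-width bins.
inline double chi_square_uniform(const std::vector<double>& samples,
                                 std::size_t bins) {
    assert(bins > 0 && !samples.empty());

    // Build the histogram.
    std::vector<std::size_t> counts(bins, 0);
    for (double u : samples) {
        std::size_t b = static_cast<std::size_t>(u * bins);
        if (b >= bins) b = bins - 1;  // guard against u == 1.0
        ++counts[b];
    }

    // Sum (observed - expected)^2 / expected over all bins.
    const double expected = static_cast<double>(samples.size()) / bins;
    double chi2 = 0.0;
    for (std::size_t c : counts) {
        const double d = static_cast<double>(c) - expected;
        chi2 += d * d / expected;
    }
    return chi2;
}
```

For a uniform source, the statistic should land near bins - 1 (the degrees of freedom), with a standard deviation of about sqrt(2 * (bins - 1)); values far above that suggest a visibly non-flat histogram.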

I would rather not “roll-my-own” random number generator unless there is a significant performance difference over Philox without a loss in pseudo-randomness.

Test platform : Windows 7 x64, CUDA 7.5, GTX Titan X

Robert_Crovella · December 10, 2015, 12:03am

thrust has random number generation capability.

I have not benchmarked it though, and really know little about it.

https://thrust.github.io/doc/group__random.html

https://github.com/thrust/thrust/blob/master/examples/monte_carlo.cu

https://github.com/thrust/thrust/blob/master/examples/monte_carlo_disjoint_sequences.cu

njuffa · December 10, 2015, 12:25am

As far as I recall, MRG32k3a is in essence also a combination of two MLCGs, so it is not clear to me why it is so much slower. However, one aspect missing from the above comparison is obviously the quality of each of the generators involved. Generators that produce number streams that are not very random can be made very fast!
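For reference, MRG32k3a as published by L'Ecuyer combines two multiple recursive generators of order 3 rather than two first-order MLCGs, so each step works on six state words with moduli just below 2^32. The host-side C++ sketch below uses 64-bit integer arithmetic with illustrative names; it is not cuRAND's implementation, and it does not by itself establish why the measured generation time is so much higher.

```cpp
#include <cassert>
#include <cstdint>

// Three previous values for each of the two component recurrences,
// stored oldest first.
struct Mrg32k3aState {
    std::uint32_t s1[3];  // each in [0, m1), not all zero
    std::uint32_t s2[3];  // each in [0, m2), not all zero
};

// Advance one step and return a uniform double in (0, 1).
inline double mrg32k3a_next(Mrg32k3aState& st) {
    constexpr std::int64_t m1 = 4294967087LL;  // 2^32 - 209
    constexpr std::int64_t m2 = 4294944443LL;  // 2^32 - 22853
    constexpr double norm = 1.0 / 4294967088.0;  // 1 / (m1 + 1)

    assert(st.s1[0] != 0 || st.s1[1] != 0 || st.s1[2] != 0);
    assert(st.s2[0] != 0 || st.s2[1] != 0 || st.s2[2] != 0);

    // Component 1: x_n = (1403580 * x_{n-2} - 810728 * x_{n-3}) mod m1
    std::int64_t p1 = (1403580LL * st.s1[1] - 810728LL * st.s1[0]) % m1;
    if (p1 < 0) p1 += m1;
    st.s1[0] = st.s1[1];
    st.s1[1] = st.s1[2];
    st.s1[2] = static_cast<std::uint32_t>(p1);

    // Component 2: y_n = (527612 * y_{n-1} - 1370589 * y_{n-3}) mod m2
    std::int64_t p2 = (527612LL * st.s2[2] - 1370589LL * st.s2[0]) % m2;
    if (p2 < 0) p2 += m2;
    st.s2[0] = st.s2[1];
    st.s2[1] = st.s2[2];
    st.s2[2] = static_cast<std::uint32_t>(p2);

    // Combine the two components and scale into (0, 1).
    return (p1 > p2) ? (p1 - p2) * norm : (p1 - p2 + m1) * norm;
}
```

Compared with RANECU's two words and two short multiply-mod steps, each draw here touches six words and needs wider intermediate products, which is at least a plausible contributor to the cost on a GPU.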

Philox seems to have an interesting combination of widely desirable traits: relatively fast random number generation, fast state management, small state. Since optimization is always an iterative process that often requires five or six optimization passes before results are achieved that can be considered close to optimal, I wonder whether NVIDIA could give an additional performance boost to this particular generator. I have not studied Philox in detail, so I am not sure what would be involved. In practical terms, CUDA users who find Philox useful in their work might want to file an RFE with NVIDIA requesting performance improvements to it.

mfatica · December 10, 2015, 12:48am

Take a look at this paper, they had some interesting numbers:

wlangdon · December 11, 2015, 1:23pm

There is a CUDA implementation of Park-Miller’s minimum PRNG at
http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/gp-code/random-numbers/cuda_park-miller.tar.gz

Comments as always welcome
Bill
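For context, the Park-Miller "minimal standard" generator is a single MLCG with modulus 2^31 - 1. The C++ sketch below assumes the original multiplier 16807 and uses a 64-bit product for the modular step; the function names are illustrative, and the code in the tarball above may use a different multiplier or a different way of avoiding overflow.

```cpp
#include <cassert>
#include <cstdint>

// One step of x_{n+1} = 16807 * x_n mod (2^31 - 1).
// The multiplier 16807 is an assumption (Park and Miller's original choice).
inline std::uint32_t park_miller_next(std::uint32_t x) {
    assert(x >= 1 && x < 2147483647u);  // state must stay in [1, 2^31 - 2]
    return static_cast<std::uint32_t>((16807ULL * x) % 2147483647ULL);
}

// Map the integer state to a double in (0, 1).
inline double park_miller_uniform(std::uint32_t x) {
    return x / 2147483647.0;
}
```

Park and Miller's own correctness check is that starting from a seed of 1 and applying the step 10000 times yields 1043618065, which is a convenient test for any port.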

BulatZiganshin · January 6, 2016, 10:52am

What about using the PRNGs described here: xoshiro/xoroshiro generators and the PRNG shootout?
