Not an expert in random number generation but since I am using them so much these days and they are the main bottleneck in my simulations I though to do some naive basic tests.
In order to justify my choice of using the curand Philox method of generation of uniform float random number for a simulation I compared 4 different methods;
- curand default XORWOW
- curand Philox32
- curand MRG32k3a
- A custom method used by the open source X-ray simulation application MCPGU which uses a combination of two multiplicative linear congruential generators (MLCGs). Nickname RANECU.
source paper:
https://inte.upc.edu/downloads/cloneasy-1/badal2006computphyscommun175p440.pdf
The results were broken down into
- time to set up states
- time to generated “n” per state 32 bit float uniform random numbers
- memory required for state arrays
Philox vs custom MLCG 1,000,000 states generating 1009 number per state (saving a small subset to device memory)
num to do=1000000, num_reps=1009, total_number random number generated=1009000000
bytes used for curand states= 64000000
bytes used for custom states= 8000000
time curand setup=0.001000
time custom setup=0.009000
time curand for generation of 1009000000 random uniform numbers= 0.021000
time custom for generation of 1009000000 random uniform numbers= 0.016000
Total time for random number generation curand= 0.022000
Total time for random number generation custom= 0.025000
Winner Philox
Philox vs custom MLCG 1,000,000 states generating 11009 number per state (saving a small subset to device memory)
num to do=1000000, num_reps=11009, total_number random number generated=11009000000
bytes used for curand states= 64000000
bytes used for custom states= 8000000
time curand setup=0.001000
time custom setup=0.009000
time curand for generation of 11009000000 random uniform numbers= 0.224000
time custom for generation of 11009000000 random uniform numbers= 0.159000
Total time for random number generation curand= 0.225000
Total time for random number generation custom= 0.168000
Winner custom MLCGs(RANECU)
Philox vs custom 10,000,000 states generating 1009 numbers per state;
num to do=10000000, num_reps=1009, total_number random number generated=10090000000
bytes used for curand states= 640000000
bytes used for custom states= 80000000
time curand setup=0.011000
time custom setup=0.093000
time curand for generation of 10090000000 random uniform numbers= 0.212000
time custom for generation of 10090000000 random uniform numbers= 0.178000
Total time for random number generation curand= 0.223000
Total time for random number generation custom= 0.271000
Winner curand Philox
The default curand XORWOW vs custom 1,000,000 states 11009 number generated per state;
num to do=1000000, num_reps=11009, total_number random number generated=11009000000
bytes used for curand states= 48000000
bytes used for custom states= 8000000
time curand setup=3.062000
time custom setup=0.008000
time curand for generation of 11009000000 random uniform numbers= 0.039000
time custom for generation of 11009000000 random uniform numbers= 0.149000
Total time for random number generation curand= 3.101000
Total time for random number generation custom= 0.157000
Winner custom RANECU
curand MRG32k3a vs custom for 1,000,000 generationg 11009 numbers per state;
num to do=1000000, num_reps=11009, total_number random number generated=11009000000
bytes used for curand states= 72000000
bytes used for custom states= 8000000
time curand setup=0.348000
time custom setup=0.008000
time curand for generation of 11009000000 random uniform numbers= 1.977000
time custom for generation of 11009000000 random uniform numbers= 0.147000
Total time for random number generation curand= 2.325000
Total time for random number generation custom= 0.155000
Winner custom RANECU
And finally curand XORWOW vs custom 10,000,000 states generating 1009 random numbers per state;
num to do=10000000, num_reps=1009, total_number random number generated=10090000000
bytes used for curand states= 480000000
bytes used for custom states= 80000000
time curand setup=62.973000
time custom setup=0.088000
time curand for generation of 10090000000 random uniform numbers= 0.036000
time custom for generation of 10090000000 random uniform numbers= 0.137000
Total time for random number generation curand= 63.009000
Total time for random number generation custom= 0.225000
Winner (by a HUUUGE margin) custom RANECU
So (as Njuffa pointed out in an older post) the curand default XORWOW is the fastest for the actual generation of random numbers from a state, but the initialization time for setting up the state is by far the slowest approach, and does not seem feasible for more than 1,000,000 states.
The custom approach RANECU can be one of the fastest methods, with a longer setup than Philox, but a quicker generation method.
The curand MRG32k3a is the slowest for generation, but has a shorter initialization time than XORWOW.
Also the custom RANECU has the lowest memory requirements which may matter for some simulations.
Overall it seems to be a toss-up between RANECU and curand Philox.
When I look at the historgrams of a sample of outputs for all the generation methods the histograms look very similar. Not a robust test and I will dig into this more.
Anyone else have any additional insight into this topic or can recommend another CUDA random number generation library or approach which is competitive with Philox?
I would rather not “roll-my-own” random number generator unless there is a significant performance difference over Philox without a loss in pseudo-randomness.
Test platform : Windows 7 x64, CUDA 7.5, GTX Titan X