random number generation using curand curandStatePhilox4_32_10_t, use local registers or __shared__

For a simulation, I first call a kernel to pre-generate the starting states (one per thread), then launch my primary kernel, which loads that state into a local curandStatePhilox4_32_10_t value and generates uniform random numbers from it during the simulation.

In the process of trying to reduce register usage, I instead allocated one curandStatePhilox4_32_10_t value per thread (64 per block) in shared memory and used that value for generation in a manner like this:

__shared__ curandStatePhilox4_32_10_t localState[THREADS_SMALL+1];
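Filled out, the shared-state variant looks something like this (a sketch; the kernel name and output array are illustrative, THREADS_SMALL and rngStates come from my setup):

```cuda
#include <curand_kernel.h>

#define THREADS_SMALL 64

// Sketch: each thread copies its pre-generated Philox state into shared
// memory and generates from there instead of from a local variable.
__global__ void simulateShared(curandStatePhilox4_32_10_t *rngStates, float *out)
{
    __shared__ curandStatePhilox4_32_10_t localState[THREADS_SMALL + 1];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    localState[threadIdx.x] = rngStates[tid];            // load state from global

    float r = curand_uniform(&localState[threadIdx.x]);  // generate from shared
    out[tid] = r;

    rngStates[tid] = localState[threadIdx.x];            // write state back
}
```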



This shift did not reduce my register usage per thread for that kernel, but it seems to be a bit faster than the more typical method, which looks like this:

curandStatePhilox4_32_10_t localState = rngStates[tid];
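For reference, the full conventional pattern is roughly the following (a sketch with hypothetical kernel names):

```cuda
#include <curand_kernel.h>

// One-time setup kernel: one Philox state per thread.
__global__ void setupStates(curandStatePhilox4_32_10_t *rngStates,
                            unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Same seed, unique subsequence per thread -> independent streams.
    curand_init(seed, tid, 0, &rngStates[tid]);
}

// Simulation kernel: copy the state to a local (register/local-memory)
// value, generate from it, then store it back for the next launch.
__global__ void simulate(curandStatePhilox4_32_10_t *rngStates, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandStatePhilox4_32_10_t localState = rngStates[tid];

    float r = curand_uniform(&localState);
    out[tid] = r;

    rngStates[tid] = localState;
}
```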

So my naive questions are:

  1. If I shifted my random variable from register space to shared why did the compile output which shows register usage per thread not change? Should it have changed at all? Could there be more complicated compiler considerations going on?

  2. Other than potential bank conflicts from the use of shared memory is there any reason to choose one approach over another assuming each thread in a block may generate a different range of random numbers from its unique state? Each thread can generate anywhere from 20 to 1e6 random numbers during simulation.

What is sizeof(curandStatePhilox4_32_10_t)? If it is small, any anticipated difference in register use may be masked by second-order effects (+/- 2 registers) during register allocation that can occur with any small code change. We might also consider that storage in shared memory still requires some sort of pointer to access that data, plus some temporary register storage as portions of the state are accessed. Depending on any uses of __launch_bounds__(), it is also possible that the compiler immediately makes use of any registers freed up by the change to reduce the amount of re-computation or spilling, thus reducing dynamic instruction count and delivering a slight speedup.

You would have to analyze the SASS to really find out what happened. The results of this code change may also play out differently with different versions of the tool chain.

LOL, I was just looking at that and was surprised to find that the size is a whopping 64 bytes each.

ran some tests simulating 2^21 photons

Test #1, using about 6000 bytes of shared memory per thread block and 59 registers:

Device bytes allocated=293601840 

Time rand gen= 0.003000 

Running simulation of 2097152 photons saving starting and exit data position, direction,weight,time,medium and detector accumulations..

Time MC kernel = 0.404000

Test #2, using 1920 bytes of shared memory and 57 registers, with the same input parameters and seed:

Device bytes allocated=293601840 

Time rand gen= 0.002000 

Running simulation of 2097152 photons saving starting and exit data position, direction,weight,time,medium and detector accumulations..

Time MC kernel = 0.432000

So even though the first implementation uses both more registers and 3 times as much shared memory, it is over 6% faster than the approach using fewer registers and less shared memory.

I did run a number of tests with different seeds, and that difference seems to be about right across my sample.

This was run on Maxwell, and since Kepler has less shared memory per SM, I would guess that the shared-memory approach would limit occupancy on that architecture. This application has to run on both architectures, so I will probably settle for the standard approach, but these are interesting results.

Why is that curandStatePhilox4_32_10_t type so big? Need to look into that…

So this is what that type is made of:

struct curandStatePhilox4_32_10 {
    uint4 ctr;
    uint4 output;
    uint2 key;
    unsigned int STATE;
    int boxmuller_flag;
    int boxmuller_flag_double;
    float boxmuller_extra;
    double boxmuller_extra_double;
};

The curandState_t is 48 bytes in size, and the real big boy is curandStateSobol32_t, which is 140 bytes.

This partially explains why many older Monte Carlo simulations seem so fast: they use less robust random number generation methods with much smaller state.

In my situation I do need a robust generation method so will stick with curandStatePhilox4_32_10_t for now.

It is difficult to tell why all this data has to be stored as part of the PRNG state, and in particular why two ‘int’ variables had to be set aside for the Box-Muller flags. From the name I would guess those are just booleans and wonder whether that information could have been rolled into ‘boxmuller_extra’ and ‘boxmuller_extra_double’.

As for the register use, keep in mind that the physical register allocation on the device has a certain architecture-specific granularity. Any increase in the numbers of registers used that does not change the number of allocation units is going to be neutral in terms of occupancy. That is likely the case here.

Something similar came up recently https://devtalk.nvidia.com/default/topic/869662/how-does-a-gpu-enforce-determinism-with-random-numbers-/?offset=3#4650443

So let me repeat: if you are prepared to accept a minimal PRNG, the state need not be very big.
Park-Miller needs only 32 bits.



If minimal PRNG state for short per-thread random number sequences is desired, I would suggest using a 32-bit counter that is passed through a hash function with good avalanching properties. Basically this would be a "Philox-lite" approach. It will likely be somewhat slower than Park-Miller due to the cost of the hash function, but the PRNG sequence is easily partitioned between threads due to the trivial skip implementation. It can also avoid the problem that LCGs like Park-Miller provide reduced randomness in the low-order bits.

Given that the primary bottleneck in my simulations is the random number generation I am going to spend a bit of time on finding a faster method for my specific use case.

The fastest generation method using curand() is the default XORWOW, but the initialization time of the states can be quite long.

Could you provide a link (or links) to a hash function implementation with good avalanching properties?

Thanks to both wlangdon and njuffa for your advice on this matter.

I have used the mix() function from http://burtleburtle.net/bob/hash/doobs.html for more than a dozen years, as the quality seemed right and the license made it compatible with any kind of project.

There used to be a website that carefully compared (and visualized) the avalanching properties of various hash functions. I cannot find that site now but I do recall that this was among the best they had tested. My own testing showed the avalanching properties to be good. Given that GPUs provide fast integer multiplies, you may want to search for a faster hash function that utilizes those.

I cannot stress enough that PRNGs with a tiny 32-bit state should only be used to generate very short sequences. For long running simulations making use of long sequences one would want lots of PRNG state as one of the basic design criteria.

[Later:] I tried to find the website with the hash function comparisons, but while I was able to find some sites with somewhat similar results and presentations, I was unable to find the one with the result visualization of avalanching tests that I recall. However, I did find some old notes of mine indicating that "Murmur3" might be a good alternative to the Jenkins hash function in terms of quality, performance, and software license. I was looking at implementing a cuckoo hash at the time, which required two different hash functions.

In case anyone wants more on Park-Miller, they wrote it up for Communications of the ACM:

S. K. Park and K. W. Miller, "Random Number Generators: Good Ones Are Hard to Find," Communications of the ACM, Vol. 31, No. 10 (October 1988), pp. 1192-1201.
Practical and theoretical issues are presented concerning the design, implementation, and use of a good, minimal standard random number generator that will port to virtually all systems.