What's a good random number generator?

Since rand() cannot be used in CUDA, what is a good random number generator that provides uniform distribution over a certain range of numbers that I could use?

There’s a Mersenne Twister example in the SDK. That’s a good quality generator, though I’m not sure if the example shows how to use the results in device code.

I think that if you do a search on the forum you can find a lot of them. there has been a discussion about it a while ago. Can’t find it that quick but there is something…

Check it out.

The mersenne twister19937 is the best, but if you want something simple

and possible without double:

take the Wichmann/Hill (the period is 10^12).

There are 3 componants (integer) in the “diffusion”.


x(i)=171*x(i-1) mod 30269

y(i)=172*y(i-1) mod 30307

z(i)=170*z(i-1) mod 30323

and u(i)=((x(i)/30269) + (y(i)/30307) + (z(i)/30323)) mod 1

u(i) is an uniform in [0,1]

Thanks, I suppose the above could be put into a kernel function that can be run and used within device code? And would I basically increment ‘i’ every time I want a random number?

You would not have any i in your code I think, I would think it looks something like this:

unsigned int x=100,y=20,z=2500;

repeat this part for every uniform randnr you want:

x *=171;





z%= 30323;

float u=((((float)x)/30269.0)  + (((float)y)/30307.0)  + (((float)z)/30323.0)) % 1.0;

But you have to make sure each thread gets its own set of initial x,y,z values, otherwise all threads

Thanks for the clarification. So each thread would have the same initial values (int x=100,y=20,z=2500)? Otherwise…what?

I think DenisR said: each thread must have different initial values.

If each thread starts with the same (x,y,z), then they all use the same serie of numbers. Not very random.

If the (x,y,z) is in global memory and the update of (x,y,z) is atomic in each call to “device_rand()” from threads, then threads get different values as they should. But atomic update of three values forces serial behaviour.

With “local rand”, if the period is 10^12 and you have N threads, then logically – to remain unique – each thread would start with local (x,y,z) that is 10^12/N steps forward from the start of the previous thread.

Thanks for the good info.

I’d just like to add that Wichmann has a new 4-sequence algorithm that is better than the 3-sequence mentioned here and the paper mentions parallelism.

The paper’s title is Generating good pseudo-random numbers 2006.

Can you please post a code snippet? I can’t access the paper.

Also, does anyone know a simple generator that only needs one variable? I’d like to put it into smem. I have half a mind to use randu(), but there must exist something as simple but a bit better. I honestly don’t need something good.

I consider this to be a simple generator with good properties that uses far less state variables as compared to Mersenne Twister.

Here is the 32-bit version

     // Assuming the width of integers are 32-bits

      ix = 11600 * ( ix % 185127 ) - 10379 * ( ix / 185127 );

      iy = 47003 * ( iy % 45688 ) - 10479 * ( iy / 45688 );

      iz = 23000 * ( iz % 93368 ) - 19423 * ( iz / 93368 );

      it = 33000 * ( it %  65075 ) - 8123 *  ( it / 65075 );


      if ( ix < 0 ) ix += 2147483579;

      if ( iy < 0 ) iy += 2147483543;

      if ( iz < 0 ) iz += 2147483423;

      if ( it < 0 )  it  += 2147483123;


      float W = ( ((float) ix)/2147483579.0 + ((float) iy)/ 2147483543.0

                              + ((float) iz)/2147483423.0 + ((float) it)/2147483123.0 );


      return fmod(W, 1.0);

Monte-carlo simulation example makes use of the MT generator - i believe (not yet checked… but dats what I used for my MC)

You can always launch a kernel to generate the random numbers in global memory required for your simulation (could be quite huge)


then launch your kernel that uses the random numbers (you can index them sequentialy…)

If you originally wanted to use rand() why do you insist on having a good one to replace it?

Or to say it differently: If rand() is good enough for you, any of the previous ones are sever overkill.

See http://en.wikipedia.org/wiki/Linear_congruential_generator for the trivial ones that you get when using rand().

“man rand” also lists the one suggested by POSIX.

Either way, don’t forget to check that the random number generator you use is actually suitable (and also that it is not overkill) for what you do.

Thanks. A linear congruence generator is exactly what I needed.

Linear congruential generators are already implemented in CUDA

See for example

Note that the generated random sequence is exactly that of the rand48 algorithm provided by the C standard library, but the implementation has been parallelized in a clever way.

I’ve modified this implementation so that random numbers can be generated, one at a time, within a CUDA device function. The code is attached.

Thanks. The defintion of ‘uint2’ is missing. P.S. is it ok to call cudaFree() right after a kernel call? Will cudaFree block until the kernel’s finished? Just wondering (I know you’d normally have a memcpy anyway).

The intention is that the file “rand48.cu” is included into your file “MyCudaCode.cu”, so that the functions it defines are accessible. The nvcc compiler has uint2 as a built in type.

The CUDA documentation does not list cudaFree as asynchronous (section in my copy), so I expect that it blocks.


oh ok, thanks

The one I posted is an excellence combination of 4 Linear congruential generators. The great thing is that it is still accurate after having each thread having its own seed.

And the best thing is that I doesn’t need to write to global mem, just registers.

MT’s algorithm is complicated.

And yes, I was using it for a CUDAfied poisson generator that has a different mean parameter for every 3D point in the data

This code is very useful, thanks for posting.

There’s a small typo at the end:

device inline int rand48_nextFloat(Rand48 &r);

should be:

device inline float rand48_nextFloat(Rand48 &r);