random number generators

Hi all,

What do you people use for a random number generator on the GPU? We want each thread to call a random number generator. If we use the Mersenne Twister, it needs to store 624 integers of state, which each thread needs to access, making it slow. If we use something that doesn’t require any memory (like Park-Miller), we also have problems, because that uses a mod (%) operator, which is slow. Any suggestions?

By the way, we can’t use the CUDA implementation of the Mersenne Twister, because we want one generator per thread.

Skip down to George Marsaglia’s post here:


Whether that’s “good enough” is all a matter of degrees. :-)

As an aside, Park-Miller might work in this version:


Hi there,

For one of my projects on the GPU I used a multiply-with-carry random number generator, with each thread having its own seed and its own multiplier. See my webpage “parallel random number generation”, which also contains a link to files listing suitable multipliers. As I say there, I haven’t run extensive tests on how independent all the generated numbers are.

To generate 32-bit random numbers, you can use `unsigned long long int`s for the multiplications. If you only need 24-bit numbers and the period is sufficient, you can use the 24-bit multiply instructions to speed things up.

With variables like:

unsigned long long int x;
unsigned int tmpx;
unsigned int c;
unsigned int a;

appropriately initialized, looping over:

x = x * a + c;
c = x >> 32;
x = x & 0xffffffffull;
tmpx = (unsigned int)x;

should generate a sequence in tmpx.


This Marsaglia PRNG looks perfect for my needs; however, I’ve tried implementing it and I’m having a problem with storing the state. I’ve written the following code in a kernel:


#define MAR_RAND_MAX 4294967296ull

uint index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

/* load PRNG state for each thread from global arrays initialised with files from here:

   Rx, Rc and Ra have types unsigned long long int *, unsigned int * and unsigned int * respectively */
unsigned long long int x = Rx[index];

unsigned int c = Rc[index];

unsigned int a = Ra[index];

//generate random numbers within range -1.0,1.0 in a loop (not shown)

x = x * a + c;

c = x >> 32;	

x = x & 0xffffffffull;

unsigned int tmpx = (unsigned int)x;

float randVal = (float)tmpx / (float)MAR_RAND_MAX;

randVal = randVal * 2.0f - 1.0f;

//after the loop write updated PRNG state to global memory

Rx[index] = x;

Rc[index] = c;

This code works fine in device emulation mode, but when running on the GPU the Rx and Rc arrays do not seem to be updated with the new x and c values. Can anyone please tell me what I’m doing wrong?

I’m using CUDA 2 beta 2 on an 8800 GTX.



Actually, after more exploration, it seems that the write to global memory is occurring, but the multiplication:

x = x * a + c;

does not seem to happen. Is multiplication of 64-bit ints supported in compute capability 1.0?

Hi there,

Sorry, I only just came across your post. 64-bit ints seem to work fine on my 8800 GTS 640MB (compute capability 1.0) under CUDA 1.1 on 64-bit Linux. I haven’t tried under CUDA 2 yet. Have you checked the PTX output? Does it seem to be doing a 64-bit multiply?


I’ve seen a strange error on my Tesla (compute capability 1.0) lately, which might be connected to what you’re experiencing.

I had some 64-bit integer arithmetic calculated this way:

result = some_calculation + 1;

And after doing some debugging I found out that the +1 was never applied. However, rewriting it to

result = some_calculation;
result = result + 1;

solved the problem. In emulation mode both versions worked fine, but on the board only the second one gave the correct result. I’m not really into PTX files, so I couldn’t check whether the error is introduced at compile time or whether the device has a problem with this calculation.

In another thread I was told that 64-bit arithmetic is not really predictable on compute capability 1.0, so I think we have to live with it. Maybe you should try rewriting that line in some way.


Thanks for your responses. It turned out that the problem was caused by a bug elsewhere in the kernel (writing multiple times to the same location in global memory within the same kernel). I still don’t fully understand why this affected the PRNG, but after fixing the bug, the random number generation code works perfectly.

VrahoK: I did notice in the PTX once that the compiler converted the 64-bit int to a 32-bit one before doing the multiply. Unfortunately I can’t remember what caused this to occur. Did you try this?

result = (unsigned long long int)some_calculation + 1ull;



Hi everyone ! :)

I’m interested in coding a random function, and this one seems to be a good way to do it, but I don’t understand how to initialize the input variables x, c and a. :oops:

Thanks for your time and help!

Hi there,

I declared three regular arrays, and then loaded the multipliers from the file into one of them. I used a random number generator to generate the initial x’s and c’s, taking care to ensure that each c was less than its multiplier. These arrays are then copied to the card, and each kernel receives pointers to them to load its values.

See e.g. the code here.

If you have any issues, in particular if you find too much correlation between the streams, please do let me know!

Thanks a lot,

Thank you very much, I will let you know as soon as I can ;)

edit: I’ve tried the algorithm on a video and it works without any strong correlation between two frames.

Hi! As I said before, I’ve applied it to a video stream and I’ve noticed that some random pixel values are constant. I’ve tried using different txt files but the result is still the same.

Is that what you wanted to know Steven?

Hi there,

Sorry, I can’t quite visualize the test you’re performing and so I’m not quite sure what you mean by some pixel values being constant.

What I meant by correlation was roughly that, if you write down two sequences belonging to two different multipliers, that knowing one gives you information about the other. An extreme example would be something like:

sequence 1: … 1 4 3 6 9 2 2 …
sequence 2: … 3 6 9 2 2 5 4 …

where the values are just offset and hence the two sequences aren’t really that different at all. This is bad in many applications, since ideally we’d like each sequence to be truly independent (in the mathematical sense).