What do you people use for a random number generator for the GPU? We want each thread to call a random number generator. If we use Mersenne Twister, it needs to store 624 integers, which each thread needs to access, making it slow. If we use something that doesn’t require any memory (like Park-Miller) then we also have problems, because that uses a mod (%) operator which is slow. Any suggestions?

By the way, we can’t use the CUDA implementation of the Mersenne Twister, because we want one generator per thread.

For one of my projects on the GPU I used a multiply-with-carry random number generator, with each thread having both its individual seed and its individual multiplier. See my webpage “parallel random number generation”, which also contains a link to files listing suitable multipliers. As I say there, I haven’t run extensive tests into how independent all the generated numbers are.

To generate 32-bit random numbers, you can use “long long int” variables to do the multiplications. If you only need 24-bit numbers and the period is sufficient, you can use some of the 24-bit multiply instructions to speed things up.

With variables like:

unsigned long long int x;
unsigned int tmpx;
unsigned int c;
unsigned int a;
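
With those variables, one multiply-with-carry step can be sketched roughly as below. This is plain C so it can be checked on the host; the same arithmetic could go in a device function. The struct and the name mwc_next are made up for illustration and are not taken from the webpage.

```c
#include <stdint.h>

/* Illustrative per-thread state: x holds the current 32-bit value in a
   64-bit variable, c is the carry, a is the per-thread multiplier. */
typedef struct {
    uint64_t x;
    uint32_t c;
    uint32_t a;
} mwc_state;

/* One multiply-with-carry step: t = a*x + c, then the low 32 bits of t
   become the new x and the high 32 bits become the new carry c.
   Returns the new 32-bit value. */
static uint32_t mwc_next(mwc_state *s)
{
    /* x, a and c are all below 2^32, so a*x + c fits in 64 bits. */
    s->x = s->x * s->a + s->c;
    s->c = (uint32_t)(s->x >> 32);
    s->x &= 0xffffffffull;
    return (uint32_t)s->x;
}
```

In a kernel, each thread would load its own x, c and a once, call the step in a loop, and write x and c back at the end.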

This Marsaglia PRNG looks perfect for my needs. However, I’ve tried implementing it and I’m having a problem with storing the state. I’ve written the following code in a kernel:

//2^32, written as a float constant
#define MAR_RAND_MAX 4294967296.0f
uint index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
/*load PRNG state for each thread from global arrays initialised with files from here:
*http://www.ast.cam.ac.uk/~stg20/cuda/random/index.html
*
*Rx, Rc and Ra variables have type unsigned long long int *, unsigned int * and unsigned int *
*respectively
*/
unsigned long long int x = Rx[index];
unsigned int c = Rc[index];
unsigned int a = Ra[index];
//generate random numbers within range -1.0,1.0 in a loop (not shown)
x = x * a + c;
c = x >> 32;
x = x & 0xffffffffull;
unsigned int tmpx = (unsigned int)x;
float randVal = (float)tmpx / (float)MAR_RAND_MAX;
//use a float literal (1.0f) so the expression is not promoted to double
randVal = (randVal * 2.0f - 1.0f);
//after the loop write updated PRNG state to global memory
Rx[index] = x;
Rc[index] = c;

This code works fine in device emulation mode, but when running on the GPU the Rx and Rc arrays do not seem to be updated with the new x and c values. Can anyone please tell me what I’m doing wrong?

Sorry, I only just came across your post. 64-bit ints seem to work fine on my 8800GTS 640MB (compute capability 1.0) under CUDA 1.1 on 64-bit Linux. I haven’t tried under CUDA 2 yet. Have you checked the PTX output? Does it seem to be doing a 64-bit multiply?

I’ve seen a strange error on my Tesla (cc 1.0) lately, which might be connected to what you experience.

I had some 64-bit integer arithmetic calculated like this:

result = some_calculation + 1;

After some debugging I found that the +1 was never calculated. However, rewriting it to

result = some_calculation;
result++;

solved the problem. In emulation mode both versions worked fine, but on the board only the second one gave the correct result. I’m not familiar with PTX files, so I couldn’t check whether the error is introduced at compile time or whether the device itself has a problem with this calculation.

In another thread I was told that 64-bit arithmetic is not really predictable on compute capability 1.0, so I think we have to live with it. Maybe you should try rewriting that line in a few different ways.

Thanks for your responses, it turned out that the problem was being caused by a bug elsewhere in the kernel (writing multiple times to the same location in global memory, in the same kernel). I still don’t fully understand why this affected the PRNG, but after fixing the bug, the random number generation code works perfectly.

VrahoK: I did notice in the PTX once that the compiler converted the 64-bit int to 32-bit before doing the multiply. Unfortunately I can’t remember what caused this to occur. Did you try this?

result = (unsigned long long int)some_calculation + 1ull;
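
Why an explicit cast can matter is easy to show in plain C: when both operands are 32-bit, the product is computed in 32 bits and the high half is gone before the result is ever stored in a 64-bit variable. A small sketch (the function names are made up for illustration, and this shows standard C behaviour, not what the CUDA compiler did in that particular case):

```c
#include <stdint.h>

/* Product computed in 32 bits, then widened: the high 32 bits are lost. */
static uint64_t mul_narrow(uint32_t x, uint32_t a)
{
    return x * a;
}

/* One operand widened first, so the full 64-bit product is kept. */
static uint64_t mul_wide(uint32_t x, uint32_t a)
{
    return (uint64_t)x * a;
}
```

For example, with x = a = 65536 the first version returns 0 while the second returns 4294967296.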

I’m interested in coding a random function and this one seems like a good way to do it, but I don’t understand how to initialize the input variables x, c and a.

I declared three regular arrays, and then loaded the multipliers from the file into one of them. I used a random number generator to generate initial x’s and c’s, taking care to ensure that each c was less than the multiplier. These arrays are then copied to the card, and kernels receive appropriate pointers to them to load in their value.
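
A rough host-side sketch of that setup, assuming the multipliers have already been read from the file into an array. The function names and the use of rand() as the seeding generator are placeholders for illustration, not what the original code used.

```c
#include <stdint.h>
#include <stdlib.h>

/* Build a 32-bit value from three rand() calls (11 + 11 + 10 bits), so it
   works even where RAND_MAX is only 32767. Illustrative seeding only. */
static uint32_t rand32(void)
{
    uint32_t hi  = (uint32_t)rand() & 0x7ffu;
    uint32_t mid = (uint32_t)rand() & 0x7ffu;
    uint32_t lo  = (uint32_t)rand() & 0x3ffu;
    return (hi << 21) | (mid << 10) | lo;
}

/* Fill x[] and c[] for n threads, given n nonzero multipliers in a[].
   Each carry is reduced so that c[i] < a[i]. */
static void init_mwc_state(uint64_t *x, uint32_t *c, const uint32_t *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = rand32();
        c[i] = rand32() % a[i];
        /* x = 0 with c = 0 would stay at zero forever, so avoid it. */
        if (x[i] == 0 && c[i] == 0)
            x[i] = 1;
    }
}
```

The three host arrays would then be copied to device memory (for example with cudaMemcpy) and the device pointers passed to the kernel.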

Hi! As I said before, I’ve applied it to a video stream and I’ve noticed that some of the random pixel values are constant. I’ve tried using different txt files but the result is still the same.

Sorry, I can’t quite visualize the test you’re performing and so I’m not quite sure what you mean by some pixel values being constant.

What I meant by correlation was roughly that, if you write down two sequences belonging to two different multipliers, knowing one gives you information about the other. An extreme example would be two sequences whose values are just offset from one another, so that the two sequences aren’t really that different at all. This is bad in many applications, since ideally we’d like each sequence to be truly independent (in the mathematical sense).
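
One way to picture the offset case is the constructed sketch below. It is an illustration, not an observed result: it deliberately uses the same multiplier for both streams and seeds the second stream with a state the first stream reaches a few steps later, which makes the second stream a shifted copy of the first. It says nothing about how streams with different multipliers behave. The function names are hypothetical.

```c
#include <stdint.h>

/* Write n multiply-with-carry outputs to out[], advancing the state
   (*x, *c) in place so the caller can see where the stream ended up. */
static void fill_stream(uint32_t *out, int n, uint64_t *x, uint32_t *c, uint32_t a)
{
    for (int i = 0; i < n; ++i) {
        *x = *x * a + *c;
        *c = (uint32_t)(*x >> 32);
        *x &= 0xffffffffull;
        out[i] = (uint32_t)*x;
    }
}

/* Count positions i in [0, n) where s2[i] equals s1[i + lag].
   s1 must hold at least n + lag values. */
static int count_offset_matches(const uint32_t *s1, const uint32_t *s2, int n, int lag)
{
    int matches = 0;
    for (int i = 0; i < n; ++i)
        if (s2[i] == s1[i + lag])
            ++matches;
    return matches;
}
```

If stream 1 is run for 3 steps, its state copied as the seed for stream 2, and both are then run for 5 more steps, count_offset_matches(s1, s2, 5, 3) returns 5: every value of stream 2 already appears in stream 1, three positions later.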