random number generators

dimkadimon · June 25, 2008, 3:35am

Hi all,

What do you people use for a random number generator for the GPU? We want each thread to call a random number generator. If we use Mersenne Twister, it needs to store 624 integers, which each thread needs to access, making it slow. If we use something that doesn’t require any memory (like Park-Miller) then we also have problems, because that uses a mod (%) operator which is slow. Any suggestions?

By the way, we can’t use the CUDA implementation of the Mersenne Twister, because we want one generator per thread.

FullyArticulate · June 25, 2008, 11:45pm

Skip down to George Marsaglia’s post here:

http://groups.google.com/group/comp.lang.c…915080a4424068/

Whether that’s “good enough” is all a matter of degrees. :-)

As an aside, Park-Miller might work in this version:

http://www.firstpr.com.au/dsp/rand31/
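
For reference, the division-free form of Park-Miller is usually attributed to David Carta, and as far as I recall that is the method the linked page covers. The sketch below is written from memory of that method rather than copied from the page, so treat it as an illustration only; the function name `pm_next` is made up here. It computes seed * 16807 mod (2^31 - 1) using two 16-bit-split multiplies, shifts and adds, with no `%` or division. The seed must stay in the range 1 to 2^31 - 2.

```c
#include <stdint.h>

/* One step of the Park-Miller "minimal standard" generator,
 * seed' = seed * 16807 mod (2^31 - 1), without % or division.
 * Hypothetical helper name; seed must be in [1, 2^31 - 2]. */
static uint32_t pm_next(uint32_t seed)
{
    uint32_t lo = 16807u * (seed & 0xFFFFu);   /* low 16 bits times multiplier  */
    uint32_t hi = 16807u * (seed >> 16);       /* high 15 bits times multiplier */
    lo += (hi & 0x7FFFu) << 16;                /* fold the low part of hi in    */
    lo += hi >> 15;                            /* wrap the overflow bits around */
    if (lo > 0x7FFFFFFFu) lo -= 0x7FFFFFFFu;   /* single conditional reduction  */
    return lo;
}
```

Starting from seed 1, the first two values are 16807 and 282475249, and the 10,000th is 1043618065, which is the check value given by Park and Miller.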

sgratton · June 27, 2008, 10:31am

Hi there,

For one of my projects on the GPU I used a multiply-with-carry random number generator, with each thread having both its individual seed and its individual multiplier. See my webpage “parallel random number generation”, which also contains a link to files listing suitable multipliers. As I say there, I haven’t run extensive tests into how independent all the generated numbers are.

To generate 32-bit random numbers, you can use “long long int” variables to do the multiplications. If you only need 24-bit numbers and the period is sufficient, you can use the 24-bit multiply instructions to speed things up.

With variables like:

unsigned long long int x;
unsigned int tmpx;
unsigned int c;
unsigned int a;

appropriately initialized, looping over:

x=x*a+c;
c=(x>>32);
x=x&0xffffffffull;
tmpx=x;

should generate a sequence in tmpx.
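
Wrapped up as a function, the loop above might look like the sketch below. The struct and function names are invented for illustration, and fixed-width types stand in for the `unsigned long long int` and `unsigned int` variables. On the device the same body would presumably go in a `__device__` function.

```c
#include <stdint.h>

/* Per-thread multiply-with-carry state (hypothetical layout). */
typedef struct {
    uint32_t x;  /* current value       */
    uint32_t c;  /* carry, kept below a */
    uint32_t a;  /* multiplier          */
} mwc_state;

/* Advance the generator by one step and return the new 32-bit value. */
static uint32_t mwc_next(mwc_state *s)
{
    uint64_t t = (uint64_t)s->x * s->a + s->c;  /* 64-bit product plus carry      */
    s->c = (uint32_t)(t >> 32);                 /* high word is the next carry    */
    s->x = (uint32_t)t;                         /* low word is the next output    */
    return s->x;
}
```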

Best,
Steven.

nodlams · July 25, 2008, 10:53am

This Marsaglia PRNG looks perfect for my needs. However, I’ve tried implementing it and I’m having a problem with storing the state. I’ve written the following code in a kernel:

// 2^32
#define MAR_RAND_MAX 4294967296

uint index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

/* Load the PRNG state for each thread from global arrays initialised with files from here:
 * http://www.ast.cam.ac.uk/~stg20/cuda/random/index.html
 *
 * The Rx, Rc and Ra variables have type unsigned long long int *, unsigned int * and unsigned int *
 * respectively.
 */
unsigned long long int x = Rx[index];
unsigned int c = Rc[index];
unsigned int a = Ra[index];

// generate random numbers within range -1.0,1.0 in a loop (not shown)
x = x * a + c;
c = x >> 32;
x = x & 0xffffffffull;
unsigned int tmpx = (unsigned int)x;
float randVal = (float)tmpx / (float)MAR_RAND_MAX;
randVal = (randVal * 2.0f - 1.0);

// after the loop write updated PRNG state to global memory
Rx[index] = x;
Rc[index] = c;

This code works fine in device emulation mode, but when running on the GPU the Rx and Rc arrays do not seem to be updated with the new x and c values. Can anyone please tell me what I’m doing wrong?

I’m using CUDA 2 beta 2 on an 8800 GTX.

Cheers

nodlams
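
A side note on the scaling step in the code above: converting a 32-bit value to `float` and dividing by 2^32 can round up to exactly 1.0, because a `float` carries only 24 bits of mantissa. If a half-open range is wanted, one option is to keep only the top 24 bits. This is a sketch, not part of the posted kernel, and the function name is hypothetical.

```c
#include <stdint.h>

/* Map a 32-bit random integer to a float in [-1, 1).
 * Only the top 24 bits are used, so every intermediate value is
 * exactly representable and the result never reaches 1.0f. */
static float u32_to_signed_unit(uint32_t r)
{
    float u = (float)(r >> 8) * (1.0f / 16777216.0f);  /* [0, 1)  */
    return u * 2.0f - 1.0f;                            /* [-1, 1) */
}
```

The cost is that the low 8 bits of each random number are discarded, which is usually acceptable since a `float` could not represent them anyway.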

nodlams · July 25, 2008, 3:08pm

Actually, after more exploration, it seems that the write to global memory is occurring but the multiplication:

x = x * a + c;

does not seem to occur. Is multiplication of 64-bit ints supported in compute capability 1.0?

sgratton · July 29, 2008, 4:22pm

Hi there,

Sorry, I only just came across your post. 64-bit ints seem to work fine on my 8800 GTS 640MB (compute capability 1.0) under CUDA 1.1 on 64-bit Linux. I haven’t tried under CUDA 2 yet. Have you checked the PTX output? Does it seem to be doing a 64-bit multiply?

Best,
Steven.

VrahoK · July 30, 2008, 12:57pm

I’ve seen a strange error on my Tesla (cc 1.0) lately, which might be connected to what you are experiencing.

I had some 64-bit integer arithmetic calculated like this:

result = some_calculation + 1;

After doing some debugging I found out that the +1 was never calculated. However, rewriting it as

result = some_calculation;
result++;

solved the problem. In emulation mode both versions worked fine, but on the board only the second one gave the correct result. I’m not really familiar with PTX files, so I couldn’t check whether the error is introduced at compile time or whether the device has a problem with this calculation.

In another thread I was told that 64-bit arithmetic is not really predictable on compute capability 1.0, so I think we have to live with it. Maybe you should try rewriting that line in a few different ways.

Vrah
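
If the 64-bit multiply is the suspect, one way to rewrite the multiply-with-carry update is to avoid 64-bit types altogether and build the 32x32 to 64-bit product from 16-bit halves. The sketch below is plain C with invented names. On the device, the high word could presumably come from the `__umulhi` intrinsic instead, but that has not been checked here on compute capability 1.0 hardware.

```c
#include <stdint.h>

/* Full 32x32 -> 64-bit product using only 32-bit arithmetic.
 * The result is returned as separate high and low words. */
static void mul32_wide(uint32_t x, uint32_t a, uint32_t *hi, uint32_t *lo)
{
    uint32_t x0 = x & 0xFFFFu, x1 = x >> 16;
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t p00 = x0 * a0, p01 = x0 * a1, p10 = x1 * a0, p11 = x1 * a1;
    /* Middle column: at most 3 * 0xFFFF, so it cannot overflow. */
    uint32_t mid = (p00 >> 16) + (p01 & 0xFFFFu) + (p10 & 0xFFFFu);
    *lo = (p00 & 0xFFFFu) | (mid << 16);
    *hi = p11 + (p01 >> 16) + (p10 >> 16) + (mid >> 16);
}

/* One multiply-with-carry step, x*a + c, without any 64-bit type. */
static uint32_t mwc_next32(uint32_t *x, uint32_t *c, uint32_t a)
{
    uint32_t hi, lo;
    mul32_wide(*x, a, &hi, &lo);
    lo += *c;            /* add the carry into the low word ...   */
    hi += (lo < *c);     /* ... and propagate any wrap-around     */
    *x = lo;
    *c = hi;
    return lo;
}
```

Comparing this against the 64-bit version for a few inputs is also a cheap way to tell whether the 64-bit multiply is really at fault.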

nodlams · July 31, 2008, 2:03pm

Thanks for your responses, it turned out that the problem was being caused by a bug elsewhere in the kernel (writing multiple times to the same location in global memory, in the same kernel). I still don’t fully understand why this affected the PRNG, but after fixing the bug, the random number generation code works perfectly.

VrahoK: I did notice in the PTX once that the compiler converted the 64-bit int to 32 bits before doing the multiply. Unfortunately I can’t remember what caused this to occur. Did you try this?

result = (unsigned long long int)some_calculation + 1ull;

Cheers

Nodlams

Ced_39 · August 6, 2008, 7:09am


Hi everyone ! :)

I’m interested in coding a random function, and the generator nodlams posted above seems to be a good way to do it, but I don’t understand how to initialize the input variables x, c and a.

Thanks for your time and help!

sgratton · August 6, 2008, 8:57am

Hi there,

I declared three regular arrays, and then loaded the multipliers from the file into one of them. I used a random number generator to generate initial x’s and c’s, taking care to ensure that each c was less than the multiplier. These arrays are then copied to the card, and kernels receive appropriate pointers to them to load in their value.

See e.g. the code here.
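
A host-side sketch of that initialization, under some assumptions: the multipliers are already loaded into `a`, a small xorshift generator stands in for whatever generator was actually used to draw the starting values, and all names are invented. It also rejects the two states a multiply-with-carry generator never leaves, (x = 0, c = 0) and (x = 2^32 - 1, c = a - 1).

```c
#include <stddef.h>
#include <stdint.h>

/* Small host-side generator used only to draw starting values
 * (an assumption; any reasonable host generator would do). */
static uint32_t seed_rng(uint32_t *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Fill x[] and c[] for n streams, given nonzero multipliers a[].
 * Each c[i] is kept below a[i]; the two fixed points are rejected. */
static void mwc_init(const uint32_t *a, uint32_t *x, uint32_t *c,
                     size_t n, uint32_t seed)
{
    uint32_t s = seed ? seed : 1u;  /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; ++i) {
        do {
            x[i] = seed_rng(&s);
            c[i] = seed_rng(&s) % a[i];
        } while ((x[i] == 0u && c[i] == 0u) ||
                 (x[i] == 0xFFFFFFFFu && c[i] == a[i] - 1u));
    }
}
```

The three arrays would then be copied to the card (for example with `cudaMemcpy`). Here x is stored as 32 bits, so it would need widening on load if the kernel keeps it in a 64-bit variable.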

If you have any issues, in particular if you find too much correlation between the streams, please do let me know!

Thanks a lot,
Steven.

Ced_39 · August 6, 2008, 12:07pm

Thank you very much. I will let you know as soon as I can ;)

Edit: I’ve tried the algorithm on a video and it works without any strong correlation between two frames.

Ced_39 · August 7, 2008, 7:57am

Hi! As I said before, I’ve applied it to a video stream, and I’ve noticed that some of the random pixel values are constant. I’ve tried using different txt files, but the result is still the same.

Is that what you wanted to know Steven?

sgratton · August 7, 2008, 11:54am

Hi there,

Sorry, I can’t quite visualize the test you’re performing and so I’m not quite sure what you mean by some pixel values being constant.

What I meant by correlation was roughly that, if you write down two sequences belonging to two different multipliers, knowing one gives you information about the other. An extreme example would be something like:

sequence 1: … 1 4 3 6 9 2 2 …
sequence 2: … 3 6 9 2 2 5 4 …
…

where the values are just offset and hence the two sequences aren’t really that different at all. This is bad in many applications, since ideally we’d like each sequence to be truly independent (in the mathematical sense).
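
The extreme case above, where one stream is a shifted copy of another, is easy to test for directly. The sketch below (names invented here) looks for an offset at which the second sequence lines up with the first. It only catches exact shifts, not subtler statistical dependence.

```c
#include <stddef.h>

/* Return the smallest offset k such that b[i] == a[k + i] for every
 * overlapping i, requiring at least min_overlap (>= 1) matching
 * elements. Returns -1 if no such offset exists. */
static long find_shift(const unsigned *a, size_t na,
                       const unsigned *b, size_t nb, size_t min_overlap)
{
    for (size_t k = 0; k + min_overlap <= na; ++k) {
        size_t len = (na - k < nb) ? na - k : nb;  /* overlap length at offset k */
        if (len < min_overlap) continue;
        size_t i = 0;
        while (i < len && a[k + i] == b[i]) ++i;
        if (i == len) return (long)k;              /* whole overlap matched */
    }
    return -1;
}
```

With the two example sequences and a minimum overlap of 3, it reports an offset of 2.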

Best,
Steven.
