Generating random numbers in CUDA

Probably not. At any moment, the same state means the same output elements.

Yes, each with different matrixA and tempering.

In my case it is every block (or it could be every warp); in the SDK case, every thread.

Yes, they produce a different output from the same state, and the next-state step is different as well.

Yes, my implementation provides a safe expression of a single RNG across multiple threads, where the number of concurrent outputs is limited by the size of the shift register and its feedback point (227 in this case). This gets you what you need for a lot of applications and uses the old faithful MT19937, which has been heavily used and tested for a decade now - you know what you are getting. If using dcmt, it is new code (there is a strong health warning from the authors) and will require significant testing and measuring if used for any critical application. Nvidia did not stress this point. In the SDK a separate RNG is run for each thread, so that's 12K separate MT607 RNGs (for 100% occupancy) that all have to be coprime to each other (well, coprime generators), and I would want to be doing a lot of measurements to assure myself that it actually worked!
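To make the 227 limit concrete, here is a minimal sketch (my illustration, not the posted library) of the first phase of the state update (words 0..226) for a single shared MT19937 state driven by up to N-M = 227 threads. Each thread reads everything it needs before the barrier and writes only after it, so no state word is read while another thread is overwriting it:

#define N 624
#define M 397
#define MATRIX_A   0x9908b0dfUL
#define UPPER_MASK 0x80000000UL
#define LOWER_MASK 0x7fffffffUL

__shared__ unsigned int mt[N];          /* one generator state per block */

/* call with k = threadIdx.x, block size <= 227; further phases must
   rotate through the remaining state words in the same read/barrier/write
   pattern, and tempering is omitted for brevity */
__device__ unsigned int mt_step(int k)
{
    unsigned int y = (mt[k] & UPPER_MASK) | (mt[k + 1] & LOWER_MASK);
    unsigned int next = mt[k + M] ^ (y >> 1) ^ ((y & 1u) ? MATRIX_A : 0u);
    __syncthreads();                    /* all reads done before any write */
    mt[k] = next;
    __syncthreads();
    return next;
}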

This is the change to the defines in the mt19937_ref.cu file for one of the dcmt 19937 generators - with M=312 the max number of concurrent outputs is N-M = 312, significantly more than old faithful, allowing a useful block size of 256.

#ifdef DCMT
#define M               312
#define MATRIX_A        0x9efd0000
#define SHIFT0          12
#define MASKB           0x8d757780
#define MASKC           0xee558000
#else
#define M               397
#define MATRIX_A        0x9908b0df      /* Constant vector a */
#define SHIFT0          11
#define MASKB           0x9d2c5680
#define MASKC           0xefc60000
#endif

Note MASKB == TEMPER1 etc., and SHIFT0 is the first tempering shift.
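For reference, the tempering stage these defines parameterise looks like this (a sketch; the remaining shift amounts are the standard MT19937 values, which I am assuming here):

__device__ unsigned int temper(unsigned int y)
{
    y ^= (y >> SHIFT0);                 /* 11 for MT19937, 12 for this DCMT set */
    y ^= (y << 7)  & MASKB;             /* MASKB == TEMPER1 */
    y ^= (y << 15) & MASKC;             /* MASKC == TEMPER2 */
    y ^= (y >> 18);
    return y;
}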

I have kicked off a generator run at low priority with p=19937 to see if I can get 48 for 100% occupancy and will post if it can find that many. Might take a week.

Eric

If you had used DCMT to spawn multiple generators with the period of 2^19937 - 1, then it's definitely not the good old MT19937, which is single-threaded and has exactly these bitmask parameters:

#define MATRIX_A        0x9908b0df
#define MASKB           0x9d2c5680
#define MASKC           0xefc60000

At the same time the masks (and possibly some other parameters) are different for each of the DCMT-generated twisters, so they are also “unreliable”. The same state size alone (N = 624) doesn't tell you much.

In the end, if you don’t trust DCMT, you can pick a different algorithm.

Absolutely, one needs to be aware that it is not as well tested.

If you look above you will see that your defs are the #else part of the #ifdef DCMT above. Also the authors changed SHIFT0 from 11 to 12 for DCMT.

I have not looked at the theory in detail, but I did check the code, and SHIFT0, SHIFT1, SHIFTB and SHIFTC are all constants for all DCMT generators produced by version 0.4 (published last month). The only parameters that vary across all IDs for a given prime are matrixA, maskB and maskC, so it is safe to leave the SHIFT* values as defines at present. So my implementation of DCMT will have a small array in constant memory for the 48 three-word entries.
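Something like the following sketch (hypothetical names; the shift amounts other than SHIFT0 are assumed to be the standard MT ones):

struct dcmt_param { unsigned int matrixA, maskB, maskC; };  /* one 3-word entry */

__constant__ dcmt_param dcmtp[48];      /* one generator per block */

__device__ unsigned int dcmt_temper(unsigned int y)
{
    const dcmt_param *p = &dcmtp[blockIdx.x];  /* ideally held in 1 register */
    y ^= (y >> 12);                     /* SHIFT0 for these generators */
    y ^= (y << 7)  & p->maskB;
    y ^= (y << 15) & p->maskC;
    y ^= (y >> 18);
    return y;
}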

I still think DCMT is probably the best option; however, my money is on using fewer RNGs with wider outputs compared to 1 per thread. For a start you only need 2 words of state per thread, and the period of each is much longer. I am still waiting for DCMT to find one set of parameters for the monster, DCMT44497, that will allow 512 outputs in parallel (using my code) with a period more than 10^7300 times as long as MT19937. I have a few params for DCMT23209, but that just does not make 384 results in parallel.

Eric

ed: clarification & correction

Just for completeness, here is the dcmt19937 sample project. Everything is there, including the dynamic creator program. The dependency in the Makefile has been commented out, as otherwise typing make won't come back for up to a week - not what you would expect. At least you can see how it was made. If you do rerun the DC parameter generation you will get a different set that should be coprime to each other, not to the ones here.

There is no significant change to the library posted in the mt19937 project above; just a few parameters are indexed by blockIdx.x. They have been arranged in 1 array in constant storage so that it should have cost just 1 register to keep &dcmtp[blockIdx.x] around to directly reference everything; however, the compiler got its knickers in a knot and used 2 or 3 registers (depending upon example), which blows 100% occupancy on all but the dcmt19937sk kernel.

As noted above, one should use mt19937 wherever one RNG per block with different seeding is suitable for the application, and use dcmt19937 where a single MC calculation is being spread over a number of blocks, bearing in mind that dcmt19937 is not as well tested - it should work, however you are advised to do whatever statistical checks are required for your application. While mt19937 is slightly faster, it can only calculate 227 results in parallel, whereas dcmt19937 can calculate 312 in parallel per RNG, though 256 is probably the most useful number. Hardwiring one set of parameters into dcmt19937 will get the best single-RNG performance.

I think you would have to agree that it is a lot easier to see what is going on here than in the Nvidia SDK project.

Cheers,
Eric :)

[attachment=4212:attachment]
ed: thought it worthwhile to add the highest-performing DCMT config to the end - it extrapolates to 6.2e9 DCMT randoms/sec on an 8800 Ultra, and that does include writing the results to global memory.

Nice job, Eric!

However after a brief look I have a question: does your implementation rely on the “fact” that shared memory content (those extern shared arrays defined in global scope) is preserved between different kernel invocations?

Please note that the “fact” is untrue: shared variables are still local to thread blocks. By the model, their values are undefined at thread block launch, and you must flush them to global memory if you want them to be saved for subsequent kernels. Even if it appears to be working, neither hardware nor software guarantees such behaviour. In fact, shared memory is used internally by graphics, so it will be “corrupted” almost for sure if one or more 3D contexts exist on the same GPU and a context switch took place between the kernel calls.
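In other words, each kernel has to bracket its work with an explicit load and store of the RNG state, along these lines (a minimal sketch of the pattern, assuming an MT19937-sized state):

#define N 624                           /* MT19937 state words per block */

__global__ void use_rng(unsigned int *g_state)  /* N words per block */
{
    __shared__ unsigned int s_state[N];

    /* resume: global -> shared; shared memory is undefined at launch */
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        s_state[i] = g_state[blockIdx.x * N + i];
    __syncthreads();

    /* ... generate and consume random numbers here ... */

    /* stop: shared -> global, so the next launch can resume */
    __syncthreads();
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        g_state[blockIdx.x * N + i] = s_state[i];
}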

Thanks, Victor.

I think you are a little confused and did not look at the actual sample kernels (in dcmt19937_kernel.cu), where the initialisation routine mt19937si() is always called, within the kernel it is used in, before any of the shared-memory state versions are used. Patrik also did not look at the sample __global__s. More doco required?

The dcmt19937_ref.cu file is designed to be included into any other project, and the compiler will inline any routines that you use. Just one file is required that is its own header - I had to separate out the dcmtparam19937.inc file because device host constant is broken, or NYI at present, and it simplified the makefile.
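Schematically, the usage pattern in the sample kernels is the following (a sketch: mt19937si() is the seeding routine named above, which I am assuming takes a seed word, and mt19937s() is a hypothetical stand-in for whichever shared-state output routine you use):

__global__ void sample_kernel(unsigned int seed, float *out)
{
    mt19937si(seed);    /* (re)initialise the shared state inside the kernel */
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        mt19937s() * (1.0f / 4294967296.0f);    /* uint -> [0,1) */
}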

Eric

ed: Perhaps all the symbols need to be made longer to avoid confusion…

Just a quick note to say that the library above is not yet certified for all Mersenne prime exponents. It is correct as posted; however, verification of DCMT44497 revealed a race condition at 696 threads. A new version with documentation and comments, all parameterised by Mersenne exponent, is ready; however, it is on hold until V1.1, as that seems to be almost ready and the project may require some changes.

Eric

Nice! I started calculating the MT19937 parameters myself and it will indeed take a while…

However, I'm not sure you are correct that rerunning will yield parameters that are not coprime to those you have. As I understand the paper, generators with different a are distinct. By embedding the ID number in a, you necessarily get parameters that are coprime to those with other IDs. Rerunning will give you parameters that are coprime to each other and (likely) also to the ones you have, as long as, for a given ID, the a is different. Your a's start 0xcc420000, 0xabbe0001. Mine start 0x8c950000, 0x8c740001. That means these sets are coprime too, if I understand correctly.

Yes, I left out the word “necessarily” - trying to keep it simple. My understanding is that the matrixAs only need to be distinct. The generation algorithm takes your ID, puts it into the LSBs, selects a random set of top bits (using mt19937), eliminates obviously reducible polynomials, then runs a test for full periodicity - that is what takes all the time. I even looked at porting this process quickly: only 2 for loops, but their range would require everything from global memory, and the upper levels of the code would have needed hacking as well. Much cheaper to fire off a bunch of CPUs and manually collate the results - old-fashioned parallel processing. The update has 256 sets for DCMT19937, which is probably the best size for G80 - I also now have > 16 sets for DCMT44497, which is more appropriate for a bigger architecture - some of these took a CPU week each.

Eric

Dear all,

By chance we found this discussion about quick congruential-shift random number generators. We happen to have just implemented the GNU rand48 for CUDA. It gives EXACTLY the same random numbers as the standard C library, just about 150 times faster :-) If you are looking for a reasonably good RNG for e.g. a many-particle MC or MD, you might want to have a look at our MD on a GPU webpage.
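For anyone curious, the rand48 recurrence itself is just x = (a*x + c) mod 2^48 with a = 0x5DEECE66D and c = 0xB. A naive per-thread sketch (my illustration, not the authors' implementation, which presumably avoids 64-bit multiplies on G80-class hardware):

__device__ float rand48_next(unsigned long long *x)
{
    const unsigned long long a = 0x5DEECE66DULL, c = 0xBULL;
    *x = (a * (*x) + c) & 0xFFFFFFFFFFFFULL;    /* mod 2^48 */
    return (*x >> 17) * (1.0f / 2147483648.0f); /* lrand48-style: 31 bits -> [0,1) */
}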

Hope it helps!

PS: As previous posters already mentioned, I cannot recommend any RNGs taken from Numerical Recipes, as they are known to perform exceptionally poorly!

lord_jake: Your download gives a file not found error: http://www.amolf.nl/~vanmeel/mdgpu/files/mdgpu_rng.tar.gz

To lord_jake:

0. What problem do you intend to solve using MC/MD?
1. Why does an MD simulation need random numbers in the first place? Isn't it fully deterministic?
2. For MC, how do you parallelize it? Does your algorithm need an independent RNG for each thread?
3. Isn't the 2^48 period too short for a scientific project?

Nosé-Hoover and related thermostats are fully deterministic. But there is also a Langevin thermostat, sometimes used in MD, that uses random forces.

Thanks for the information.

Whooopsy. I’m terribly sorry.

Thx for the info. It is fixed now…

@0: Many-particle simulations, for instance for soft matter research (polymers, colloids, DNA, biological macromolecules, etc.).

@1: There are many problems in MD for which you need an RNG. Take for example a many-particle system where you keep the temperature constant. For these systems you need a thermostat. Some of them are fully deterministic, but many are stochastic, which means they use lots of random numbers.

@2: As far as I know, MC algorithms are only parallelizable in an efficient manner if they are lattice-based or if moves can be applied locally, which is not trivial. Whether your algorithm requires an independent RNG per thread typically depends on your taste. There are real flamewars (or their pendant in science) going on about this topic.

@3: For many-particle systems, which generally are chaotic, the length of the RNG period as well as its ‘randomness’ have minor impact on the system behavior. It is more critical if your system has less dynamics of its own and depends heavily on the RNG. Only if the RNG has a really small period, like 2^24 = 16 mio, can this cause trouble for many-particle systems, because for a system of 1 mio particles the cycle is repeated every 5-6 time steps (3 numbers per particle per time step required).
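Spelled out, for 1 mio particles:

\[
\frac{2^{24}\ \text{numbers per cycle}}{10^{6}\ \text{particles} \times 3\ \text{numbers per particle per step}} \approx 5.6\ \text{time steps per cycle.}
\]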

Jake

hi

I am looking for a random number generator for the GPU (multi-threaded) for my Monte Carlo simulation. The code in this thread seems very promising and cleaner than the MT example in the CUDA SDK. But a bunch of the parameters are not clear to me:

What does N mean? MT_THREADS=312 - does that mean I can use no more than 312 threads? What is “tid”? And the return value ret is a uint2 - are both components random integers?

It would be really useful if someone could provide a working example using the enclosed subroutines; it would be much easier for beginners like me to digest. Anyone want to share?

thanks

I think my question wasn't very clear. What I want is a device function that returns a random float (or float2 or float3) between 0 and 1, something like rand() in C, but for each GPU thread. I will call this function inside a global kernel 0 to 3 times per call.

The MT example seems to fill a buffer with multiple threads; I don't really need to store all the random numbers.

Any advice?
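One way to get that (a sketch from me, not from the library posted above) is to keep a small per-thread state in registers and draw floats on demand, rand()-style. xorshift32 is used purely for illustration; its statistical quality is well below MT19937, so test it against your application's needs:

__device__ float rand_float(unsigned int *state)   /* state must be nonzero */
{
    unsigned int x = *state;            /* Marsaglia xorshift32 */
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x * (1.0f / 4294967296.0f);  /* uint32 -> [0,1) */
}

__global__ void my_kernel(unsigned int *states, float *out)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int s = states[idx];       /* seeded nonzero per thread on the host */
    out[idx] = rand_float(&s);          /* call up to 3 times per launch */
    states[idx] = s;                    /* save for the next launch */
}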

I put together a few subroutines found in this thread and tried to write a simple test program to verify the MT RNG; however, I am getting very weird results with lots of zeros.

Can someone give me a hand on this? I really appreciate your help.

Also, I may want to run my calculation over a different number of threads (i.e. MAXN). In cases where MAXN != MT_THREADS, what will happen? How can I still use this RNG?

thanks

// random number generator test code based on MT19937
// http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
// Copyright © 1997 - 2002, Makoto Matsumoto and Takuji Nishimura
// Copyright © 2005, Mutsuo Saito

#include <stdio.h>
#include <stdlib.h>

#define MT_THREADS  312
#define N           (2*MT_THREADS)  /* state words; each thread yields 2 outputs */
#define M           397
#define MATRIX_A    0x9908b0dfUL    /* constant vector a */
#define UPPER_MASK  0x80000000UL    /* most significant w-r bits */
#define LOWER_MASK  0x7fffffffUL    /* least significant r bits */
#define MAX_RAND_F  4294967295.f    /* 2^32-1: the outputs are full 32-bit uints */
#define MAXN        312             /* total number of threads */

#ifdef _WIN32
typedef unsigned int uint;
#endif //_WIN32

static __shared__ uint g_seeds[N];

__device__ uint mag01(uint fs) {
    return fs * MATRIX_A;           /* fs is 0 or 1 */
}

// generate a pair of random numbers per thread from the shared state;
// each phase reads all the state words it needs, barriers, then writes,
// so no thread reads a word another thread is overwriting
// (phase boundaries 227/419/623 assume N=624, M=397)
__device__ uint2 mt_gen_rand(uint tid) {
    uint y, v;
    uint2 ret;
    int kk;

    __syncthreads();

    // first 227
    kk = tid;
    if (kk < 227) {
        y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
        v = g_seeds[kk+M] ^ (y >> 1) ^ mag01(y & 0x1UL);
    }
    __syncthreads();
    if (kk < 227) g_seeds[kk] = v;
    __syncthreads();

    // next 192
    kk = tid + 227;
    if (kk < 419) {
        y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
        v = g_seeds[kk+(M-N)] ^ (y >> 1) ^ mag01(y & 0x1UL);
    }
    __syncthreads();
    if (kk < 419) g_seeds[kk] = v;
    __syncthreads();

    // last 205, including the wrap-around word
    kk = tid + 419;
    if (kk < 623) {
        y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
        v = g_seeds[kk+(M-N)] ^ (y >> 1) ^ mag01(y & 0x1UL);
    } else if (kk == 623) {
        y = (g_seeds[N-1]&UPPER_MASK) | (g_seeds[0]&LOWER_MASK);
        v = g_seeds[M-1] ^ (y >> 1) ^ mag01(y & 0x1UL);
    }
    __syncthreads();
    if (kk <= 623) g_seeds[kk] = v;
    __syncthreads();

    /* Tempering */
    ret.x = g_seeds[2*tid];
    ret.x ^= (ret.x >> 11);
    ret.x ^= (ret.x << 7)  & 0x9d2c5680UL;
    ret.x ^= (ret.x << 15) & 0xefc60000UL;
    ret.x ^= (ret.x >> 18);

    ret.y = g_seeds[2*tid+1];
    ret.y ^= (ret.y >> 11);
    ret.y ^= (ret.y << 7)  & 0x9d2c5680UL;
    ret.y ^= (ret.y << 15) & 0xefc60000UL;
    ret.y ^= (ret.y >> 18);

    return ret;
}

// read states from seeds[] in global memory into shared memory;
// tid must be the per-block thread index (threadIdx.x)
__device__ void mt_resume(uint *seeds, uint tid) {
    uint twotid  = tid << 1;
    uint totalid = blockIdx.x * N + twotid;
    g_seeds[twotid]   = seeds[totalid];
    g_seeds[twotid+1] = seeds[totalid+1];
    __syncthreads();
}

// save states from shared memory back to seeds[]
__device__ void mt_stop(uint *seeds, uint tid) {
    __syncthreads();
    uint twotid  = tid << 1;
    uint totalid = blockIdx.x * N + twotid;
    seeds[totalid]   = g_seeds[twotid];
    seeds[totalid+1] = g_seeds[twotid+1];
}

// initialize all the seeds for this block in shared memory
__global__ void mt_rand_init(uint *seeds) {
    uint tid = threadIdx.x;
    if (tid == 0) {
        g_seeds[0] = 1 + blockIdx.x;
        for (uint i = 1; i < N; ++i)
            g_seeds[i] = (1812433253UL * (g_seeds[i-1] ^ (g_seeds[i-1] >> 30)) + i);
    }
    mt_stop(seeds, tid);    // starts with a barrier, so thread 0's writes are visible
}

// test kernel: generate a pair of random numbers per thread
__global__ void genrand(uint seed[], float2 ran[]) {
    uint tid = threadIdx.x;                     // per-block index for the RNG state
    int  idx = blockDim.x * blockIdx.x + tid;   // global index for the output
    uint2 irand;
    mt_resume(seed, tid);
    irand = mt_gen_rand(tid);
    ran[idx].x = (float)irand.x / MAX_RAND_F;
    ran[idx].y = (float)irand.y / MAX_RAND_F;
    // now you can use the random numbers to do your calculation
    mt_stop(seed, tid);
}

//======================================================
int main() {
    float2 ran[MAXN];
    uint seed[N];
    dim3 gsize(MAXN / MT_THREADS);  // blocks of MT_THREADS threads each
    dim3 bsize(MT_THREADS);         // 312 threads, 2 outputs per thread

    float2 *gran;
    cudaMalloc((void **)&gran, sizeof(float2) * MAXN);
    uint *gseed;
    cudaMalloc((void **)&gseed, sizeof(uint) * N);  // N state words per block

    for (int i = 0; i < N; i++)
        seed[i] = rand();

    // upload the host seeds (mt_rand_init overwrites them, but harmless)
    cudaMemcpy(gseed, seed, sizeof(uint) * N, cudaMemcpyHostToDevice);

    mt_rand_init<<<gsize, bsize>>>(gseed);  // write the initial state to global memory
    for (int i = 0; i < 10; i++) {
        genrand<<<gsize, bsize>>>(gseed, gran); // each call resumes and saves the state
    }

    cudaMemcpy(ran,  gran,  sizeof(float2) * MAXN, cudaMemcpyDeviceToHost);
    cudaMemcpy(seed, gseed, sizeof(uint)   * N,    cudaMemcpyDeviceToHost);

    for (int i = 0; i < MAXN; i++)
        printf("%f\t%f\n", ran[i].x, ran[i].y);

    return 0;
}