Generating random numbers in CUDA

I’m trying to generate a random seed value for the srand function in a kernel I’m running, but I keep getting an error. This code:

srand(time(NULL));
float r = rand();

produces the error:

“.\simpleGL_kernel.cu”, line 67: error: expression must have (pointer-to-) function type

srand(time(0));

Is the time function not accessible in a kernel? I wasn’t having any problems with srand or rand, only time.

If time is not usable, does anyone have a suggestion for generating random seeds?

-skiz

Pretty sure time() wouldn’t be callable in a kernel. You’ll probably have to call time() from the host once for each thread and upload the results to the kernel.

And even if time() did work, you shouldn’t use it if you want to have a different seed for each thread. (For the same reason, clock() won’t work either.)
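A minimal host-side sketch of that per-thread seeding (hypothetical names; the seeds are just time() plus the thread index, so they are distinct though correlated):

#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

// Host side: build one distinct seed per thread and copy them to the device.
void uploadSeeds(unsigned int* d_seeds, int num_threads)
{
    unsigned int* h_seeds = (unsigned int*) malloc(num_threads * sizeof(unsigned int));
    unsigned int base = (unsigned int) time(NULL);
    for (int i = 0; i < num_threads; ++i)
        h_seeds[i] = base + i;  // distinct, though correlated, per-thread seeds
    cudaMemcpy(d_seeds, h_seeds, num_threads * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    free(h_seeds);
}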

Other things to be worried about:

  • rand() is implemented poorly in most C libraries, with a short period and other undesirable qualities. You probably want to know more about the implementation of rand() for CUDA if you need high-quality random numbers (for Monte Carlo algorithms, etc.).

  • Not all random number generators stay random when used in parallel. That is, a collection of 1 million random numbers generated by a given algorithm with 1 seed might pass all randomness tests. However, a collection of 1 million random numbers produced by the same algorithm generating 100 numbers each with 10,000 seeds might not be so random. The latter case is what happens when you generate random numbers in each thread independently.

Anyway, this isn’t to say rand() won’t work for you, but that you might want to be careful.

There’s an implementation of rand() for CUDA?

Thanks for the comments. I think rand() should be enough for my purposes, but if not I’ll figure something out from there.

As an aside, would the rand() implementation for CUDA be different from the C implementation?

Friends don’t let friends use rand()… :-)

See L’Ecuyer and Simard’s TestU01 paper:

http://www.iro.umontreal.ca/~simardr/testu01/tu01.html

Several of the xorshift based RNGs look like good candidates for GPUs.
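As a sketch of the flavor, here is a minimal xorshift32 step (Marsaglia’s 13/17/5 shift parameters; untested on the GPU, and the per-thread state must be seeded to a nonzero value):

__device__ unsigned int xorshift32(unsigned int* state)
{
    // Three shift-XOR steps; period 2^32 - 1 over nonzero states.
    unsigned int x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}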

I have need for decent RNGs in my own CUDA code, but I haven’t yet had a chance to go through the statistically sound RNGs in the TestU01 suite and measure their performance on the GPU. Yet another thing I hope to have time to work on in a month or so…

Cheers,

John Stone

This would be awesome, especially if you can find a seeding strategy that lets you use the RNG safely on 128 processors. One strategy that is sometimes used is the “leapfrog” method, where a common seed is used, but each thread starts at a different offset in the sequence. Each time a number is requested, the thread-local RNG jumps ahead by N, where N is the number of threads, so each thread gets a disjoint subset of the random number stream. This is really slow, unfortunately, though maybe if these xorshift RNGs are fast enough, burning some clock cycles to get uncorrelated random number sequences won’t be a problem.
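For an LCG of the form x' = a*x + c (mod 2^32), the leapfrog jump is at least cheap to express, since N sequential steps compose into a single affine step. A sketch, with A_N and C_N the precomputed N-step constants a^N and c*(a^(N-1) + ... + a + 1), both mod 2^32:

// Thread i seeds its state at sequence offset i; each draw then jumps
// ahead by N = total thread count, so the threads' streams are disjoint.
__device__ unsigned int lcg_leapfrog(unsigned int* state,
                                     unsigned int A_N, unsigned int C_N)
{
    *state = *state * A_N + C_N;  // one draw == N sequential LCG steps
    return *state;
}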

I’ve implemented a highly efficient (if not highly “random”) RNG pinched from Numerical Recipes:

/**
 * Generate a uniform pseudo-random number in [0,1).
 * See "Numerical Recipes" for details.
 */
__device__ float randu(unsigned int tid)
{
    #define MULTIPLIER  ((unsigned int) 1664525)
    #define OFFSET      ((unsigned int) 1013904223)
    #define MODULUS     ((double) 4294967296.0)   // 2^32 as a double
    #define MODULUS_INV ((float) (1.0 / MODULUS)) // let the compiler operate on doubles and then cast to float, or manually enter the ratio as a float constant: 2.328306e-10f

    unsigned int sNew = SEED(tid) * MULTIPLIER + OFFSET;
    SEED(tid) = sNew;
    float res = sNew * MODULUS_INV;
    return res;
}

This is highly efficient in its arithmetic operations. The GeForce 8800 series appears to overflow unsigned ints in a “graceful” way, i.e., mod 2^32, avoiding the use of the clock-expensive mod operator.

Here, SEED(tid) is a macro to refer to an array of seeds in shared memory, for example:

extern __shared__ unsigned int seeds[];
#define SEED(i)   CUT_BANK_CHECKER(seeds, i)
//#define SEED(i) seeds[i]

The seeds reside in shared memory so they can be updated from call to call. This requires 2 KB of shared memory for a 512-thread block (one 4-byte seed per thread), which is tolerable for most applications.

I pass the initial seeds to the kernel in a global memory array – one seed for every thread in every thread block – and initialize the RNG from there:

/**
 * Test Monte Carlo simulation.
 */
__global__ void testMonteCarlo(unsigned int* g_seeds, ...)
{
    const unsigned int num_threads = blockDim.x;
    const unsigned int tid = threadIdx.x;
    const unsigned int bid = blockIdx.y * gridDim.x + blockIdx.x;

    // initialize the seeds from global memory for this block
    SEED(tid) = g_seeds[bid * num_threads + tid];
    ...
}

While this linear congruential RNG is not perfect, it’s better than the old RANDU, and its speed and simplicity outweigh its lack of randomness for most purposes.

Passing the output through a Box-Muller transform to get Gaussian variates, I’ve attained speeds over 4 billion Monte Carlo steps per second on a GeForce 8800 GTS.
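For reference, here is a minimal Box-Muller step built on the randu() above (a sketch; subtracting u1 from 1.0f keeps the log argument nonzero, since randu() returns values in [0,1)):

// Turn two uniform draws into a pair of independent standard normals.
__device__ float2 randn_pair(unsigned int tid)
{
    float u1 = 1.0f - randu(tid);               // in (0,1]
    float u2 = randu(tid);
    float r     = sqrtf(-2.0f * logf(u1));
    float theta = 2.0f * 3.14159265f * u2;
    return make_float2(r * cosf(theta), r * sinf(theta));
}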

I think I might take a crack at implementing one of these. I can post results if I end up using it. It’s for a research project for class, so doing some RNG research as well would be a nice addition. Plus, it seems like something the community could use. Not like I’m an expert or anything, but I’ll give it a shot.

skiz

Yeah, I think it would be a fun project to take the best RNGs from TestU01 and implement them on the GPU. The best candidates are going to be ones that have tiny state vectors and avoid modulo and other operations that the GPU hardware doesn’t like…

Yeah, I’ve tried variations of the leapfrog method, and besides the speed issue, you want to start with a long-period RNG if you intend to dice up the RN stream significantly like I was doing…

John

I too am evaluating/designing a port onto the G80 where being able to move RNG would help justify its use. See also [url=“http://forums.nvidia.com/index.php?showtopic=30169”]http://forums.nvidia.com/index.php?showtopic=30169[/url] (rev link).

The linear congruential example above has a relatively short period and would require the use of the leapfrog method here. Not really suitable for serious work.

The Multiply-With-Carry generator mentioned in the above link has a much better period but would also require using the leapfrog method or serialising the GPU. Its efficient implementation requires access to the carry flag and a 32x32=>64-bit integer multiplication. This raises the question of whether the G80 assembler manual is publicly available (I have not noticed it, but have not searched) and whether there is an asm() facility in nvcc (one can write a ptx file, I guess); otherwise the C implementation requires breaking it down into 16-bit integers.
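For illustration, Marsaglia’s classic paired 16-bit multiply-with-carry sidesteps the 64-bit multiply entirely by keeping each stream’s 16-bit lag in the low halfword and its carry in the high halfword (a sketch, untested on the G80; z and w must be seeded nonzero):

__device__ unsigned int mwc_rand(unsigned int* z, unsigned int* w)
{
    *z = 36969 * (*z & 0xffff) + (*z >> 16);  // update stream 1 and its carry
    *w = 18000 * (*w & 0xffff) + (*w >> 16);  // update stream 2 and its carry
    return (*z << 16) + *w;                   // combine the two 16-bit streams
}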

For quite a while I have been using the original Mersenne Twister very successfully in my application - it is very fast, efficient, and has very good characteristics. My evaluation for the G80 architecture indicates that it is an absolutely ideal candidate (IMHO)… yes, there is a large state vector of 2.5 KB, but that is just the last 624 32-bit results, and these may be computed in parallel in 3 steps (the first 227 results, the next 227, and then the final 170) using fast opcodes on the G80 (no div or mod). No leapfrog required. If your algorithm uses randoms but they are not most of the application load, it would be OK to keep this state in global memory - if it is very randomness-intensive, you could probably afford to keep it in shared memory. If no one else does it first, I will post an implementation - it should be an example in the SDK (it could be a while before I get to it - waiting for the next release of CUDA).

Cheers, Eric

I’ve got an implementation that generates over 3 billion MT random numbers/second on the G80 (about 2.7 billion/second random floats between 0 and 1)… here is the main bit of code:

#define N (624)
#define MT_THREADS 312
#define M (397)
#define MATRIX_A   0x9908b0dfUL /* constant vector a */
#define UPPER_MASK 0x80000000UL /* most significant w-r bits */
#define LOWER_MASK 0x7fffffffUL /* least significant r bits */

#ifdef _WIN32
typedef unsigned int uint;
#endif //_WIN32

__shared__ uint g_seeds[624];

__device__ uint mag01(uint fs) {
    return fs * MATRIX_A;
}

__device__ uint2 mt_gen_rand(uint tid) {
    int kk;
    uint y;
    uint2 ret;

    __syncthreads();
    kk = tid;

    // first 227
    if (kk < 227) {
        y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
        g_seeds[kk] = g_seeds[kk+M] ^ (y >> 1) ^ mag01(y & 0x1UL);
    }
    kk += 227;
    __syncthreads();

    // next 192
    if (kk >= 227 && kk < 419) {
        y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
        g_seeds[kk] = g_seeds[kk+(M-N)] ^ (y >> 1) ^ mag01(y & 0x1UL);
    }
    kk += 192;
    __syncthreads();

    // last 205
    if (kk >= 419 && kk < 623) {
        y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
        g_seeds[kk] = g_seeds[kk+(M-N)] ^ (y >> 1) ^ mag01(y & 0x1UL);
    }
    if (kk == 623) {
        y = (g_seeds[N-1]&UPPER_MASK) | (g_seeds[0]&LOWER_MASK);
        g_seeds[N-1] = g_seeds[M-1] ^ (y >> 1) ^ mag01(y & 0x1UL);
    }
    __syncthreads();

    ret.x = g_seeds[2*tid];
    /* Tempering */
    ret.x ^= (ret.x >> 11);
    ret.x ^= (ret.x << 7)  & 0x9d2c5680UL;
    ret.x ^= (ret.x << 15) & 0xefc60000UL;
    ret.x ^= (ret.x >> 18);

    ret.y = g_seeds[2*tid+1];
    /* Tempering */
    ret.y ^= (ret.y >> 11);
    ret.y ^= (ret.y << 7)  & 0x9d2c5680UL;
    ret.y ^= (ret.y << 15) & 0xefc60000UL;
    ret.y ^= (ret.y >> 18);

    return ret;
}

Yep, that’s exactly it! One could do up to 227 at step 2; reducing it to 192, I guess, minimises overall block usage. Average thread utilisation is over 208, which is good (tempering runs all 312 threads here). Perhaps NVIDIA would like to package this up into a sample, as there have been a few questions on RNGs.

Thanks bbudge!

And BTW, you probably should have included the following, since this is a public forum and inserting a few __syncthreads() calls and other minor adjustments would count as just modifications:

:) Good call. I didn’t think of that.

I should note that it may be faster/better to run this in its own kernel to generate random numbers, and then run another kernel to use them. It’s much easier this way, mostly due to the __syncthreads() calls, but also because you can get better performance with a multiple of 32 threads.
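Something like this, roughly (a sketch; d_out and per_thread are hypothetical names, and the state load/store is written inline):

// Fill a global buffer with tempered MT outputs in a dedicated kernel;
// a separate kernel can then consume d_out with no sync worries.
// Launch with MT_THREADS (312) threads per block.
__global__ void mt_fill(uint* seeds, uint2* d_out, uint per_thread)
{
    uint tid = threadIdx.x;
    // load this block's 624-word state into shared memory (2 words/thread)
    g_seeds[2*tid]   = seeds[blockIdx.x * N + 2*tid];
    g_seeds[2*tid+1] = seeds[blockIdx.x * N + 2*tid + 1];
    __syncthreads();
    for (uint i = 0; i < per_thread; ++i)
        d_out[(blockIdx.x * per_thread + i) * MT_THREADS + tid] = mt_gen_rand(tid);
    __syncthreads();
    // write the state back so the next launch can resume
    seeds[blockIdx.x * N + 2*tid]     = g_seeds[2*tid];
    seeds[blockIdx.x * N + 2*tid + 1] = g_seeds[2*tid+1];
}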

Brian

I’ve had a request for seeding this thing. Here’s typically how I use it.

These need to be called before and after you generate more random numbers… essentially saving the current state vector out and reading it in.

__device__ void mt_resume(uint *seeds, uint tid) {
    uint twotid = 2*tid;
    uint totalid = blockIdx.x * N + twotid;
    g_seeds[twotid]   = seeds[totalid];
    g_seeds[twotid+1] = seeds[totalid+1];
    __syncthreads();
}

__device__ void mt_stop(uint *seeds, uint tid) {
    __syncthreads();
    uint twotid = 2*tid;
    uint totalid = blockIdx.x * N + twotid;
    seeds[totalid]   = g_seeds[twotid];
    seeds[totalid+1] = g_seeds[twotid+1];
}

These can be called at the beginning… for simplicity, and because speed shouldn’t matter much here, I just use thread 0 of each block to calculate this… also, I think it would be quite tricky to parallelize.

__global__ void mt_rand_init(uint *seeds) {
    uint tid = threadIdx.x;
    if (tid == 0) {
        g_seeds[0] = 1 + blockIdx.x;
        for (uint i = 1; i < N; ++i) {
            g_seeds[i] = (1812433253UL * (g_seeds[i-1] ^ (g_seeds[i-1] >> 30)) + i);
        }
    }
    mt_stop(seeds, tid);
}

__global__ void mt_rand_init_array(uint *seeds, uint *init) {
    uint tid = threadIdx.x;
    if (tid == 0) {
        g_seeds[0] = init[blockIdx.x];
        for (uint i = 1; i < N; ++i) {
            g_seeds[i] = (1812433253UL * (g_seeds[i-1] ^ (g_seeds[i-1] >> 30)) + i);
        }
    }
    mt_stop(seeds, tid);
}

Brian

Hi Brian, Just as a coda… my expression of the core body looks like:

kk = tid;
if (kk < 227) {
    // first 227
    y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
    g_seeds[kk] = g_seeds[kk+M] ^ (y >> 1);
    if (y & 1) {
        g_seeds[kk] ^= MATRIX_A;
    }
    kk += 227;
    __syncthreads();

    // next 227
    y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
    g_seeds[kk] = g_seeds[kk+(M-N)] ^ (y >> 1);
    if (y & 1) {
        g_seeds[kk] ^= MATRIX_A;
    }
    kk += 227;
    __syncthreads();

    // last 170
    if (kk < 623) {
        y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
        g_seeds[kk] = g_seeds[kk+(M-N)] ^ (y >> 1);
        if (y & 1) {
            g_seeds[kk] ^= MATRIX_A;
        }
    }
    else if (kk == 623) {
        y = (g_seeds[N-1]&UPPER_MASK) | (g_seeds[0]&LOWER_MASK);
        g_seeds[N-1] = g_seeds[M-1] ^ (y >> 1);
        if (y & 1) {
            g_seeds[N-1] ^= MATRIX_A;
        }
    }
}
__syncthreads();

Now I don’t have a running install so I can’t test, and I only discovered CUDA a week or so ago, so bear with me - my understanding of the doco indicates that the above code should be congruent. Several issues: presumably the compiler inlined your calls to mag01(), but that would have had to use the slow 8-clock integer multiply, whereas a compare with a short enough body for branch predication should be faster. This runs 8 warps for the first 2 steps then 7 for the last, whereas yours runs 8, 7, 8, which should be equivalent in performance. Do you agree?

One final question for the experts: if your application is significantly lumpy (blocks varying in the number of running warps), is there any overhead for the warps that just drop from sync to sync? (In your example the top warps drop from one sync to the next; above them, the topmost warps wait at the last one.) This is of interest when using a routine like this within a block with a full house of 16 warps.

Thanks, Eric (this is where you tell me it does not work!)

I haven’t looked in depth at your code; my brain is way too fried at the moment to make sure it works :) You’re totally right about the mag01() function… assuming predication is used, that should be a bit faster. Another approach would be to use constant memory, or a shared memory location (difficult to say which would be faster), and index in to get the value.

Actually, the number of warps I’m using is 8,6,7. I’m not sure how significant this actually is, but if you are running multiple blocks per multiprocessor, I suppose it makes a difference.

I’m not entirely sure about your last paragraph there. Not 100% understanding…

Oops - and mine runs 8, 8, 6… On the last question, I mean this: like this code, any complex app is going to have to grab its maximum number of threads in each block, and then if one does as here (always using threads from 0 up to the number required, which is a neat method), the top threads (warps) will always just keep skipping from sync to sync without doing any computation until they are required. The question is: does this cost anything in performance?

ed: OK there must be at least a TID comparison.

Ah, I think I understand your question. You’re wondering about the idle warps… wondering if they have overhead.

My understanding is that they are essentially free (other than the tid comparison), but this would be a question for a G80 engineer for sure.

Hi Brian, I think that there is a requirement that every thread gets to every sync, no exceptions, so my previous post was broken in that regard… try this:

#define B2 224 /* size of the second block */

kk = tid;
// first 227
if (kk < N-M) {
    y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
    g_seeds[kk] = g_seeds[kk+M] ^ (y >> 1);
    if (y & 1) {
        g_seeds[kk] ^= MATRIX_A;
    }
    kk += N-M;
}
__syncthreads();

// next 224
if (kk < N-M + B2) {
    y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
    g_seeds[kk] = g_seeds[kk+(M-N)] ^ (y >> 1);
    if (y & 1) {
        g_seeds[kk] ^= MATRIX_A;
    }
    kk += B2;
}
__syncthreads();

// last 173
if (kk < N-1) {
    y = (g_seeds[kk]&UPPER_MASK) | (g_seeds[kk+1]&LOWER_MASK);
    g_seeds[kk] = g_seeds[kk+(M-N)] ^ (y >> 1);
    if (y & 1) {
        g_seeds[kk] ^= MATRIX_A;
    }
}
else if (kk == N-1) {
    y = (g_seeds[N-1]&UPPER_MASK) | (g_seeds[0]&LOWER_MASK);
    g_seeds[N-1] = g_seeds[M-1] ^ (y >> 1);
    if (y & 1) {
        g_seeds[N-1] ^= MATRIX_A;
    }
}
__syncthreads();

So it looks like it is possibly quite expensive to have lots of idle warps above those that you are using.

Thanks, Eric

ed: NVIDIA, it would be really nice if __syncthreads() accepted an argument - the number of threads to sync!