More on CUDA Random Numbers.

I want to use a random number generator (RNG) without (initially) having to become too expert about it. Mersenne Twisters seem to be well established, though there are some subtleties with parallel implementations like CUDA. MT19937 is a widely used Mersenne Twister, and Eric Mills (osiris) has posted a CUDA version to this forum.
See

http://forums.nvidia.com/index.php?showtopic=31159&st=20&p=217039&#entry217039

I have not seen a “user's manual” for Eric's code, so I have attempted an annotated example below. It is meant to generate independent, uniformly distributed random integers on the interval [0, 2^32 - 1] for each thread in a CUDA-grid. The threads are grouped in CUDA-blocks, of course, but each distinct thread in the entire grid should (I hope) receive independent random numbers.

The code is otherwise untested, but it compiles and runs on a GTX 285. See the comments within the example code for more details.

My example is not meant to be fast or used for speed tests; it is meant to be correct. It is based on my reading of Eric's test code. I have deliberately used “strange” values for the number of blocks and the number of threads per block. I am mostly interested in whether the code is correct and will generate good-quality random numbers.

Please comment on:

  1. The correctness of my use of Eric's routines.
  2. The correctness and efficiency of Eric's implementation within the CUDA environment.
  3. Whether this form of parallelism is adequate, or whether some form of “spawnTwisters.c”, as used in the CUDA SDK (MersenneTwister) example, is needed.

Thanks!

-------- CUDA Code follows --------------------
/*
Uniform random unsigned ints on the interval [0, 2^32 - 1].

This file contains an annotated example of the use
of Eric Mills’ mt19937 Mersenne Twister for CUDA,
which generates uniform random 32-bit unsigned ints
for CUDA-threads executed in CUDA-blocks.

Eric’s implementation can be copied from:

http://forums.nvidia.com/index.php?showtopic=31159&st=20&p=217039&#entry217039

Four example kernels are exhibited:
shared_mem_227_kern       - uses shared memory; up to 227 threads
shared_mem_unlimited_kern - uses shared memory; unlimited number of threads
global_mem_227_kern       - uses global memory; up to 227 threads
global_mem_unlimited_kern - uses global memory; unlimited number of threads

To modify this module for different applications,
choose one of the above kernels, and
search for “your kernel-code here”.

The code works by creating an array of randoms using
a (non-parallelized) Mersenne Twister named genrand_int32().
The array name is d_bseeds and it is created by calling
create_block_seed_array(). Each element of d_bseeds
corresponds to a CUDA-block, and is used when the block
starts running to generate an individual seed for each
CUDA-thread within the block by calling either
mt19937si() or mt19937gi().

This scheme does have the possibility of collisions
between the numbers generated in distinct CUDA-threads,
but my understanding is that collisions are unlikely
because of the large state space used in
Mersenne Twisters. I believe the Mersenne Twister
example currently (May 2009) in the SDK uses
spawnTwister to generate distinct Mersenne Twisters
with even less chance of collision.
This paragraph is based on my own speculation
and may be incorrect. (If you know better, please
let me know!)

For simplicity I have just included two of Eric's
files, mt19937_ref.cu and mt19937_gold.cpp,
into this code using #include. See below.

Compile using:

/usr/local/cuda/bin/nvcc -I. -I/usr/local/cuda/include \
    -I/home/kielhorn/NVIDIA_CUDA_SDK/common/inc \
    -L/lib random1.cu -o random1
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cutil.h>

#include "mt19937_ref.cu" // from Eric Mills' (osiris) implementation
/* Defines device functions:
mt19937si() // initializer
mt19937s()
mt19937sl()
mt19937gi() // initializer
mt19937g()
mt19937gl()
*/

#include "mt19937_gold.cpp" // from Eric Mills' (osiris) implementation
/* Defines functions:
init_genrand()
genrand_int32()
*/
typedef unsigned int Uint;

// parameters
const Uint N_BLOCKS = 500; // Number of blocks in CUDA grid
const Uint N_THREADS_PER_BLOCK = 50; // Number of threads in each CUDA block
const Uint RAND_SEED = 37; /* 37 is an arbitrary seed. Any number could be used.
You could use time(NULL). */

// computed constants
const Uint BLK_SEED_ARRAY_SZ = N_BLOCKS * sizeof(Uint); // seed array size

Uint*
create_block_seed_array()
{
// Load h_bseeds with random unsigneds, and copy to (device) d_bseeds
// Return a pointer to the device array.

Uint h_bseeds[N_BLOCKS]; // array of block-seeds for host
Uint *d_bseeds; // array of block-seeds for device
// initialize (non-parallelized) RNG
init_genrand(RAND_SEED); // init genrand_int32()
// load h_bseeds[] with randoms
for(Uint i = 0; i < N_BLOCKS; i++){
h_bseeds[i] = genrand_int32();
}
// allocate d_bseeds on device
CUDA_SAFE_CALL( cudaMalloc((void **)&d_bseeds, BLK_SEED_ARRAY_SZ) );

// load d_bseeds with values in h_bseeds
CUDA_SAFE_CALL(cudaMemcpy(d_bseeds, h_bseeds, BLK_SEED_ARRAY_SZ, cudaMemcpyHostToDevice));

// return a pointer to the (device) array of block-seeds, one per CUDA-block
return d_bseeds;
}

/*-----------------------------------------------------------
Kernel based on shared memory and no more than 227 threads. */
__global__ void
shared_mem_227_kern( const Uint *block_seeds )
{
// initialize MT state for this CUDA-Block using a seed from block_seeds
mt19937si( block_seeds[blockIdx.x] ); // shared mem initializer

// get random 32-bit ints
Uint random1 = mt19937s();    // version limited to 227 threads
Uint random2 = mt19937s();    // get another, etc.

/* ... your kernel-code here ... */

}

/*-----------------------------------------------------------
Kernel based on shared memory and unlimited numbers of threads */
__global__ void
shared_mem_unlimited_kern( const Uint *block_seeds )
{
// initialize MT state for CUDA-Block
mt19937si( block_seeds[blockIdx.x] ); // shared mem initializer

// get random 32-bit ints
Uint random1 = mt19937sl();    // version with unlimited threads
Uint random2 = mt19937sl();    // get another, etc.

/* ... your kernel-code here ... */

}

/*-----------------------------------------------------------
Kernel based on global memory and no more than 227 threads. */
__global__ void
global_mem_227_kern( const Uint *block_seeds )
{
// initialize MT state for CUDA-Block
mt19937gi( block_seeds[blockIdx.x] ); // global mem initializer

// get random 32-bit ints
Uint random1 = mt19937g();    // version limited to 227 threads
Uint random2 = mt19937g();    // get another, etc.

/* ... your kernel-code here ... */

}

/*-----------------------------------------------------------
Kernel based on global memory and unlimited numbers of threads */
__global__ void
global_mem_unlimited_kern( const Uint *block_seeds )
{
// initialize MT state for CUDA-Block
mt19937gi( block_seeds[blockIdx.x] ); // global mem initializer

// get random 32-bit ints
Uint random1 = mt19937gl();    // version with unlimited threads
Uint random2 = mt19937gl();    // get another, etc.

/* ... your kernel-code here ... */

}

int
main()
{
Uint *d_bseeds = create_block_seed_array();

const dim3 n_blocks(N_BLOCKS);
const dim3 n_threads_per_block(N_THREADS_PER_BLOCK);

// Example kernel-calls
shared_mem_unlimited_kern<<< n_blocks, n_threads_per_block >>> (d_bseeds);
shared_mem_227_kern <<< n_blocks, n_threads_per_block >>> (d_bseeds);
global_mem_unlimited_kern<<< n_blocks, n_threads_per_block >>> (d_bseeds);
global_mem_227_kern <<< n_blocks, n_threads_per_block >>> (d_bseeds);

CUDA_SAFE_CALL( cudaThreadSynchronize() ); // kernel launches are asynchronous
cudaFree(d_bseeds);
return 0;
}
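
For concreteness, here is one way the "your kernel-code here" placeholder might be filled in. This kernel is my own sketch, not part of Eric's code: the name example_uniform_float_kern and the output array d_out are my additions, and d_out would need to be allocated with N_BLOCKS * N_THREADS_PER_BLOCK floats by the caller. Each thread converts one random int to a uniform float in [0, 1) and stores it.

/* Hypothetical example kernel (my own sketch): each thread draws
   one random int and stores a uniform float in [0, 1). */
__global__ void
example_uniform_float_kern( const Uint *block_seeds, float *d_out )
{
    // initialize MT state for this CUDA-block, as in the kernels above
    mt19937si( block_seeds[blockIdx.x] );

    // get a random 32-bit int (unlimited-thread variant)
    Uint random1 = mt19937sl();

    // keep the top 24 bits so the quotient is exact in a float;
    // the result lies in [0, 1)
    float u = (random1 >> 8) * (1.0f / 16777216.0f);

    d_out[blockIdx.x * blockDim.x + threadIdx.x] = u;
}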

The version of the Mersenne Twister discussed here is based on pseudorandom seeds for each thread, and is simpler to use than the one presented in the CUDA SDK, where files must be generated by spawnTwister.c and then read in as part of the initialization. What is the downside of doing this? How does the quality of the pseudorandom numbers compare with that of the SDK version?
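
In the meantime, a cheap host-side sanity check (my own sketch; no substitute for a proper statistical test suite such as Dieharder) is to copy the output of the example kernel above back to the host and confirm that the mean is near 0.5, as expected for uniform floats on [0, 1):

/* Hypothetical sanity check (my own sketch): assumes d_out was
   filled with n uniform floats in [0, 1) by a kernel like the
   example above. */
void check_mean(const float *d_out, Uint n)
{
    float *h = (float *)malloc(n * sizeof(float));
    CUDA_SAFE_CALL( cudaMemcpy(h, d_out, n * sizeof(float),
                               cudaMemcpyDeviceToHost) );
    double sum = 0.0;
    for (Uint i = 0; i < n; i++)
        sum += h[i];
    printf("mean = %f (expect about 0.5)\n", sum / n);
    free(h);
}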

–zaz

Good post. One correction, though: a pseudorandom seed is generated for every block, not for every thread.
I'm surprised there isn't a clean API in CUDA for generating random numbers (I'm thinking of rand() and randn() in MATLAB).
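
For what it's worth, once every thread can draw uniform 32-bit ints, rand()- and randn()-style device helpers are only a few lines. Below is a rough sketch (my own code, built on the mt19937sl() generator from Eric's file; not an official CUDA API), using the Box-Muller transform for the normal deviates:

/* Hypothetical device helpers in the spirit of rand()/randn();
   my own sketch, not an official CUDA API. */

// uniform float in (0, 1]; staying away from 0 avoids log(0) below
__device__ float rand_like()
{
    return ((mt19937sl() >> 8) + 1.0f) * (1.0f / 16777216.0f);
}

// standard normal deviate via the Box-Muller transform
__device__ float randn_like()
{
    float u1 = rand_like();  // in (0, 1]
    float u2 = rand_like();
    return sqrtf(-2.0f * logf(u1)) * cosf(2.0f * 3.14159265f * u2);
}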