More on CUDA Random Numbers.

I want to use a random number generator (RNG) without (initially) having to become too expert about it. Mersenne Twisters seem to be well established, though there are some subtleties with parallel implementations like CUDA. MT19937 is a widely used Mersenne Twister, and Eric Mills (osiris) has posted a CUDA version to this forum.
See

http://forums.nvidia.com/index.php?showtopic=31159&st=20&p=217039&#entry217039

I have not seen a “user's manual” for Eric's code, so I have attempted an annotated example below. It is meant to generate independent, uniformly distributed random integers on the interval [0, 2^32 - 1] for each thread in a CUDA-grid. The threads are grouped in CUDA-blocks, of course, but each distinct thread in the entire grid should (I hope) receive independent random numbers.

The code is otherwise untested, but it compiles and runs on a GTX 285. See the comments within the example code for more details.

My example is not meant to be fast or used for speed tests; it is meant to be correct. It is based on my reading of Eric's test code. I have deliberately used “strange” values for the number of blocks and the number of threads per block. I am mostly interested in whether the code is correct and will generate good-quality random numbers.

Please comment on:

  1. The correctness of my use of Eric's routines.
  2. The correctness and efficiency of Eric's implementation within the CUDA environment.
  3. Whether this form of parallelism is adequate, or whether some form of “spawnTwisters.c”, as used in the CUDA SDK (MersenneTwister) example, is needed.

Thanks!

-------- CUDA Code follows --------------------
/*
Uniform random unsigned ints on the interval [0, 2^32 - 1].

This file contains an annotated example of the use
of Eric Mills’ mt19937 Mersenne Twister for CUDA,
which generates uniform random 32-bit unsigned ints
for CUDA-threads executed in CUDA-blocks.

Eric’s implementation can be copied from:

http://forums.nvidia.com/index.php?showtopic=31159&st=20&p=217039&#entry217039

Four example kernels are exhibited:
shared_mem_227_kern       - uses shared memory; up to 227 threads
shared_mem_unlimited_kern - uses shared memory; unlimited number of threads
global_mem_227_kern       - uses global memory; up to 227 threads
global_mem_unlimited_kern - uses global memory; unlimited number of threads

To modify this module for different applications,
choose one of the above kernels, and
search for “your kernel-code here”.

The code works by creating an array of randoms using
a (non-parallelized) Mersenne Twister named genrand_int32().
The array name is d_bseeds and it is created by calling
create_block_seed_array(). Each element of d_bseeds
corresponds to a CUDA-block, and is used when the block
starts running to generate an individual seed for each
CUDA-thread within the block by calling either
mt19937si() or mt19937gi().

This scheme does have the possibility of collisions
between the numbers generated in distinct CUDA-threads,
but my understanding is that collisions are unlikely
because of the large state space used in
Mersenne Twisters. I believe the Mersenne Twister
example currently (May 2009) in the SDK uses
spawnTwister to generate distinct Mersenne Twisters
with even less chance of collision.
This paragraph is based on my own speculation
and may be incorrect. (If you know better, please
let me know!)

For simplicity I have just included two of Eric's
files, mt19937_ref.cu and mt19937_gold.cpp,
into this code using #include. See below.

Compile using:

/usr/local/cuda/bin/nvcc -I. -I/usr/local/cuda/include \
    -I/home/kielhorn/NVIDIA_CUDA_SDK/common/inc \
    -L/lib random1.cu -o random1
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cutil.h>

#include "mt19937_ref.cu" // from Eric Mills' (osiris) implementation
/* Defines device functions:
mt19937si() // initializer
mt19937s()
mt19937sl()
mt19937gi() // initializer
mt19937g()
mt19937gl()
*/

#include "mt19937_gold.cpp" // from Eric Mills' (osiris) implementation
/* Defines functions:
init_genrand()
genrand_int32()
*/
typedef unsigned int Uint;

// parameters
const Uint N_BLOCKS = 500; // Number of blocks in CUDA grid
const Uint N_THREADS_PER_BLOCK = 50; // Number of threads in each CUDA block
const Uint RAND_SEED = 37; /* 37 is an arbitrary seed. Any number could be used.
You could use time(NULL). */

// computed constants
const Uint BLK_SEED_ARRAY_SZ = N_BLOCKS * sizeof(Uint); // seed array size

Uint*
create_block_seed_array()
{
// Load h_bseeds with random unsigneds, and copy to (device) d_bseeds
// Return a pointer to the device array.

Uint h_bseeds[N_BLOCKS]; // array of block-seeds for host
Uint *d_bseeds; // array of block-seeds for device
// initialize (non-parallelized) RNG
init_genrand(RAND_SEED); // init genrand_int32()
// load h_bseeds[] with randoms
for(Uint i = 0; i < N_BLOCKS; i++){
h_bseeds[i] = genrand_int32();
}
// allocate d_bseeds on device
CUDA_SAFE_CALL( cudaMalloc((void **)&d_bseeds, BLK_SEED_ARRAY_SZ) );

// load d_bseeds with values in h_bseeds
CUDA_SAFE_CALL(cudaMemcpy(d_bseeds, h_bseeds, BLK_SEED_ARRAY_SZ, cudaMemcpyHostToDevice));

// return a pointer to the (device) array of block-seeds, one per CUDA-block
return d_bseeds;
}

/*-----------------------------------------------------------
Kernel based on shared memory and no more than 227 threads. */
__global__ void
shared_mem_227_kern( const Uint *block_seeds )
{
// initialize MT state for this CUDA-Block using a seed from block_seeds
mt19937si( block_seeds[blockIdx.x] ); // shared mem initializer

// get random 32-bit ints
Uint random1 = mt19937s();    // version limited to 227 threads
Uint random2 = mt19937s();    // get another, etc.

/* ... your kernel-code here ... */

}

/*-----------------------------------------------------------
Kernel based on shared memory and unlimited numbers of threads */
__global__ void
shared_mem_unlimited_kern( const Uint *block_seeds )
{
// initialize MT state for CUDA-Block
mt19937si( block_seeds[blockIdx.x] ); // shared mem initializer

// get random 32-bit ints
Uint random1 = mt19937sl();    // version with unlimited threads
Uint random2 = mt19937sl();    // get another, etc.

/* ... your kernel-code here ... */

}

/*-----------------------------------------------------------
Kernel based on global memory and no more than 227 threads. */
__global__ void
global_mem_227_kern( const Uint *block_seeds )
{
// initialize MT state for CUDA-Block
mt19937gi( block_seeds[blockIdx.x] ); // global mem initializer

// get random 32-bit ints
Uint random1 = mt19937g();    // version limited to 227 threads
Uint random2 = mt19937g();    // get another, etc.

/* ... your kernel-code here ... */

}

/*-----------------------------------------------------------
Kernel based on global memory and unlimited numbers of threads */
__global__ void
global_mem_unlimited_kern( const Uint *block_seeds )
{
// initialize MT state for CUDA-Block
mt19937gi( block_seeds[blockIdx.x] ); // global mem initializer

// get random 32-bit ints
Uint random1 = mt19937gl();    // version with unlimited threads
Uint random2 = mt19937gl();    // get another, etc.

/* ... your kernel-code here ... */

}

int
main()
{
Uint *d_bseeds = create_block_seed_array();

const dim3 n_blocks(N_BLOCKS);
const dim3 n_threads_per_block(N_THREADS_PER_BLOCK);

// Example kernel-calls
shared_mem_unlimited_kern<<< n_blocks, n_threads_per_block >>> (d_bseeds);
shared_mem_227_kern <<< n_blocks, n_threads_per_block >>> (d_bseeds);
global_mem_unlimited_kern<<< n_blocks, n_threads_per_block >>> (d_bseeds);
global_mem_227_kern <<< n_blocks, n_threads_per_block >>> (d_bseeds);

CUDA_SAFE_CALL( cudaThreadSynchronize() ); // kernel launches are asynchronous
cudaFree(d_bseeds);
return 0;
}
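
For concreteness, here is one way the "your kernel-code here" placeholder might be filled in. This kernel is my own sketch, not part of Eric's code: the name example_uniform_float_kern and the output array d_out are my additions, and d_out would need to be allocated with N_BLOCKS * N_THREADS_PER_BLOCK floats by the caller. Each thread converts one random int to a uniform float in [0, 1) and stores it.

/* Hypothetical example kernel (my own sketch): each thread draws
   one random int and stores a uniform float in [0, 1). */
__global__ void
example_uniform_float_kern( const Uint *block_seeds, float *d_out )
{
    // initialize MT state for this CUDA-block, as in the kernels above
    mt19937si( block_seeds[blockIdx.x] );

    // get a random 32-bit int (unlimited-thread variant)
    Uint random1 = mt19937sl();

    // keep the top 24 bits so the quotient is exact in a float;
    // the result lies in [0, 1)
    float u = (random1 >> 8) * (1.0f / 16777216.0f);

    d_out[blockIdx.x * blockDim.x + threadIdx.x] = u;
}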

The version of the Mersenne Twister discussed here is based on pseudorandom seeds for each thread, and is simpler to use than the one presented in the CUDA SDK, where files must be generated by spawnTwister.c and then read in as part of the initialization. What is the downside of doing this? How does the quality of the pseudorandom numbers compare with that of the SDK version?
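
In the meantime, a cheap host-side sanity check (my own sketch; no substitute for a proper statistical test suite such as Dieharder) is to copy the output of the example kernel above back to the host and confirm that the mean is near 0.5, as expected for uniform floats on [0, 1):

/* Hypothetical sanity check (my own sketch): assumes d_out was
   filled with n uniform floats in [0, 1) by a kernel like the
   example above. */
void check_mean(const float *d_out, Uint n)
{
    float *h = (float *)malloc(n * sizeof(float));
    CUDA_SAFE_CALL( cudaMemcpy(h, d_out, n * sizeof(float),
                               cudaMemcpyDeviceToHost) );
    double sum = 0.0;
    for (Uint i = 0; i < n; i++)
        sum += h[i];
    printf("mean = %f (expect about 0.5)\n", sum / n);
    free(h);
}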

–zaz

Good post. One correction, though: a pseudorandom seed is generated for every block, not for every thread.
I'm surprised there isn't a clean API in CUDA for generating random numbers (I'm thinking of rand() and randn() in MATLAB).
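
For what it's worth, once every thread can draw uniform 32-bit ints, rand()- and randn()-style device helpers are only a few lines. Below is a rough sketch (my own code, built on the mt19937sl() generator from Eric's file; not an official CUDA API), using the Box-Muller transform for the normal deviates:

/* Hypothetical device helpers in the spirit of rand()/randn();
   my own sketch, not an official CUDA API. */

// uniform float in (0, 1]; staying away from 0 avoids log(0) below
__device__ float rand_like()
{
    return ((mt19937sl() >> 8) + 1.0f) * (1.0f / 16777216.0f);
}

// standard normal deviate via the Box-Muller transform
__device__ float randn_like()
{
    float u1 = rand_like();  // in (0, 1]
    float u2 = rand_like();
    return sqrtf(-2.0f * logf(u1)) * cosf(2.0f * 3.14159265f * u2);
}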