random numbers inside the Kernel

Hello, I’m starting now with CUDA and maybe this is a dumb question, still, I haven’t found a good answer for it.

I know I can’t call system functions inside the Kernel, but is there a way to make a pseudo-random number inside a kernel, with something like this?



thanks a lot.

There are no random number generator intrinsics or CUDA library functions. If you need a random number generator inside kernels, you are going to have to use kernel code for it. The SDK contains at least one random number generator you might want to look at (Mersenne Twister).

Hum, I was afraid of that.

It’s impossible to run them on a device function as well right?

You should also carefully consider the issues with random numbers in a massively parallel algorithm:

  • If the code you typed in the first post had actually worked, it would be quite likely that large groups of threads would get the same initial seed, and therefore use an identical random number sequence.

  • If you try to fix this problem by seeding each thread differently, you can still run into trouble. Not all random number generators give independent sequences when initialized with different seeds.

But yeah, take a look at the Mersenne Twister example, but be careful if you try to adapt another RNG to run in CUDA.

If your application does not require rigorous random numbers, you can do what I do. I create a device array to hold as many random numbers as I want to be in a table, NUMRANDS. I typically setup 16K or 64K elements. Then during program initialization, I create the random numbers by having the host code call a kernel serially for each element in the array, via a randkernel<<<1,1>>>(rand()/RAND_MAX) for a table of ~random floats. It may look strange, but I’ve gotten fairly good use out of such serial calls during initializations only. You wouldn’t want to do such during actual execution of the program in parallel mode.

So, then in the parallel kernel calls, if I need a random number, I just grab one out of the array of rand() values. The key is how to index into the array in a way that results in at least a pseudo random manner. I’ve tried a variety of ways of creating this index at the kernel level. What I have settled on for a two dimensional kernel is:

int x=blockIdx.x*blockDim.x + threadIdx.x;

int y=blockIdx.y*blockDim.y + threadIdx.y;

int randindex=(y*gridwidth + x + framerand + framenumber)%NUMRANDS;

I do create a framerand each frame or generation via a call:

KernelCudaSetFrameRand<<<1,1>>>((float)rand()/RAND_MAX); // which does a multiplication of the argument with NUMRANDS to scale to table size


These <<<1,1>>> kernel calls could be replaced with cudaMemCpy() calls between host and device memory, but sometimes I have trouble with cudaGetSymbolAddress() on anything other than simple device objects, so I avoid this problem by just creating a kernel function that can set device memory serially.

The whole idea of all of this is that the random number table will be cycled through in a manner that tries to randomize the index once per frame and increments the index once per frame so that there is as much of a non-preferential indexing into the random number array as possible.

Like I said, this technique works for applications where you want to add variability but are not too concerned about the rigor with which the random numbers were generated. If you were that concerned, you would not use rand() to generate random floats, but rather the Mersenne Twister or something more convoluted and float based than rand().

Anyway, this has worked well for me in neural simulations where I want to add psuedo random variablity.

Ken Chaffin

it should be something in this style.
Sorry, but I can’t check it by compilation and execution right now.

device unsigned long int next = 1;

#define step_next (next = next * 1103515245 + 12345+
(unsigned long) sqrt( (threadIdx.x)(threadIdx.x)+(threadIdx.y)(threadIdx.y) ) );

device int k_rand(void)
return ((unsigned int) (next / 0x10000) & 0x7FFF);

device int k_random (int num) // negative arguments treated
// as unsigned

return ((long) (next / 0x10000) * (unsigned)num) >> 16;


device void k_srand(unsigned int seed)
next = seed;

device void k_randomize (void)
next = (unsigned long)
(blockIdx.xblockDim.x + threadIdx.x)(blockIdx.xblockDim.x + threadIdx.x)+
blockDim.y + threadIdx.y)(blockIdx.xblockDim.y + threadIdx.y)+ something_pseudorandom from GPU statistical data regiser // or just remove this line



So, you are proposing a congruential random number generator? That is indeed a way to generate random numbers, but in this case , the k_srand() function is not thread safe. If multiple kernel threads call this function, there will be collisions when they try to all read and write the ‘next’ variable. If not, I’m missing something here. This might work if you update ‘next’ via atomic functions.

Ken Chaffin

Sure, Ken. You are right.

Linear congruential PRG with some addition for generation unique number for each thread.

And of course next variable should be private for thread or use atomic operations.

I think first better for productivity then second.

Probably next can be an array next[max_threads] for preventing treads for collisions.


Yes, in most of my programming over the past many years, I have used linear congruential pseudo-random number generator functions any time I needed something better than rand(). For most casual random number generation, rand() is fine after scaled to a float. A few weeks ago I considered implementing my linear congruential random number generator algorithm for CUDA kernel parallel use, but decided that the random tables were good enough for adding noise to my neuron net simulations. The advantage of the array of pre-generated random numbers is that it is inherently thread safe, and there is no penalty to be paid by having to use atomics. Of course there must be some penalty with having to load the device random array and possible coalesced memory problems, but it seems to work very well for me, particularly since I am using Tesla C1060 boards and have lots of device memory.

It would be interesting to see a full blown linear congruential random number generator that was thread safe and fast as that seems to be a lot better option than doing a Mersenne Twister implementation, which is overkill for a lot of things.



I use a varient of the Combined Tausworthe Generator, where I partially share states across a warp. This allows a fairly high quality RNG (better than LCG anyway) while still taking only 32 words of shared memory per block. If you’re worried about the occasional collision between different warps in the block ending up with the same random numbers, then you’ll need to either use atomics, or else just have 1 word per thread.

Here’s the code:

__constant__ unsigned int shift1[4] = {6, 2, 13, 3};

__constant__ unsigned int shift2[4] = {13, 27, 21, 12};

__constant__ unsigned int shift3[4] = {18, 2, 7, 13};

__constant__ unsigned int offset[4] = {4294967294, 4294967288, 4294967280, 4294967168};

__shared__ unsigned int randStates[32];

__device__ unsigned int TausStep(unsigned int &z, int S1, int S2, int S3, unsigned int M)


	unsigned int b = (((z << S1) ^ z) >> S2);

	return z = (((z &M) << S3) ^ b);


__device__ unsigned int randInt()


	TausStep(randStates[threadIdx.x&31], shift1[threadIdx.x&3], shift2[threadIdx.x&3],shift3[threadIdx.x&3],offset[threadIdx.x&3]);

	return (randStates[(threadIdx.x)&31]^randStates[(threadIdx.x+1)&31]^randStates[(threadIdx.x+2)&31]^randStates[(threadIdx.x+3)&31]);


The biggest problem with this generator is that the compiler wants to assign it a rather large number of registers, likely for all the shift coefficients. Ideally, it would use the constant memory cache for this…

Would it be possible to make the shift table constant volatile?

Yes, indeed linear congruential PRNG good only for small data sets and have only one plus: it works fast.

What speed you are expecting form PRNG ?

And how about others expectation from PRNG for your application ?

A some crazy words: may be memory collisions for next variable will be not so bad for realization PRNG ?

It can made this variable more unpredictabe and can give good distibution of data set.

Offcource it dirty practic for programming and willbe unportabe for others platforms and how it will works can discover only working code.

I think using next as array don’t need a big memory resources at least it can be next[warp_size] (32 for modern CUDA) and as maximum as next[cores_on_GPU]

You might be interested in the discussion of random numbers in CUDA at http://forums.nvidia.com/index.php?showtop…t=0#entry956395

My CUDA kernel version of Park-Miller’s minimal standard linear congruential random number generator can be obtained via http://www.cs.ucl.ac.uk/staff/W.Langdon/WB…gdon:2009:CIGPU


    Dr. W. B. Langdon, 
    Department of Computer Science,
    King's College London,
    Strand, London, WC2R 2LS, UK

FOGA 2011 http://www.sigevo.org/foga-2011/
CIGPU 2010 http://www.cs.ucl.ac.uk/external/W.Langdon/cigpu
A Field Guide to Genetic Programming
RNAnet http://bioinformatics.essex.ac.uk/users/wlangdon/rnanet
GP EM http://www.springer.com/10710
GP Bibliography http://www.cs.bham.ac.uk/~wbl/biblio/

very impressive, William.

What about data set qualty of Park-Miller PRNG ?

Thanks for all this help guys, I’ll try this code (since it seems simpler to me, and I’m just starting in the CUDA world) and see if it goes OK for what I need, I’ll also take a look at the Mersene example in the CUDA SDK (tough I really think it’s overkill as well).

Once again, thanks ;)

Hello, well, I’ve tried the Combined Tausworthe Generator above, and it works fairly well (it gives me sometimes the same numbers, but it’s OK for what I need), I’ve made this kernel:

__global__ void createRand(int *d_out, int max)


  int idx = blockDim.x*blockIdx.x + threadIdx.x;

  if(idx < max)



and this is the main function:

int main(int argc, char *argv[])


  const int MAX = 128*1024;

  int *d_a,*h_a;

  size_t sizemem = sizeof(int)*MAX;

  cudaError_t d_error = cudaSuccess;




	return 1;




	fprintf(stderr,"cudaMalloc: %s\n",cudaGetErrorString(d_error));

	return 1;


d_error = cudaMemset(d_a,0,sizemem);



	fprintf(stderr,"cudaMemset: %s\n",cudaGetErrorString(d_error));

	return 1;


dim3 blockSize(512);

  dim3 blockQtd(128);




for(int i = 0; i < MAX;i++)

	printf("%d: %d\n",i,h_a[i]);



return 0;


However, what’s strange, is that after (around) the 65529th number, everything is zero :o

Is something wrong with the way I’m calling it?



When I originally mentioned memory collisions, I was talking about multithread writes and reads to the same memory location. Your code is not shown at the kernel level, so I was wrong to say that it is not thread safe. It might or might not be thread safe depending on what you do in the kernel. This is particularly true if one needs to generate a PRNG for each kernel thread. As you and others have alluded to, there are several approaches to avoiding this problem, one of which that everthing in the PRNG call in the thread is local to the thread. The other is if there is a an array as large as the number of threads so that each thread can have its own unique memory locations to write and read PRNG sequences to and from. And there are other techniques also.

Ken Chaffin

Dear Dmitry,
Thank you:-)

I am not sure what you mean by “data set qualty”.
All the techniques we are talking about are pseudo random.
There is a wonderful quote from von Neumann

Anyone who considers arithmetical methods of producing
random digits is, of course, in a state of sin.

Perhaps I should just say Park + Miller looked at all the fast
linear congruent PRNGs and recommended this one,
which now has their name, as being the best.
There experiments can be found via

ps: only a few weeks before the CIGPU deadline

Just to say I have just recompiled with CUDA 3.1 on 64bit Centos Linux for a GeForce 295 GTX.
The double precision (default) version is still miles faster than the 64 bit integer version.
(Presumably because double precision multiply is still much faster than 64 bit integer reminder % modulus.)


ps: now at
Department of Computer Science
University College London
Gower Street, London WC1E 6BT, UK

Just to say I have just recompiled with CUDA 3.1 on 64bit Centos Linux for a GeForce 295 GTX.
The double precision (default) version is still miles faster than the 64 bit integer version.
(Presumably because double precision multiply is still much faster than 64 bit integer reminder % modulus.)


ps: now at
Department of Computer Science
University College London
Gower Street, London WC1E 6BT, UK

Can you please tell how do you generate random seeds for each thread? Do you get thread number as rand seed?