There are no random number generator intrinsics or CUDA library functions. If you need a random number generator inside kernels, you will have to implement one in kernel code yourself. The SDK contains at least one random number generator example you might want to look at (Mersenne Twister).
You should also carefully consider the issues with random numbers in a massively parallel algorithm:
If the code you typed in the first post had actually worked, it would be quite likely that large groups of threads would get the same initial seed, and therefore use an identical random number sequence.
If you try to fix this problem by seeding each thread differently, you can still run into trouble. Not all random number generators give independent sequences when initialized with different seeds.
But yeah, take a look at the Mersenne Twister example, but be careful if you try to adapt another RNG to run in CUDA.
If your application does not require rigorous random numbers, you can do what I do. I create a device array to hold as many random numbers as I want in a table, NUMRANDS. I typically set up 16K or 64K elements. Then during program initialization, I create the random numbers by having the host code call a kernel serially for each element in the array, via a randkernel<<<1,1>>>((float)rand()/RAND_MAX) for a table of ~random floats. It may look strange, but I’ve gotten fairly good use out of such serial calls during initialization only. You wouldn’t want to do this during actual execution of the program in parallel mode.
So, then in the parallel kernel calls, if I need a random number, I just grab one out of the array of rand() values. The key is how to index into the array in a way that is at least pseudo-random. I’ve tried a variety of ways of creating this index at the kernel level. What I have settled on for a two-dimensional kernel is:
int x=blockIdx.x*blockDim.x + threadIdx.x;
int y=blockIdx.y*blockDim.y + threadIdx.y;
int randindex=(y*gridwidth + x + framerand + framenumber)%NUMRANDS;
I do create a framerand each frame or generation via a call:
KernelCudaSetFrameRand<<<1,1>>>((float)rand()/RAND_MAX); // which does a multiplication of the argument with NUMRANDS to scale to table size
These <<<1,1>>> kernel calls could be replaced with cudaMemcpy() calls between host and device memory, but sometimes I have trouble with cudaGetSymbolAddress() on anything other than simple device objects, so I avoid this problem by just creating a kernel function that can set device memory serially.
The whole idea of all of this is that the random number table is cycled through by randomizing the base index once per frame and incrementing it each frame, so that the indexing into the random number array is as non-preferential as possible.
Like I said, this technique works for applications where you want to add variability but are not too concerned about the rigor with which the random numbers were generated. If you were that concerned, you would not use rand() to generate random floats, but rather the Mersenne Twister or something more rigorous than rand().
Anyway, this has worked well for me in neural simulations where I want to add pseudo-random variability.
__device__ void k_srand(unsigned int seed)
{
    next = seed;
}

__device__ void k_randomize(void)
{
    next = (unsigned long)
        sqrt(
            (float)((blockIdx.x*blockDim.x + threadIdx.x)*(blockIdx.x*blockDim.x + threadIdx.x) +
                    (blockIdx.y*blockDim.y + threadIdx.y)*(blockIdx.y*blockDim.y + threadIdx.y))
            // + something pseudorandom from a GPU statistical data register, or just remove this term
        );
}
So, you are proposing a congruential random number generator? That is indeed a way to generate random numbers, but in this case the k_srand() function is not thread safe. If multiple kernel threads call this function, there will be collisions when they all try to read and write the ‘next’ variable. If not, I’m missing something here. This might work if you update ‘next’ via atomic functions.
Yes, in most of my programming over the past many years, I have used linear congruential pseudo-random number generator functions any time I needed something better than rand(). For most casual random number generation, rand() is fine once scaled to a float. A few weeks ago I considered implementing my linear congruential random number generator algorithm for CUDA kernel parallel use, but decided that the random tables were good enough for adding noise to my neural net simulations. The advantage of the array of pre-generated random numbers is that it is inherently thread safe, and there is no penalty to be paid for having to use atomics. Of course there must be some penalty in having to load the device random array, and possibly uncoalesced memory accesses, but it seems to work very well for me, particularly since I am using Tesla C1060 boards and have lots of device memory.
It would be interesting to see a full blown linear congruential random number generator that was thread safe and fast as that seems to be a lot better option than doing a Mersenne Twister implementation, which is overkill for a lot of things.
I use a variant of the Combined Tausworthe Generator, where I partially share states across a warp. This allows a fairly high quality RNG (better than LCG anyway) while still taking only 32 words of shared memory per block. If you’re worried about the occasional collision between different warps in the block ending up with the same random numbers, then you’ll need to either use atomics, or else just have 1 word per thread.
Here’s the code:
__constant__ unsigned int shift1[4] = {6, 2, 13, 3};
__constant__ unsigned int shift2[4] = {13, 27, 21, 12};
__constant__ unsigned int shift3[4] = {18, 2, 7, 13};
__constant__ unsigned int offset[4] = {4294967294, 4294967288, 4294967280, 4294967168};
__shared__ unsigned int randStates[32];
__device__ unsigned int TausStep(unsigned int &z, int S1, int S2, int S3, unsigned int M)
{
    unsigned int b = (((z << S1) ^ z) >> S2);
    return z = (((z & M) << S3) ^ b);
}
__device__ unsigned int randInt()
{
    TausStep(randStates[threadIdx.x&31], shift1[threadIdx.x&3], shift2[threadIdx.x&3], shift3[threadIdx.x&3], offset[threadIdx.x&3]);
    return (randStates[(threadIdx.x)&31] ^ randStates[(threadIdx.x+1)&31] ^ randStates[(threadIdx.x+2)&31] ^ randStates[(threadIdx.x+3)&31]);
}
The biggest problem with this generator is that the compiler wants to assign it a rather large number of registers, likely for all the shift coefficients. Ideally, it would use the constant memory cache for this…
Would it be possible to make the shift table constant volatile?
Yes, indeed a linear congruential PRNG is good only for small data sets, and it has only one advantage: it is fast.
What speed are you expecting from the PRNG?
And what else does your application expect from the PRNG?
A somewhat crazy thought: maybe memory collisions on the next variable would not be so bad for a PRNG implementation?
They could make the variable more unpredictable and give a good distribution over the data set.
Of course it is a dirty programming practice and would be unportable to other platforms, and only working code can show how well it behaves.
I think using next as an array doesn’t need big memory resources; at minimum it could be next[warp_size] (32 on modern CUDA) and at most next[cores_on_GPU].
Bill
Dr. W. B. Langdon,
Department of Computer Science,
King's College London,
Strand, London, WC2R 2LS, UK
http://www.dcs.kcl.ac.uk/staff/W.Langdon/
Thanks for all this help guys. I’ll try this code (since it seems simpler to me, and I’m just starting in the CUDA world) and see if it goes OK for what I need. I’ll also take a look at the Mersenne Twister example in the CUDA SDK (though I really think it’s overkill as well).
Hello, well, I’ve tried the Combined Tausworthe Generator above, and it works fairly well (it sometimes gives me the same numbers, but it’s OK for what I need). I’ve made this kernel:
__global__ void createRand(int *d_out, int max)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    if (idx < max)
        d_out[idx] = randInt();
}
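One caveat worth flagging: randStates is never initialized in that kernel, so each launch starts from whatever happens to be in shared memory, which may explain seeing repeated numbers. A hedged sketch of seeding it first (the extra seed parameter and the odd stride constant are my own arbitrary choices, not from the posts above):

```cuda
__global__ void createRand(int *d_out, int max, unsigned int seed)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;

    // Give each of the 32 shared state words a distinct nonzero value
    // before anyone calls randInt(); 2654435769u is just an arbitrary odd stride.
    if (threadIdx.x < 32)
        randStates[threadIdx.x] = seed + threadIdx.x * 2654435769u;
    __syncthreads();

    if (idx < max)
        d_out[idx] = randInt();
}
```

Passing a different seed (e.g. from the host's rand()) per launch keeps successive launches from repeating the same sequence.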
When I originally mentioned memory collisions, I was talking about multithreaded writes and reads to the same memory location. Your code is not shown at the kernel level, so I was wrong to say that it is not thread safe. It might or might not be thread safe depending on what you do in the kernel, particularly if you need to generate a pseudo-random number in each kernel thread. As you and others have alluded to, there are several approaches to avoiding this problem: one is that everything in the PRNG call is local to the thread; another is an array as large as the number of threads, so that each thread has its own unique memory locations to write and read PRNG sequences to and from. And there are other techniques as well.
I am not sure what you mean by “data set quality”.
All the techniques we are talking about are pseudo random.
There is a wonderful quote from von Neumann
Anyone who considers arithmetical methods of producing
random digits is, of course, in a state of sin.
Just to say I have just recompiled with CUDA 3.1 on 64bit Centos Linux for a GeForce 295 GTX.
The double precision (default) version is still miles faster than the 64 bit integer version.
(Presumably because double precision multiply is still much faster than the 64 bit integer remainder (%) modulus.)