Lev
August 27, 2010, 11:52am
21
Just to say I have recompiled with CUDA 3.1 on 64-bit CentOS Linux for a GeForce GTX 295.
The double-precision (default) version is still far faster than the 64-bit integer version.
(Presumably because a double-precision multiply is still much faster than a 64-bit integer remainder, i.e. the % modulus operation.)
Bill
ps: now at
Department of Computer Science
University College London
Gower Street, London WC1E 6BT, UK
<a target='_blank' rel='noopener noreferrer' href='http://www.cs.ucl.ac.uk/staff/W.Langdon/'>http://www.cs.ucl.ac.uk/staff/W.Langdon/</a>
Can you please explain how you generate random seeds for each thread? Do you use the thread number as the seed?
Hello. I’ve tried the Combined Tausworthe Generator above and it works fairly well (it sometimes gives me the same numbers, but that’s OK for what I need). I’ve written this kernel:
__global__ void createRand(int *d_out, int max)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    if(idx < max)
        d_out[idx] = randInt();
}
and this is the main function:
int main(int argc, char *argv[])
{
    const int MAX = 128*1024;
    int *d_a, *h_a;
    size_t sizemem = sizeof(int)*MAX;
    cudaError_t d_error = cudaSuccess;
    if((h_a = (int*)malloc(sizemem)) == NULL)
    {
        perror("malloc");
        return 1;
    }
    if((d_error = cudaMalloc((void**)&d_a, sizemem)) != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(d_error));
        return 1;
    }
    d_error = cudaMemset(d_a, 0, sizemem);
    if(d_error != cudaSuccess)
    {
        fprintf(stderr, "cudaMemset: %s\n", cudaGetErrorString(d_error));
        return 1;
    }
    dim3 blockSize(512);
    dim3 blockQtd(128);
    createRand<<<blockQtd,blockSize>>>(d_a, MAX);
    cudaThreadSynchronize();
    cudaMemcpy(h_a, d_a, sizemem, cudaMemcpyDeviceToHost);
    for(int i = 0; i < MAX; i++)
        printf("%d: %d\n", i, h_a[i]);
    free(h_a);
    cudaFree(d_a);
    return 0;
}
However, what’s strange is that after (around) the 65,529th number, everything is zero :o
Is something wrong with the way I’m calling it?
Thanks.
The number of threads you launch (128 × 512 = 65,536) is half the number of random values you want (128 × 1024 = 131,072).
(answer is a bit late…)
Lev
August 27, 2010, 6:00pm
28
Using the Park–Miller RNG this way is not so good; it can produce low-quality random numbers. This technique should be avoided in serious simulations. Plain linear congruential generators are not good either.
Lev
August 27, 2010, 6:02pm
30
It is a good paper, though their best approach, one-RNG-for-all-threads, is not very fast.
Mr.What
November 26, 2011, 7:58pm
32
Hello, I’m just starting out with CUDA and maybe this is a dumb question; still, I haven’t found a good answer for it.
I know I can’t call system functions inside a kernel, but is there a way to generate a pseudo-random number inside a kernel, with something like this?
srand(time(NULL));
rand();
thanks a lot.
We should be able to do this. For some reason it does not seem to be clearly addressed in either the CURAND documentation or the Mersenne Twister documentation. I haven’t found the answer yet, but one would think it should be possible to set up one random-number-generator state vector for each of your CUDA cores. Then, if “rand()” accesses its local generator state, we might be able to avoid races and get sufficiently independent random draws. Let us know if anybody has done this yet.
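For what it’s worth, cuRAND’s device API does support this pattern: one generator state per thread, initialized once and reused across kernel launches. A sketch (the kernel names are my own; curandState, curand_init, and curand_uniform are real cuRAND device-API symbols):

```cuda
#include <curand_kernel.h>

// One generator state per thread, initialized once. Using the same seed
// with a different sequence number per thread gives statistically
// independent streams without races.
__global__ void initStates(curandState *states, unsigned long long seed, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n)
        curand_init(seed, id, 0, &states[id]);
}

__global__ void drawUniform(curandState *states, float *out, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) {
        curandState local = states[id];   // copy state to registers
        out[id] = curand_uniform(&local); // uniform draw in (0, 1]
        states[id] = local;               // save state for the next call
    }
}
```

Note the state array costs memory per thread, which is why (as the next reply points out) it is usually sized to the number of resident threads and combined with a grid-stride loop, rather than one state per logical work item.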
It is NOT reasonable to generate a state vector for every thread, if you have MILLIONS of threads, which is not uncommon.