Lev
August 27, 2010, 11:52am
21
Just to say I have recompiled with CUDA 3.1 on 64-bit CentOS Linux for a GeForce GTX 295.
The double-precision (default) version is still far faster than the 64-bit integer version.
(Presumably because a double-precision multiply is still much faster than a 64-bit integer remainder, i.e. the % modulus operation.)
Bill
ps: now at
Department of Computer Science
University College London
Gower Street, London WC1E 6BT, UK
<a target='_blank' rel='noopener noreferrer' href='http://www.cs.ucl.ac.uk/staff/W.Langdon/'>http://www.cs.ucl.ac.uk/staff/W.Langdon/</a>
Can you please explain how you generate random seeds for each thread? Do you use the thread number as the seed?
Hello. I’ve tried the Combined Tausworthe Generator above and it works fairly well (it sometimes gives me the same numbers, but that’s OK for what I need). I’ve written this kernel:
__global__ void createRand(int *d_out, int max)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    if(idx < max)
        d_out[idx] = randInt();
}
and this is the main function:
int main(int argc, char *argv[])
{
    const int MAX = 128*1024;
    int *d_a, *h_a;
    size_t sizemem = sizeof(int)*MAX;
    cudaError_t d_error = cudaSuccess;
    if((h_a = (int*)malloc(sizemem)) == NULL)
    {
        perror("malloc");
        return 1;
    }
    if((d_error = cudaMalloc((void**)&d_a, sizemem)) != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(d_error));
        return 1;
    }
    d_error = cudaMemset(d_a, 0, sizemem);
    if(d_error != cudaSuccess)
    {
        fprintf(stderr, "cudaMemset: %s\n", cudaGetErrorString(d_error));
        return 1;
    }
    dim3 blockSize(512);
    dim3 blockQtd(128);
    createRand<<<blockQtd,blockSize>>>(d_a, MAX);
    cudaThreadSynchronize();
    cudaMemcpy(h_a, d_a, sizemem, cudaMemcpyDeviceToHost);
    for(int i = 0; i < MAX; i++)
        printf("%d: %d\n", i, h_a[i]);
    free(h_a);
    cudaFree(d_a);
    return 0;
}
However, what’s strange is that after (around) the 65,529th number, everything is zero :o
Is something wrong with the way I’m calling it?
Thanks.
The number of threads you launch (128 × 512 = 65,536) is half the number of random values you want (128 × 1024 = 131,072).
(answer is a bit late…)
Lev
August 27, 2010, 6:00pm
28
Using the Park–Miller RNG this way is not so good; it can produce low-quality random numbers. This technique should be avoided in serious simulations. Plain linear congruential generators are not good either.
Lev
August 27, 2010, 6:02pm
30
It is a good paper, though their best approach, one-RNG-for-all-threads, is not very fast.
Mr.What
November 26, 2011, 7:58pm
32
Hello, I’m just starting out with CUDA and maybe this is a dumb question; still, I haven’t found a good answer for it.
I know I can’t call system functions inside a kernel, but is there a way to generate a pseudo-random number inside a kernel, with something like this?
srand(time(NULL));
rand();
thanks a lot.
We should be able to do this. For some reason it does not seem to be clearly addressed in either the CURAND documentation or the Mersenne Twister documentation. I haven’t found the answer yet, but one would think it should be possible to set up one random-number-generator state vector for each of your CUDA cores. Then, if “rand()” accesses its local generator state, we might be able to avoid races and get sufficiently independent random draws. Let us know if anybody has done this yet.
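For what it’s worth, cuRAND’s device API does support this pattern: one generator state per thread, initialized once and reused across kernel launches. A sketch (the kernel names are my own; curandState, curand_init, and curand_uniform are real cuRAND device-API symbols):

```cuda
#include <curand_kernel.h>

// One generator state per thread, initialized once. Using the same seed
// with a different sequence number per thread gives statistically
// independent streams without races.
__global__ void initStates(curandState *states, unsigned long long seed, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n)
        curand_init(seed, id, 0, &states[id]);
}

__global__ void drawUniform(curandState *states, float *out, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) {
        curandState local = states[id];   // copy state to registers
        out[id] = curand_uniform(&local); // uniform draw in (0, 1]
        states[id] = local;               // save state for the next call
    }
}
```

Note the state array costs memory per thread, which is why (as the next reply points out) it is usually sized to the number of resident threads and combined with a grid-stride loop, rather than one state per logical work item.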
It is NOT reasonable to generate a state vector for every thread, if you have MILLIONS of threads, which is not uncommon.