CURAND low performance

Can anyone tell me why generating random numbers with CURAND is much slower (about 100 times) than MATLAB's randn? Or am I doing something wrong?
Thank you.
My code:
const size_t N2 = 512*512;
float *q;
cudaMalloc((void **)&q, sizeof(float)*N2);
curandGenerator_t gen;

curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);

for (int j = 0; j < 100; j++) {
  // the generator is re-seeded on every iteration
  curandSetPseudoRandomGeneratorSeed(gen, (unsigned long int)j*1000);
  curandGenerateNormal(gen, q, N2, 0.0, 1.0);
}

curandDestroyGenerator(gen);

Call curandSetPseudoRandomGeneratorSeed() only once; take it out of the loop.

I have a question about the performance of CURAND too.

I compared CURAND random number generation on the host with RANLUX, and RANLUX is much, much faster. Does anyone know why random number generation on the host is so expensive with CURAND?

Thanks!

Would it be possible to post a self-contained repro case?

Here is part of the code I used to compare the two RNGs:


//RANLUX:
float ran[2];
clock_t t1 = clock();
rlxs_init(1,12345);

for (int i=0 ; i<1000000 ; ++i)
  ranlxs(ran,2);

clock_t t2 = clock();
cout << "Done RANLUX test in " << (t2-t1) << endl;

//CURAND HOST
curandGenerator_t gen;
t1 = clock();

curandCreateGeneratorHost(&gen,CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(gen,12345);

for (int i=0 ; i<1000000 ; ++i)
  curandGenerateUniform(gen,ran,2);

curandDestroyGenerator(gen);

t2 = clock();
cout << "Done CURAND test in " << (t2-t1) << endl;


The output was:
Done RANLUX test in 120000
Done CURAND test in 20940000

As you can see, the difference is too large. To compile the code I used: nvcc -O3 -o test test.cpp ranlxs.o -lcurand

Maybe I am using the CURAND functions incorrectly.

Thanks!
ydm

The initial release of CURAND in CUDA 3.2 is optimized for throughput of generating lots of random numbers. To get the best performance you generally want to set up the generator once, then generate lots of random numbers in blocks that are as large as possible.

So platinor is right, you probably want to move the call to curandSetPseudoRandomGeneratorSeed() outside the loop. That would make CURAND only do the setup once.

Thanks for posting the code, ydelpi.

for (int i=0 ; i<1000000 ; ++i)

  curandGenerateUniform(gen,ran,2);

The above snippet of code does 1000000 calls that generate two random numbers each time. It will be much more efficient to do fewer calls that generate more random numbers each time. For example, you might try two calls that generate 1000000 random numbers each. In general you will get the best performance from CURAND by generating blocks of random numbers that are as large as possible.

If you get some good performance numbers please post them, I’d love to see them.
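When the calling code really does consume only a couple of numbers at a time, one way to follow the advice above is to put a buffer between the consumer and the generator. The sketch below is only an illustration: BufferedRng is a hypothetical helper, not part of CURAND, and the fill callback stands in for whatever produces a block of numbers (with a CURAND host generator it could wrap a call such as curandGenerateUniform(gen, dst, n)).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper (not a CURAND class): hands out numbers one at a time,
// but asks the underlying generator for a whole block whenever it runs dry.
class BufferedRng {
public:
    // fill(dst, n) must write n random floats to dst.
    using FillFn = std::function<void(float*, std::size_t)>;

    BufferedRng(FillFn fill, std::size_t block_size)
        : fill_(std::move(fill)), buffer_(block_size), next_(block_size) {
        assert(block_size > 0);
    }

    // Returns the next buffered number, refilling the block once it is used up.
    float next() {
        if (next_ == buffer_.size()) {
            fill_(buffer_.data(), buffer_.size());
            ++refills_;
            next_ = 0;
        }
        return buffer_[next_++];
    }

    // Number of block-generation calls made so far.
    std::size_t refills() const { return refills_; }

private:
    FillFn fill_;
    std::vector<float> buffer_;
    std::size_t next_;         // index of the next unread element
    std::size_t refills_ = 0;  // how many times fill_ has been called
};
```

With a block size of 1000000, drawing 2000000 numbers through next() triggers only two fill calls, which matches the "two calls that generate 1000000 random numbers each" suggestion while leaving the consuming loop unchanged.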

Hello,

I am sorry for the very late reply. You are right, generating the numbers in huge blocks is much, much faster. I performed some tests and got the following results.

I generated the random numbers (RNs) in blocks of size N, as before:

  1. N<3000: CURAND-host is much faster.
  2. N>3000: CURAND-device is faster.
  3. N~15000: CURAND-device is ~10 times faster than CURAND-host.

I performed the tests using a Core i7 and a GTX 280.

Thanks,
ydm

For a plot of Park-Miller random number generation speed, see http://forums.nvidia.com/index.php?showtopic=155695

It covers a variety of thread counts and attached GPUs.

Top speed, more than 25 billion random numbers per second, needs at least 32768 threads, but the knee point is at about 4095 threads.

Bill
speed.pdf (3.91 KB)