CURAND low performance

gchantas · December 5, 2010, 9:32am

Can anyone tell me why generating random numbers with CURAND is much slower (about 100 times) than MATLAB’s randn? Or am I doing something wrong?
Thank you.
My code:
int N2 = 512*512;
float *q;
cudaMalloc((void **) &q, sizeof(float)*N2);
curandGenerator_t gen;

curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);

for (int j = 0; j < 100; j++) {
    curandSetPseudoRandomGeneratorSeed(gen, (unsigned long int)j*1000);
    curandGenerateNormal(gen, q, N2, 0.0, 1.0);
    …
}

curandDestroyGenerator(gen);

platinor · January 4, 2011, 1:55pm

Call curandSetPseudoRandomGeneratorSeed() only once; take it out of the loop.

ydelpi · January 4, 2011, 9:38pm

I have a question about the performance of CURAND too.

I compared CURAND random number generation on the host with RANLUX, and the latter is much faster. Does anyone know why random number generation on the host is so expensive with CURAND?

Thanks!

njuffa · January 5, 2011, 3:25am

Would it be possible to post a self-contained repro case?

ydelpi · January 5, 2011, 9:33am

Here is part of the code I used to compare the two RNGs:

…
// RANLUX
float ran[2];
clock_t t1 = clock();
rlxs_init(1, 12345);

for (int i = 0; i < 1000000; ++i)
    ranlxs(ran, 2);

clock_t t2 = clock();
cout << "Done RANLUX test in " << (t2-t1) << endl;

// CURAND HOST
curandGenerator_t gen;
t1 = clock();

curandCreateGeneratorHost(&gen, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(gen, 12345);

for (int i = 0; i < 1000000; ++i)
    curandGenerateUniform(gen, ran, 2);

curandDestroyGenerator(gen);

t2 = clock();
cout << "Done CURAND test in " << (t2-t1) << endl;

…

The output was:
Done RANLUX test in 120000
Done CURAND test in 20940000

As you can see, the difference is too large. To compile the code I used: nvcc -O3 -o test test.cpp ranlxs.o -lcurand

Maybe I am using the CURAND functions incorrectly.

Thanks!
ydm

NathanW · January 5, 2011, 10:25pm

The initial release of CURAND in CUDA 3.2 is optimized for throughput of generating lots of random numbers. To get the best performance you generally want to set up the generator once, then generate lots of random numbers in blocks that are as large as possible.

So platinor is right, you probably want to move the call to curandSetPseudoRandomGeneratorSeed() outside the loop. That would make CURAND only do the setup once.
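A minimal sketch of the loop from the first post with the seed set once, before the loop. The seed value 1234 and the error handling are illustrative assumptions, not taken from the original code, and the sketch assumes a CUDA-capable device and linking with -lcurand.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand.h>

int main() {
    const size_t n = 512 * 512;   // same buffer size as in the first post
    float *q = nullptr;

    if (cudaMalloc((void **)&q, sizeof(float) * n) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    curandGenerator_t gen;
    if (curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT) != CURAND_STATUS_SUCCESS) {
        fprintf(stderr, "curandCreateGenerator failed\n");
        cudaFree(q);
        return 1;
    }

    // Seed once, outside the loop, so the generator setup is done a single time.
    // The value 1234 is an arbitrary choice for illustration.
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    for (int j = 0; j < 100; ++j) {
        // Each call continues the same pseudorandom sequence.
        if (curandGenerateNormal(gen, q, n, 0.0f, 1.0f) != CURAND_STATUS_SUCCESS) {
            fprintf(stderr, "curandGenerateNormal failed at iteration %d\n", j);
            break;
        }
        // ... use q on the device here ...
    }

    // Generation is launched asynchronously; wait before timing or cleanup.
    cudaDeviceSynchronize();

    curandDestroyGenerator(gen);
    cudaFree(q);
    return 0;
}
```

Because the generator keeps its state between calls, each iteration still gets fresh numbers without reseeding.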

NathanW · January 5, 2011, 10:37pm

Thanks for posting the code, ydelpi.

for (int i=0 ; i<1000000 ; ++i)
    curandGenerateUniform(gen,ran,2);

The above snippet of code does 1000000 calls that generate two random numbers each time. It will be much more efficient to do fewer calls that generate more random numbers each time. For example, you might try two calls that generate 1000000 random numbers each. In general you will get the best performance from CURAND by generating blocks of random numbers that are as large as possible.
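If the consuming code really wants two numbers at a time, one way to follow this advice is to generate a large block and hand out small slices from it. The sketch below is only an illustration: BlockBuffer and make_demo_buffer are hypothetical names, not part of CURAND, and the fill function uses std::mt19937 as a stand-in so the sketch runs without CUDA.

```cpp
#include <cstddef>
#include <functional>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper (not part of CURAND): serves small requests from a
// large buffer that is refilled in bulk by a user-supplied function.
class BlockBuffer {
public:
    // fill(dst, n) must write n floats to dst; it is called once per refill.
    using FillFn = std::function<void(float*, std::size_t)>;

    BlockBuffer(std::size_t block_size, FillFn fill)
        : buf_(block_size > 0 ? block_size : 1),  // avoid an empty buffer
          pos_(buf_.size()),                      // start "empty" to force a first fill
          fill_(std::move(fill)) {}

    // Copies `count` numbers into `out`, refilling the buffer when it runs dry.
    void take(float* out, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            if (pos_ == buf_.size()) {
                fill_(buf_.data(), buf_.size());
                pos_ = 0;
                ++refills_;
            }
            out[i] = buf_[pos_++];
        }
    }

    // Number of bulk fills performed so far.
    std::size_t refills() const { return refills_; }

private:
    std::vector<float> buf_;
    std::size_t pos_;
    FillFn fill_;
    std::size_t refills_ = 0;
};

// Stand-in bulk generator (assumption: std::mt19937 replaces CURAND here).
inline BlockBuffer make_demo_buffer(std::size_t block_size) {
    return BlockBuffer(
        block_size,
        [engine = std::mt19937(12345u),
         dist = std::uniform_real_distribution<float>(0.0f, 1.0f)](
            float* dst, std::size_t n) mutable {
            for (std::size_t i = 0; i < n; ++i) dst[i] = dist(engine);
        });
}
```

With a host generator, the fill function would presumably wrap a single curandGenerateUniform(gen, dst, n) call, so the per-call cost is paid once per block instead of once per pair of numbers.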

If you get some good performance numbers please post them, I’d love to see them.

ydelpi · April 1, 2011, 12:43pm

Hello,

I am sorry for the very late reply. You are right, generating the numbers in huge blocks is much faster. I performed some tests and got the following results.

I generated (as before) the random numbers (RNs) in blocks of size N:

N<3000 : CURAND-host is much faster.
N>3000 : CURAND-device is faster.
N~15000 : CURAND-device is ~10 times faster than CURAND-host.

I performed the tests using a Core i7 and a GTX 280.

Thanks,
ydm

wlangdon · April 12, 2011, 10:52am

For a plot of Park-Miller random number generation speed for a variety of thread counts and attached GPUs, see The Official NVIDIA Forums | NVIDIA.

Top speed, more than 25 billion random numbers per second, needs at least 32768 threads, but the knee point is at about 4095 threads.

Bill
speed.pdf (3.91 KB)
