curand_kernel.h (CURAND) makes my program far too slow

Hello!

I want to create a random matrix in my CUDA program, something like the randn function in MATLAB. I added ONLY one line to my code:

#include <curand_kernel.h>

and my program becomes 16x slower (from 6 s to 100 s), just from including the curand_kernel.h file!

I’m using the CUDA VS Wizard, GPU architecture sm_13, a GTX 470, and CUDA 3.2.

That seems odd. I’m not sure what could be going on.

You mention that you are using a GTX 470 and also GPU architecture sm_13. The GTX 470 has compute capability 2.0, i.e. sm_20. You might try changing the build options to target sm_20 to see if that helps.

My only other guess is that it’s related to the precomputed tables. CURAND uses a bunch of precomputed matrices to speed up random state initialization. They are declared “constant” in curand_precalc.h which is included from curand_kernel.h. Maybe there is some sort of problem there (not sure what it would be).

I tried it on a 9800 GT (sm_11) and it works perfectly. The problem is probably the compute capability, or maybe Visual Studio. Thanks anyway!

curand_kernel.h contains a lot of device code. I would guess that what is happening is that if you are compiling for sm_13 and running on sm_20, all of that device code is getting JIT-recompiled by the driver for sm_20 at runtime. Your actual code execution won’t have changed, but the wall clock time from start to end has increased enormously because there are roughly 94 seconds of compilation followed by 6 seconds of run time. Compiling for sm_20 should eliminate the need for JIT compilation and bring the run time back to what you expect.
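If you want to check this guess yourself, here is a minimal sketch (not the original poster's code) that prints the device's compute capability and times the first kernel launch separately from a second one. Any driver JIT compilation should show up in the first number only. The kernel name fill_randn, the file name randn.cu, the matrix size and the seeds are all made up for illustration. The sketch also assumes a toolkit recent enough to compile <chrono>, which CUDA 3.2 is not.

```cuda
// Build (hypothetical file name randn.cu), matching -arch to your GPU:
//   nvcc -arch=sm_20 randn.cu -o randn
// sm_20 is what the GTX 470 in this thread needs; toolkits from CUDA 9
// onward no longer accept it, so substitute your own device's value.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Fill out[0..n) with standard-normal samples, roughly like MATLAB's randn.
// Each thread uses the same seed but its own subsequence.
__global__ void fill_randn(float *out, int n, unsigned long long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState state;
    curand_init(seed, i, 0, &state);
    out[i] = curand_normal(&state);
}

static double seconds_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int rows = 512, cols = 512, n = rows * cols;
    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Start timing before the first CUDA call: context creation and any
    // JIT compilation of embedded PTX happen somewhere in this first phase.
    auto t0 = std::chrono::steady_clock::now();

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);

    float *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    fill_randn<<<blocks, threads>>>(d_out, n, 1234ULL);
    cudaError_t err = cudaDeviceSynchronize();
    double first = seconds_since(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "first launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Second launch: the module is already loaded, so this is pure run time.
    auto t1 = std::chrono::steady_clock::now();
    fill_randn<<<blocks, threads>>>(d_out, n, 5678ULL);
    err = cudaDeviceSynchronize();
    double second = seconds_since(t1);
    if (err != cudaSuccess) {
        fprintf(stderr, "second launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("startup + first launch: %.3f s\n", first);
    printf("second launch:          %.3f s\n", second);

    cudaFree(d_out);
    return 0;
}
```

If the first number is far larger than the second when you build with an -arch below the device's compute capability, and the gap shrinks once -arch matches the device, that would be consistent with the JIT explanation. Keep in mind that the driver may cache JIT results between runs, so a second run of the same binary might not show the delay.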