That seems odd. I’m not sure what could be going on.
You mention that you are using a GTX 470 but building for GPU architecture sm_13. The GTX 470 has compute capability 2.0, i.e. sm_20. You might try changing the build options to target sm_20 (for example, -arch=sm_20 with nvcc) to see if that helps.
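If you want to double-check which architecture your card actually reports, something like the following should work. This is only a minimal sketch: `arch_flag` and `print_device_archs` are names I made up for illustration, and the code uses nothing beyond `cudaGetDeviceCount` and `cudaGetDeviceProperties` from the runtime API.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: build the nvcc architecture name (e.g. "sm_20")
// from a compute capability such as 2.0.
std::string arch_flag(int major, int minor) {
    return "sm_" + std::to_string(major) + std::to_string(minor);
}

// Hypothetical helper: print the compute capability of every visible device
// together with the matching -arch value. Returns the first CUDA error seen.
cudaError_t print_device_archs() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) return err;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) return err;
        std::printf("device %d: %s, compute capability %d.%d, build with -arch=%s\n",
                    dev, prop.name, prop.major, prop.minor,
                    arch_flag(prop.major, prop.minor).c_str());
    }
    return cudaSuccess;
}
```

Calling `print_device_archs()` from `main` on a GTX 470 should report compute capability 2.0, which is the value to pass to the compiler.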
My only other guess is that it’s related to the precomputed tables. CURAND uses a bunch of precomputed matrices to speed up random state initialization. They are declared “constant” in curand_precalc.h which is included from curand_kernel.h. Maybe there is some sort of problem there (not sure what it would be).
curand_kernel.h contains a lot of device code. My guess is that, because you are compiling for sm_13 and running on an sm_20 device, the driver is JIT-compiling all of that device code for sm_20 at runtime. The execution time of your actual code won’t have changed, but the wall-clock time from start to finish has increased enormously because there are 94 seconds of compilation followed by 6 seconds of run time. Compiling for sm_20 should eliminate the need for JIT compilation and bring the run time back to what you expect.
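One way to test the JIT theory is to time the startup and the first kernel launch separately from a second, identical launch. The sketch below is an assumption-laden illustration, not your code: the kernel `init_states`, the helpers `seconds_since` and `time_startup_and_kernel`, the file name in the build comments, and the grid size are all made up. Depending on the driver, any JIT cost may show up at context creation or at the first launch, so both are printed.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical build lines (the file name is a placeholder):
//   nvcc -arch=sm_13 jit_timing.cu -o jit_timing   // mismatched target for a GTX 470
//   nvcc -arch=sm_20 jit_timing.cu -o jit_timing   // native target for a GTX 470

// Illustrative kernel: one CURAND state per thread, same seed, distinct sequences.
__global__ void init_states(curandState *states, unsigned long long seed) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);
}

// Wall-clock seconds elapsed since t0.
double seconds_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Times context creation and two identical launches. Returns 0 on success, 1 on error.
int time_startup_and_kernel() {
    const int blocks = 64, threads = 64;  // arbitrary sizes for the sketch
    curandState *states = NULL;

    auto t0 = std::chrono::steady_clock::now();
    if (cudaFree(0) != cudaSuccess) return 1;  // forces context creation
    double startup = seconds_since(t0);

    if (cudaMalloc(&states, blocks * threads * sizeof(curandState)) != cudaSuccess) return 1;

    double launch[2];
    for (int i = 0; i < 2; ++i) {
        t0 = std::chrono::steady_clock::now();
        init_states<<<blocks, threads>>>(states, 1234ULL);
        if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(states); return 1; }
        launch[i] = seconds_since(t0);
    }

    std::printf("startup: %.3f s, first launch: %.3f s, second launch: %.3f s\n",
                startup, launch[0], launch[1]);
    cudaFree(states);
    return 0;
}
```

If the sm_13 build shows a very large startup or first-launch time while the second launch is fast, and the sm_20 build does not, that would be consistent with the JIT explanation.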