Runtime initialization slow (1 sec) on 400-500 series cards, very slow (5 sec) with CUDA 3.2

I noticed that the real (wall-clock) time used by my program for short computations was longer on some new GPUs than on the CPU. I isolated the delay to the runtime initialization associated with the first memory allocation. My initial device query is fast, but the initialization, which can be triggered with a cudaFree(0) as others have suggested, is slow. This happens even when the process is run in rapid succession, and running nvidia-smi makes no difference.

By slow I mean over a second, not the dozens of milliseconds others have reported. Here are some measurements of initialization time under Red Hat Enterprise Linux 5:
1.55 sec GTX 460 driver 195.36.31
1.49 sec GT 420 driver 260.19.44
1.43 sec GTX 580 driver 260.19.44

Seven older cards (three GTX 275s, a GTX 285, an FX 3800, and a GeForce 210 and 250), all running driver 195.36.31, have initialization times ranging from 0.05 to 0.15 sec.

This was all with CUDA 3.0. CUDA 3.1 gives similar results. It gets worse with CUDA 3.2: the initialization time is 5.5 seconds on the GTX 580.

Also, bandwidthTest takes 4.5 seconds to run on the GTX 580 with CUDA 3.0 and less than a second on the older cards.

All these times seem pathologically long. Are others seeing such times? Could there be something special (security-related?) about our systems that is making this happen?

Are you running X11 on these cards? If not, the time you are seeing is driver and card initialisation time. The NVIDIA Linux driver unloads itself when there are no client connections to it. If you are not running X11, try running nvidia-smi in a loop in the background, with a loop time of 20 seconds. That polling from nvidia-smi should stop the driver from unloading.
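
The background polling loop described above could be sketched roughly as follows. This is only an illustration: poll_driver is a made-up helper name, and the poll count argument is an assumption added so the loop can be stopped after a fixed number of rounds. The 20 second interval and the use of nvidia-smi come from the suggestion above.

```shell
#!/bin/sh
# Sketch only: poll_driver is a hypothetical helper, not part of any NVIDIA tool.
#   $1   = seconds to wait between polls (20 in the suggestion above)
#   $2   = number of polls to make; 0 means keep polling forever
#   rest = command to run on each poll (nvidia-smi in real use)
poll_driver() {
    interval=$1
    max_polls=$2
    shift 2
    polls_done=0
    while [ "$max_polls" -eq 0 ] || [ "$polls_done" -lt "$max_polls" ]; do
        # The output is discarded; only the periodic client connection matters.
        "$@" > /dev/null 2>&1
        polls_done=$((polls_done + 1))
        sleep "$interval"
    done
}
```

With that function defined, something like "poll_driver 20 0 nvidia-smi &" would start the loop in the background, and it can be stopped later with kill. Whether this helps depends on the delay really being caused by the driver unloading.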

Of course, if you are running X11 using the NVIDIA driver, then it must be something else.

A number of slow initialization bugs (particularly for 64-bit Linux with Fermi cards, first appearing with 64-bit support in CUDA 3.2) have been fixed in the most recent driver, 270.35. Please try with that driver or any newer one.

Thanks. The highest available beta version on the regular download site is 270.26. I just tried that: the device allocation step jumped up to 4 seconds and the initialization still takes a second. I’ll look forward to trying 270.35 or higher once it at least reaches beta status.

270.40 is the RC2 driver:

270.41.06 is now released as a recommended driver and it solves the problems that I described.