I noticed that the real (wall-clock) time used by my program for short computations was longer on some new GPUs than on the CPU. I isolated the delay to the runtime initialization associated with the first memory allocation. My initial device query is fast, but the initialization, which can be triggered with a cudaFree(0) as others have suggested, is slow. The slowdown persists even when the process is run repeatedly in rapid succession, and running nvidia-smi beforehand makes no difference.
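For reference, this is roughly how I am measuring it: a minimal sketch using gettimeofday, not the exact harness from my program. The key point is that cudaFree(0) is the first runtime call, so the elapsed time around it captures context creation and nothing else.

```cuda
// Minimal timing sketch: the first CUDA runtime call (cudaFree(0) here)
// triggers context creation; calls after it should run at normal speed.
#include <cuda_runtime.h>
#include <cstdio>
#include <sys/time.h>

static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void) {
    double t0 = now_sec();
    cudaFree(0);                  // forces runtime/context initialization
    double t1 = now_sec();
    printf("runtime init: %.2f s\n", t1 - t0);

    void *p = NULL;
    double t2 = now_sec();
    cudaMalloc(&p, 1 << 20);      // subsequent allocations should be fast
    double t3 = now_sec();
    printf("first cudaMalloc after init: %.4f s\n", t3 - t2);
    cudaFree(p);
    return 0;
}
```

On the new cards, essentially all of the delay shows up in the first printf.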
By slow I mean over a second, not the dozens of milliseconds others have reported. Here are some measurements of initialization time under Red Hat Enterprise Linux 5:
1.55 sec GTX 460 driver 195.36.31
1.49 sec GT 420 driver 260.19.44
1.43 sec GTX 580 driver 260.19.44
Seven older cards (three GTX 275s, a GTX 285, an FX 3800, a GeForce 210, and a GeForce 250), all running driver 195.36.31, have initialization times ranging from 0.05 to 0.15 sec.
This was all with CUDA 3.0. CUDA 3.1 gives similar results, but it gets worse with CUDA 3.2: the time rises to 5.5 seconds on the GTX 580.
Also, the bandwidthTest SDK sample takes 4.5 seconds to run on the GTX 580 with CUDA 3.0, versus less than a second on the older cards.
All these times seem pathologically long. Are others seeing such times? Could there be something special (security-related?) about our systems that is making this happen?