CUDA initialization very slow on GeForce GTX 465 Initialization takes 1-4 *seconds* on GeForce GTX 4

I have some test code that creates a thread, calls cudaSetDevice(), then does some cudaMalloc()s, timing how long they take.

The first cudaMalloc() in each thread takes a long time, while subsequent calls are fast.
That is understood: the first call causes a CUDA context to be initialized, and there is
also some overhead for the first use of cudaMalloc() itself (see http://forums.nvidia.com/index.php?showtopic=158779).
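
For anyone who wants to reproduce the measurement, here is a minimal sketch of the kind of test described above. It is not the original poster's code: the helper names timedMallocMs and timeAllocations, the 1 MiB allocation size, and the use of std::thread and std::chrono are all assumptions made for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: time one cudaMalloc() of `bytes` bytes on the host clock.
// Returns elapsed milliseconds; the status of the call is written to *status.
static double timedMallocMs(size_t bytes, void** ptr, cudaError_t* status) {
    auto start = std::chrono::steady_clock::now();
    *status = cudaMalloc(ptr, bytes);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical worker: select `device`, then time `count` allocations.
// One entry is appended to *timesMs per successful allocation.
static void timeAllocations(int device, int count, std::vector<double>* timesMs) {
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDevice(%d): %s\n", device, cudaGetErrorString(err));
        return;
    }
    std::vector<void*> buffers;
    for (int i = 0; i < count; ++i) {
        void* p = nullptr;
        cudaError_t status = cudaSuccess;
        double ms = timedMallocMs(1 << 20, &p, &status);  // 1 MiB, arbitrary size
        if (status != cudaSuccess) {
            std::fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(status));
            break;
        }
        timesMs->push_back(ms);
        buffers.push_back(p);
    }
    // Release everything only after all timings have been taken.
    for (void* p : buffers) {
        cudaFree(p);
    }
}
```

Running timeAllocations(0, 4, &times) on a std::thread, joining it, and printing the entries of times should show the first entry dominating if context creation is the cost being measured.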

On my ‘home’ system with 2 x GeForce GTX 465, the first cudaMalloc() takes 1-4 seconds.

On my ‘work’ system with 4 x Tesla 1060 (running the same code) it takes 70 milliseconds.

Does anyone have any idea why the 465 is so slow? Could it be some other aspect of my system making it slow?
Or is the Tesla unusually fast?

The end product may have to run on a whole range of hardware, from ‘PSCs’ to laptops with a single low-powered CUDA
device, so it may be important to know what I should expect from different devices.

Are you running X11 on the GTX465 system?

Not on the GTX 465s. I also have a GeForce 210 that I run X on.

My first cudaMalloc() is also very slow. I am wondering why.

As the original poster mentioned above, the very first call to any CUDA API function triggers the creation of a CUDA context “under the hood”. A fair amount of work goes into context creation, so there will be a delay. A multi-second delay can happen under Linux when the kernel module needs to be loaded as part of the context creation process. To keep the kernel module resident, turn on persistence mode with nvidia-smi. Users encountering multi-second context creation delays despite using persistence mode should file a bug with a self-contained repro case, noting the exact platform configuration.

Often, cudaMalloc() is the first CUDA API call in a CUDA application and thus gets affected by the context creation delay. If that is inconvenient for some reason, context creation can be triggered by a call to cudaFree(0) prior to the first cudaMalloc().
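
A minimal sketch of that warm-up, assuming host-side timing with std::chrono is good enough for a rough comparison. The helper names warmUpContext, timedMalloc and reportStartupCost are hypothetical, as is the 1 MiB allocation size.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed on the host clock since `start`.
static double msSince(Clock::time_point start) {
    return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

// Hypothetical helper: trigger context creation with cudaFree(0) and
// report how long that call took. Returns the status of cudaFree(0).
static cudaError_t warmUpContext(double* elapsedMs) {
    Clock::time_point start = Clock::now();
    cudaError_t status = cudaFree(0);
    *elapsedMs = msSince(start);
    return status;
}

// Hypothetical helper: time a single cudaMalloc() of `bytes` bytes,
// then free the buffer again (the free is not included in the timing).
static cudaError_t timedMalloc(size_t bytes, double* elapsedMs) {
    void* p = nullptr;
    Clock::time_point start = Clock::now();
    cudaError_t status = cudaMalloc(&p, bytes);
    *elapsedMs = msSince(start);
    if (status == cudaSuccess) {
        cudaFree(p);
    }
    return status;
}

// Print the cost of the warm-up call and of the first cudaMalloc() after it.
static void reportStartupCost() {
    double warmMs = 0.0;
    double mallocMs = 0.0;
    cudaError_t status = warmUpContext(&warmMs);
    if (status != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0): %s\n", cudaGetErrorString(status));
        return;
    }
    status = timedMalloc(1 << 20, &mallocMs);  // 1 MiB, arbitrary size
    if (status != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(status));
        return;
    }
    std::printf("cudaFree(0): %.3f ms, first cudaMalloc: %.3f ms\n", warmMs, mallocMs);
}
```

This only moves the delay to a point of your choosing, such as application startup; it does not make context creation itself any faster. The first cudaMalloc() after the warm-up may still be somewhat slower than later ones because of its own first-use overhead.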