First CUDA call takes 13 seconds

I set up CUDA on Linux after upgrading to a new graphics card (a 980 Ti), and now the first CUDA call of any program takes around 13 seconds to complete on Ubuntu 14.10 (with around 100% CPU usage).

I ran the following code:

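#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
int main()
{
   // Time two back-to-back calls; only the first one pays the CUDA initialization cost.
   for (int i = 0; i < 2; i++)
   {
      auto start = std::chrono::high_resolution_clock::now();
      cudaDeviceSynchronize();
      auto elapsed = std::chrono::high_resolution_clock::now() - start;
      printf("%lld microseconds\n", (long long)std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
   }
}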

Which produced these outputs for the first and second calls respectively:
(Ubuntu 14.10 with 352.21 driver)

13371595 microseconds
3 microseconds

(Windows 8.1 with 353.30 driver)

287011 microseconds
0 microseconds

Note that this problem does not occur only with cudaDeviceSynchronize. I also tried an empty kernel call and a cudaMalloc call, with similar results. The delay also occurs when running the nvidia-smi command.

I found some old threads (https://devtalk.nvidia.com/default/topic/480579/slow-cuda-programs-39-startup/ and https://devtalk.nvidia.com/default/topic/696488/first-cuda-function-call-very-slow-more-than-a-minute-on-gtx-680-only/?offset=4) mentioning similar problems but none of the solutions discussed there fixed the problem in my case.

Does anyone know how I might solve this?

If you have a lot of system memory and multiple GPUs, the VM initialization time incurred by the GPU driver as it is starting up the CUDA runtime can be significant.

http://stackoverflow.com/questions/31160795/why-does-my-hello-world-program-take-almost-10s

If you have multiple GPUs, try using the CUDA_VISIBLE_DEVICES environment variable to limit your test to a single GPU.
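The variable is normally set in the shell before launching the program, but it can also be set from inside the process as long as that happens before the first CUDA call, since the runtime reads it when it initializes. A minimal sketch of that, assuming Linux (setenv is POSIX); the helper name limit_visible_devices is made up for illustration, and which index corresponds to the 980 Ti on a given system is something you would have to check:

```cpp
#include <cstdlib>   // setenv (POSIX), getenv
#include <string>

// Hypothetical helper: restrict the CUDA runtime in this process to the
// devices named in device_list (for example "0").
// Assumption: this runs before the first CUDA call in the process; setting
// the variable after the runtime has initialized has no effect.
bool limit_visible_devices(const std::string& device_list)
{
    // Third argument 1 = overwrite any value inherited from the shell.
    return setenv("CUDA_VISIBLE_DEVICES", device_list.c_str(), 1) == 0;
}
```

Calling limit_visible_devices("0") at the top of main, before the timing loop, would then restrict the test to the first enumerated GPU.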

I don’t know if you’re able to set persistence mode on the 980 Ti (I think not). If you can, it might help a bit.

If you set up an X-server on the 980 Ti (probably not optimal for a number of other reasons), I would expect this delay to mostly go away. On Windows, the 980 Ti GPU is in WDDM mode, which means it is awake and ready to go all the time - thus no VM startup delays.

(VM = Virtual Memory, as in UVM Unified Virtual Memory)

Thanks for the quick response.

I do have a second GPU in the system (a 660 Ti); however, only the 980 Ti is detected. I have 16GB of system memory, though I never had any initialization problems on my 660 Ti, which also supported UVM.

I enabled persistence mode earlier, though sadly it didn’t provide any performance gains.

I do have an X-server running on the 980 Ti.

The 660 Ti is not detected? That is quite odd. Not sure what you mean by that.

So the 660 Ti is plugged into that system, but when you run nvidia-smi with the 352.21 driver, it doesn’t list it?

To clarify, the 660 Ti is identified by lspci, but nvidia-smi does not list it as one of the system’s GPUs.

I would suggest investigating that. In fact, if your display is running on the GTX 980 Ti, just remove the 660 Ti from the system if it is not functional. It may be causing unknown problems. You should also make sure that the nouveau driver has been properly removed from the system. That is covered in the Linux getting started guide.
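One way to check whether nouveau is still loaded is to look for it in /proc/modules, which is the same list that lsmod prints. A rough sketch of that check; the function names module_loaded and nouveau_loaded are made up for illustration, and this only tells you whether the module is currently loaded, not whether it has been blacklisted for the next boot:

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns true if a module with the given name appears in a
// /proc/modules-style listing, where each line begins with the module name
// followed by whitespace and further fields.
bool module_loaded(std::istream& modules, const std::string& name)
{
    std::string line;
    while (std::getline(modules, line)) {
        std::istringstream fields(line);
        std::string first;
        // Compare the whole first field so "nouveau_x" does not match "nouveau".
        if (fields >> first && first == name) return true;
    }
    return false;
}

// Hypothetical convenience wrapper for the running kernel. If /proc/modules
// cannot be opened, the stream is empty and this returns false.
bool nouveau_loaded()
{
    std::ifstream modules("/proc/modules");
    return module_loaded(modules, "nouveau");
}
```

If nouveau_loaded() returns true on a system that is supposed to be using the NVIDIA driver, the removal steps in the getting started guide are worth revisiting.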

Nouveau was the issue. Disabling it fixed the startup time issue and my 660 ti is now recognized. Thanks for the help.