Strange delay on CUDA initialization

Hi, thank you for reading this topic.
I’d like some help with a strange delay during CUDA initialization.

I’m using three GTX480 cards in non-SLI mode and running my CUDA application on Linux.
I also have a second system with three GTX285 cards in non-SLI mode, running the same platform.

When I run my program on the first system (3 GTX480), at the initialization step
one card initializes almost immediately (about 0.2ms), while the other two cards take 3000ms±0.5ms each.
This doesn’t happen only on the first run after rebooting; it happens again on every later run.
It also shows up not only in the initialization step (the cudaSetDevice function) but also in the data transfer step (the first cudaSend function).

But this doesn’t happen on the second system (3 GTX285), and it’s driving me crazy…

As a result, this strange delay adds 6 secs (sometimes 9 secs) to startup on my first system.
Following what I found by searching, I tried the nvidia-smi tool to keep the devices initialized on Linux, but it didn’t help at all.
So I think something else is wrong with my system, but I can’t find what it is.

Please help me to fix this!

This is a known problem (Fermi-based devices do some additional work at startup for UVA etc.), and it will be improved in future driver releases.

Thank you, that’s a valuable answer.

I hope NVIDIA’s developers will fix this problem as soon as possible…

Has this problem been solved in the current driver? I have it too, and it is a real problem for me because it accounts for 1/5 of my program’s total running time. Persistent Mode is enabled.


Well, I had the same problem as well… Adding “cudaSetDevice(0);” at the very beginning of my program seemed to initialize the GPU. Even though the time is greatly reduced, it now takes about 60ms, which is still not fast.

I have a GTX560 TI

Btw, How exactly do you enable Persistent Mode? I’m using VS2010

nvidia-smi -pm 1

cudaSetDevice(0) did not solve my problem; I still have a delay of about 11sec. The system has 2x C2050 cards.

You would have to use



to initialize both cards. But yeah, 11sec is crazy long. I hope you figure it out soon.

But for now you can just measure your results by neglecting the 11 seconds, since it’s not something that should be happening.
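
In case it helps, here is a rough sketch of one way to do that: touch every device once at startup and print how long each one took, so the startup cost can be kept out of the real measurements. This is not the exact code from the post above. The helper name warm_up_all_devices is made up for illustration, and using cudaFree(nullptr) to force context creation is a common idiom that I am assuming is enough here.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from this thread): forces context creation on
// every visible device and prints how long each one took.
// Returns the number of devices, or 0 if the runtime reports an error.
int warm_up_all_devices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        auto t0 = std::chrono::steady_clock::now();
        cudaSetDevice(dev);
        // Assumption: a no-op free is enough to make the runtime create
        // the context for the currently selected device.
        cudaError_t err = cudaFree(nullptr);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("device %d: init took %.1f ms (%s)\n",
                    dev, ms, cudaGetErrorString(err));
    }
    return count;
}
```

Call warm_up_all_devices() as the first thing in main() and start your own timers only after it returns. The per-device numbers it prints should also show which cards are responsible for the delay.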