Really slow cudaGetDeviceCount(): several seconds to complete a single call


I am having an issue where my cudaGetDeviceCount call is taking several seconds (about 8). My hardware platform is a Dell R710 with a Tesla C2050 running RHEL 5.5 (64-bit) with the 3.2 driver (260.19.26) and the 3.2.16 RHEL 5.5 CUDA toolkit.

Section 3.2 of the CUDA C Programming Guide version 3.2 states that “There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the device and version management sections of the reference manual).” cudaGetDeviceCount is in the Device Management section of the reference manual, so I wouldn’t think this delay is the runtime initialization the guide is talking about.
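One way to see where the time actually goes is to time cudaGetDeviceCount separately from a call that definitely triggers context creation, such as cudaFree(0). This is just a sketch (compile with nvcc; the cudaFree(0) idiom to force runtime initialization is an assumption about where the cost lands, not something the guide spells out):

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

/* Wall-clock time in seconds. */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    int n = 0;
    double t0 = now();
    cudaGetDeviceCount(&n);   /* device management call */
    double t1 = now();
    cudaFree(0);              /* forces context creation */
    double t2 = now();
    printf("%d gpus; cudaGetDeviceCount: %.2f s, context init: %.2f s\n",
           n, t1 - t0, t2 - t1);
    return 0;
}

If the first interval dominates, the delay is happening before any runtime initialization, which would point at the driver rather than the CUDA runtime.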
Any ideas about why this is taking so long?

Thanks in advance for any assistance,

My output and cuda code are as follows:

[root@10-0-200-171 ~]# date;./a.out;date
Wed Mar 30 20:12:29 MDT 2011
4 gpus, done
Wed Mar 30 20:12:37 MDT 2011
[root@10-0-200-171 ~]# cat
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int numgpus = 0;
    cudaGetDeviceCount(&numgpus);
    printf("%d gpus, done\n", numgpus);
    return 0;
}
Try running nvidia-smi in loop mode with a loop interval of 10 seconds as a background process and see if it improves. The NVIDIA kernel driver unloads a lot of code and state when no client is connected to it (normally X11, but a user application or nvidia-smi also counts as a client). The long delay you are seeing is probably the time taken for the driver to reload itself and then initialize the card. With nvidia-smi kept running, the driver won’t unload between user code runs.
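A minimal way to try this from a shell (note: the exact flag syntax varies between driver versions; on recent drivers -l takes the loop interval in seconds, so check nvidia-smi --help for your driver):

# Keep the driver loaded by polling the GPU every 10 seconds,
# discarding the output so nothing accumulates on disk.
nvidia-smi -l 10 > /dev/null 2>&1 &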

That’s it! I ran nvidia-smi in the background and now my test runs in less than a second. Thanks.

Hello Joe,

I just bought the same hardware as you describe: Dell R710 and Tesla C2050. But it’s not at all obvious how to connect the NVIDIA card to the server. The standard Dell PCI Express x16 Gen 2 riser card has the wrong orientation (it is meant for single-slot cards). Then there’s the problem of power supply: the riser card will deliver a grand total of 25 W, while the Tesla requires ~270 W!

But presumably you know all this and have solved it if you’re worried about getting your CUDA software layer running.

Any hints/stories would be welcome.