I am having an issue where my cudaGetDeviceCount call is taking several seconds (about 8) to return. My hardware platform is a Dell R710 with a Tesla C2050 running RHEL 5.5 (64-bit) with the 3.2 (260.19.26) driver and the 3.2.16 RHEL 5.5 CUDA toolkit.
Section 3.2 of the CUDA C Programming Guide version 3.2 states that “There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the device and version management sections of the reference manual).” cudaGetDeviceCount is in the Device Management section of the reference manual, so I wouldn’t expect this delay to be the runtime initialization the guide is describing.
Any ideas about why this is taking so long?
Try running nvidia-smi in loop mode with a loop interval of 10 seconds as a background process and see if it helps. The NVIDIA kernel driver unloads a lot of code and state if there is no client connected to it (normally X11, but a user application or nvidia-smi does the same thing). The long delay you are seeing is probably the time taken for the driver to reload itself and then initialize the card. With nvidia-smi kept running, the driver won’t unload between user code runs.
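Something along these lines should work as the keep-alive. Note the loop-mode syntax varies between nvidia-smi versions (some take the interval as an argument to the loop flag, older ones use a separate interval option), so check `nvidia-smi -h` on your driver first:

```shell
# Keep a client attached to the NVIDIA kernel driver so it never unloads
# between CUDA runs. Assumes "-l 10" means "loop every 10 seconds" on this
# nvidia-smi version; verify with `nvidia-smi -h`.
nohup nvidia-smi -l 10 >/dev/null 2>&1 &
echo "nvidia-smi keep-alive started with PID $!"
```

You can stop it later with `kill` on the printed PID. (An init script that launches this at boot is a common way to make it permanent on a headless box.)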
I just bought the same hardware as you describe: a Dell R710 and a Tesla C2050. But it’s not at all obvious how to connect the NVIDIA card to the server. The standard Dell PCI Express x16 Gen 2 riser card has the wrong orientation (it is meant for single-slot cards). Then there’s the problem of power supply: the riser card will give a grand total of 25W, while the Tesla requires ~270W!
But presumably you know all this and have solved it if you’re worried about getting your CUDA software layer running.