Long initialization time C1060

We just received our new Fedora 10 workstation, which contains dual C1060s that I access through SSH. After running a few tests I noticed that it takes ~1.7 seconds to run the first cudaMalloc, which seems quite high compared to what I was able to find elsewhere. Can anyone confirm or deny that ~1.7 seconds for the initial cudaMalloc on a C1060 is normal? I have made a simple sample that demonstrates this; the result is ~1.7 seconds. Thanks

I compile the program as such:

nvcc -O3 MallocTest.cu

//MallocTest.cu
#include <iostream>
#include <cuda.h>

//Timers
#include <sys/time.h>

timeval startTime, stopTime, totalTime;

int main(void)
{
    float *a_d;                          // pointer to device memory
    int N = 1024;
    size_t size = N * sizeof(float);

    // Get starting time
    gettimeofday(&startTime, NULL);

    // allocate array on device
    cudaMalloc((void **) &a_d, size);

    // Stop timer and print total time
    gettimeofday(&stopTime, NULL);
    timersub(&stopTime, &startTime, &totalTime);
    std::cout << "Wallclock time  : "
              << totalTime.tv_sec + totalTime.tv_usec / 1000000.0
              << " seconds." << std::endl;

    // cleanup
    cudaFree(a_d);
    return 0;
}
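
To see how much of the ~1.7 seconds is one-time driver/context initialization rather than the allocation itself, a common idiom is to force initialization with a no-op call such as cudaFree(0) before starting the timer. The following is only a sketch of that variation (the warm-up call and the second timer are additions, not part of the original test):

//MallocWarmupTest.cu  (sketch: split context init from the first cudaMalloc)
#include <iostream>
#include <cuda.h>
#include <sys/time.h>

static double seconds(const timeval &a, const timeval &b)
{
    timeval d;
    timersub(&b, &a, &d);
    return d.tv_sec + d.tv_usec / 1000000.0;
}

int main(void)
{
    timeval t0, t1, t2;
    float *a_d;
    size_t size = 1024 * sizeof(float);

    gettimeofday(&t0, NULL);
    cudaFree(0);                         // no-op that forces context creation
    gettimeofday(&t1, NULL);
    cudaMalloc((void **) &a_d, size);    // the allocation itself
    gettimeofday(&t2, NULL);

    std::cout << "Context init : " << seconds(t0, t1) << " s" << std::endl;
    std::cout << "cudaMalloc   : " << seconds(t1, t2) << " s" << std::endl;

    cudaFree(a_d);
    return 0;
}

If nearly all of the time lands in the "Context init" line, the cost is initialization rather than cudaMalloc itself, which matches the driver explanation in the replies below.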

Is your nvidia driver loaded? Check with lsmod | grep nvidia. Normally, if you are at runlevel 3 (no X up), each time you start that code the driver has to be initialized first, and that takes some time.

Yes, my runlevel is 3, and lsmod | grep nvidia returned:

nvidia			   9679432  0 

i2c_core			   29216  2 nvidia,i2c_i801

If I am reading the above correctly, the module is listed but its use count is 0, so nothing is keeping the driver initialized, as you said? Is this pretty much the nature of using a Tesla series card, or is there any way to get around it? Thanks for the help.

I think you can run nvidia-smi in a loop in the background to get around this. Keep the sampling rate low, though, so it uses a minimum of CPU.

(there’s a non-trivial amount of time required for the first device to attach to the driver, and that’s reset when all devices have detached)
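
If running nvidia-smi in a loop is not appealing, the same idea can be sketched as a tiny keep-alive program: it creates a context on each device and then sleeps, so at least one process stays attached to the driver and later runs skip the slow first-time attach. This is only an illustrative sketch of that approach, not a tool mentioned in this thread:

//keepalive.cu  (sketch: hold a context on every device so the driver stays attached)
#include <iostream>
#include <unistd.h>
#include <cuda.h>

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Touch each device once so a context exists and the device stays attached.
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaFree(0);                     // no-op that forces context creation
    }

    std::cout << "Holding contexts on " << deviceCount
              << " device(s); leave this running." << std::endl;

    // Sleep until killed; the devices detach when the process exits.
    for (;;)
        pause();
}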