We just received our new Fedora 10 workstation, which contains dual C1060s that I access through SSH. After running a few tests I noticed that the first cudaMalloc takes ~1.7 seconds, which seems quite high compared to what I was able to find elsewhere. Can anyone confirm or deny that it is normal for the initial cudaMalloc on a C1060 to take ~1.7 seconds? I have put together a simple sample that demonstrates this; the result is ~1.7 seconds. Thanks
I compile the program as such:
nvcc -O3 MallocTest.cu
//MallocTest.cu
#include <iostream>
#include <cuda.h>
//Timers
#include <sys/time.h>
timeval startTime, stopTime, totalTime;
int main(void)
{
float *a_d; // pointer to device memory
int N = 1024;
size_t size = N*sizeof(float);
//Get Starting Timer
gettimeofday(&startTime, NULL);
// allocate array on device
cudaMalloc((void **) &a_d, size);
//Get stopping time and print total time
gettimeofday(&stopTime, NULL);
timersub(&stopTime, &startTime, &totalTime);
std::cout << "Wallclock time : " << totalTime.tv_sec + totalTime.tv_usec/1000000.0 << " seconds." <<std::endl;
//cleanup
cudaFree(a_d);
}
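If it helps narrow things down, a variant that pulls the one-time initialization out of the timed region (using cudaFree(0) as a warm-up call, on the assumption that the cost is in context/driver setup rather than in cudaMalloc itself) would look roughly like this:

//MallocTest2.cu - same test, with initialization pulled out of the timed region
#include <iostream>
#include <cuda.h>
#include <sys/time.h>
timeval startTime, stopTime, totalTime;
int main(void)
{
float *a_d; // pointer to device memory
int N = 1024;
size_t size = N*sizeof(float);
// Warm-up: cudaFree(0) allocates nothing, but forces the runtime
// to create its context, so the driver attach happens here
cudaFree(0);
// Time only the allocation itself
gettimeofday(&startTime, NULL);
cudaMalloc((void **) &a_d, size);
gettimeofday(&stopTime, NULL);
timersub(&stopTime, &startTime, &totalTime);
std::cout << "cudaMalloc time : " << totalTime.tv_sec + totalTime.tv_usec/1000000.0 << " seconds." <<std::endl;
//cleanup
cudaFree(a_d);
}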
Yes, my runlevel is 3, and lsmod | grep nvidia returned:
nvidia 9679432 0
i2c_core 29216 2 nvidia,i2c_i801
If I am reading the above correctly, it appears that the driver is not being kept initialized, as you said? Is this pretty much the nature of using a Tesla-series card, or is there any way to get around it? Thanks for the help.
I think you can run nvidia-smi in a loop in the background to get around this. Lower the sampling rate, though, so it uses only a minimal amount of CPU.
(there’s a non-trivial amount of time required for the first device to attach to the driver, and that’s reset when all devices have detached)
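If you'd rather not loop nvidia-smi, an alternative (just a sketch, I haven't tried it on a C1060) is a tiny background process that creates a context and then sleeps; as long as it is running, at least one device stays attached to the driver, so later programs shouldn't pay the attach cost again:

//holdcontext.cu - hold a CUDA context open so the driver stays attached
#include <cuda.h>
#include <unistd.h>
int main(void)
{
// cudaFree(0) allocates nothing, but forces the runtime to create
// a context, which attaches this process to the driver
cudaFree(0);
// Sleep forever; kill the process to release the context
for (;;)
sleep(60);
}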