I searched and found that the watchdog timer triggers if execution time exceeds a threshold (e.g., several seconds). In my case, the driver crashed when the first cudaMemcpy took about 2 seconds, and I had to reboot the machine to reset the driver.
I checked the CPU and GPU buffer sizes to make sure the GPU buffer is large enough to hold the data. I also inserted cudaDeviceSynchronize() after the cudaMemcpy (even though cudaMemcpy already synchronizes implicitly).
Normally this cudaMemcpy finishes in milliseconds, so taking around 2.7 seconds strongly suggests something is wrong.
My system configuration:
OS: Ubuntu 14.04.1
GPU: GTX 450
Nvidia Driver Version: 352.63
I hope my description makes the issue clear. If anything is unclear, please leave a comment.
Thank you.
In order for others to assist you in your debugging efforts it would be extremely helpful if you could post a minimal, complete, compilable and buildable program that reproduces the issue. Any other approach typically leads to lengthy, and often unproductive, forum threads.
Are you sure it is the cudaMemcpy() that takes more than two seconds? Based on your description it seems possible that a previous kernel launch exceeded the time limit and was killed by the watchdog timer, but that this is not detected until a subsequent synchronizing call to cudaMemcpy().
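One way to test that hypothesis is to check for errors immediately after the kernel launch, before the copy ever runs. A minimal sketch (kernel name and launch configuration are placeholders, not from the original post):

```cuda
// Localize where the failure is actually reported.  If the watchdog
// killed the kernel, cudaDeviceSynchronize() right after the launch
// will return cudaErrorLaunchTimeout, and the later cudaMemcpy is
// merely the messenger.
my_kernel<<<grid, block>>>(args);                    // placeholder kernel
cudaError_t launch_err = cudaGetLastError();         // launch-time errors
cudaError_t sync_err   = cudaDeviceSynchronize();    // execution-time errors
if (launch_err != cudaSuccess || sync_err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s / %s\n",
            cudaGetErrorString(launch_err),
            cudaGetErrorString(sync_err));
}
// Only now time the cudaMemcpy; if it still takes seconds,
// the copy itself (or the source memory) is the problem.
```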
I assume you have already performed standard due diligence on your hardware, such as making sure that the GPU is plugged into the correct PCIe slot, the card is firmly seated in the slot, all GPU power connectors are properly connected, the fan on the GPU is operating properly, and nvidia-smi does not show any exceptional operating conditions.
After a long time debugging, I finally figured out why the first cudaMemcpy takes over 2 seconds. I commented out all the kernels of my application and left only the memory allocation and memory copy, which leads to the following scenario, which looks interesting. My application reads an external dataset, so it may not be practical to post reproducible code, but the following code snippets should explain what the issue is.
The original code, in a file named myapp.cu:

int num = 2;
for (i = 0; i < num; i++) {
    read_cpu_buffer();
    gpuErrchk(cudaMemcpy(gpu_buffer, cpu_buffer, bytes, cudaMemcpyHostToDevice));
    cudaDeviceSynchronize();
}
First I commented out all the GPU code, changed myapp.cu to myapp.cpp so that it deals only with the CPU-side memory, and measured how long the CPU memory read takes (this version is compiled with g++).
You are copying data between the host and the device, yet I see no call to cudaMalloc() in your code that actually allocates device memory for ‘gpu_buffer’. This can’t work.
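For completeness, a minimal sketch of the device allocation that has to precede the copy, reusing the names from the snippet above (the element type is an assumption):

```cuda
// Allocate device memory before any cudaMemcpy into gpu_buffer.
char *gpu_buffer = NULL;   // element type assumed; match your data
gpuErrchk(cudaMalloc((void **)&gpu_buffer, bytes));

gpuErrchk(cudaMemcpy(gpu_buffer, cpu_buffer, bytes, cudaMemcpyHostToDevice));

// ... use the buffer ...
gpuErrchk(cudaFree(gpu_buffer));
```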
I recently found out what happened with this issue, though only partially. It turns out that the issue has nothing to do with cudaMemcpy(). When I changed the CPU-side memory mapping to fread(), the program ran without any problem. I still need to dig in to find out why the memory mapping takes so long.
For those who might need help with cudaMemcpy() later: first, make sure the destination buffer is at least as large as the amount of data being transferred; second, make sure the transfer direction is correct.
Lesson learnt: when debugging, verify that every component is working (in this case I assumed the memory mapping was working, when in fact it was not); otherwise you can spend far too much time debugging something that is behaving normally (I spent too long debugging cudaMemcpy()).
When you map a file to memory, it isn't read immediately; reading occurs only when you actually access the data. So your cudaMemcpy call is effectively performing fread + cudaMemcpy, which is much slower than the cudaMemcpy alone. In particular, HDDs read data at only 100-200 MB/s.