cudaMemcpy takes more than 2 seconds, then the driver crashes.

Hello community,

I have been stuck on this problem for days and have tried different approaches to solve it. If anyone has had a similar experience, please leave a comment.

My application calls cudaMemcpy inside a loop of two iterations. Below is a code snippet that illustrates the issue.

int num = 2;
for (int i = 0; i < num; i++) {
    gpuErrchk(cudaMemcpy(gpu_buffer, cpu_buffer, bytes, cudaMemcpyHostToDevice));
    cudaDeviceSynchronize();
}

gpuErrchk is the error-checking wrapper. gpu_buffer is larger than cpu_buffer, i.e., gpu_buffer is big enough to hold the data being copied.
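
For reference, gpuErrchk is essentially the usual CUDA error-checking wrapper, roughly along these lines (a sketch of what I use, not copied verbatim from my code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check a CUDA runtime call and abort with a readable message on failure.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}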

Approaches I tried:

  1. I used nvprof --print-gpu-trace --print-api-trace ./my_app and found that the first cudaMemcpy takes around 2.7 seconds.
    http://stackoverflow.com/questions/29548456/my-cuda-nvprof-api-trace-and-gpu-trace-are-not-synchronized-what-to-do

  2. I searched and found that the watchdog timer triggers if execution time exceeds a threshold (e.g., several seconds). In my case, the driver crashed when the first cudaMemcpy took about 2 seconds, and I had to reboot the machine to reset the driver.

  3. I checked the CPU and GPU buffer sizes to make sure the GPU buffer is large enough to hold the data. I also inserted cudaDeviceSynchronize() after the cudaMemcpy (even though cudaMemcpy already synchronizes for this kind of transfer). A sketch of this check appears after this list.

  4. Normally this cudaMemcpy finishes in milliseconds, so taking around 2.7 seconds indicates that something is wrong.
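
For item 3, a stand-alone test along these lines shows what I mean by checking the sizes and adding the explicit synchronization (this is a sketch with a dummy host buffer and illustrative sizes, not my actual application code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t cpu_bytes = 64UL * 1024 * 1024;    // assumed source size, for illustration
    const size_t gpu_bytes = 128UL * 1024 * 1024;   // destination deliberately larger

    char *cpu_buffer = (char *)malloc(cpu_bytes);
    void *gpu_buffer = NULL;
    if (cudaMalloc(&gpu_buffer, gpu_bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // cudaMemcpy from pageable host memory returns only after the copy is done;
    // the extra cudaDeviceSynchronize() just makes sure any earlier asynchronous
    // error is reported here rather than later.
    cudaError_t err = cudaMemcpy(gpu_buffer, cpu_buffer, cpu_bytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("copied %zu bytes into a %zu-byte device buffer\n", cpu_bytes, gpu_bytes);
    cudaFree(gpu_buffer);
    free(cpu_buffer);
    return 0;
}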

My system configuration:
OS: Ubuntu 14.04.1
GPU: GTX 450
Nvidia Driver Version: 352.63

I hope you can understand the issue from my description. If anything is unclear, please leave a comment.
Thank you.

In order for others to assist you in your debugging efforts it would be extremely helpful if you could post a minimal, complete, compilable and buildable program that reproduces the issue. Any other approach typically leads to lengthy, and often unproductive, forum threads.

Are you sure it is the cudaMemcpy() that takes more than two seconds? Based on your description it seems possible that a previous kernel launch exceeded the time limit and was killed by the watchdog timer, but that this is not detected until a subsequent, synchronizing, call to cudaMemcpy().

I assume you have already performed standard due diligence on your hardware, such as making sure that the GPU is plugged into the correct PCIe slot, the card is firmly seated in the slot, all GPU power connectors are properly connected, the fan on the GPU is operating properly, and nvidia-smi does not show any exceptional operating conditions.

After a long time debugging, I finally figured out why the first cudaMemcpy takes over 2 seconds. I commented out all the kernels of my application and left only the memory allocation and memory copy. This leads to the following scenarios, which look interesting. My application reads an external dataset, so it may not be practical to post reproducible code, but the code snippets below should explain what the issue is.

The original code, in a file named myapp.cu:

int num = 2;
for (int i = 0; i < num; i++) {
    read_cpu_buffer();
    gpuErrchk(cudaMemcpy(gpu_buffer, cpu_buffer, bytes, cudaMemcpyHostToDevice));
    cudaDeviceSynchronize();
}
  1. First, I commented out all the GPU code and changed myapp.cu to myapp.cpp so it deals only with the CPU-side memory, and measured how long the CPU memory read takes (compiled with g++):
int num = 2;
for (int i = 0; i < num; i++) {
    timer->start();
    read_cpu_buffer();
    timer->stop();
    //gpuErrchk(cudaMemcpy(gpu_buffer, cpu_buffer, bytes, cudaMemcpyHostToDevice));
    //cudaDeviceSynchronize();
}

It takes 0.015 ms and 1.491 ms for the two iterations.

  2. Then I uncommented the cudaMemcpy and changed myapp.cpp back to myapp.cu (compiled with nvcc):
int num = 2;
for (int i = 0; i < num; i++) {
    timer->start();
    read_cpu_buffer();
    timer->stop();
    gpuErrchk(cudaMemcpy(gpu_buffer, cpu_buffer, bytes, cudaMemcpyHostToDevice));
    cudaDeviceSynchronize();
}

The output is 2743 ms and 1.678 ms. This is why the first cudaMemcpy takes more than 2 seconds.
I use gettimeofday() to implement the timer.
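
The timer is roughly the following (the class and member names here are illustrative, not my exact code):

#include <sys/time.h>

// Wall-clock timer based on gettimeofday(), reporting milliseconds.
struct Timer {
    struct timeval t0, t1;
    void start() { gettimeofday(&t0, NULL); }
    void stop()  { gettimeofday(&t1, NULL); }
    double elapsed_ms() const {
        return (t1.tv_sec - t0.tv_sec) * 1000.0 +
               (t1.tv_usec - t0.tv_usec) / 1000.0;
    }
};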

What is the reason for the difference between the two scenarios above? Why does the second one, with the cudaMemcpy, take 2743 ms for the first iteration, whereas the first scenario takes only 0.015 ms?

Hope my explanation of this issue is clear.

Thank you.

You are copying data between the host and the device, yet I see no call to cudaMalloc() in your code that actually allocates device memory for ‘gpu_buffer’. This can’t work.

cudaMalloc() is called in the real code but was not included in the snippet; otherwise the code would not compile.

I recently found out, though only partially, what is happening with this issue.

It turns out that this issue has nothing to do with cudaMemcpy(). When I changed the CPU-side memory mapping to fread(), the program ran without any problem. I still need to dig in to find out why the memory mapping takes such a long time.
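
Roughly, the change was to replace the mmap-based loading with a plain fread() that pulls the whole file into host memory up front (the function and variable names below are illustrative, not my exact code):

#include <cstdio>
#include <cstdlib>

// Read the whole file into a heap buffer so the data is already resident in
// host memory before cudaMemcpy touches it.
static void *read_cpu_buffer(const char *path, size_t bytes)
{
    void *buf = malloc(bytes);
    FILE *f = fopen(path, "rb");
    if (f == NULL || fread(buf, 1, bytes, f) != bytes) {
        fprintf(stderr, "failed to read %s\n", path);
        exit(1);
    }
    fclose(f);
    return buf;
}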

For those who might need help later when using cudaMemcpy(): first, make sure the destination buffer is at least as large as the amount of data being transferred; second, make sure the transfer direction (the cudaMemcpyKind argument) is correct.

Lesson learned: when debugging, make sure every component is working (in this case I assumed the memory mapping was working when in fact it was not); otherwise you can spend too much time debugging something that is actually fine (I spent too much time debugging cudaMemcpy()).

When you map a file to memory, it isn't read immediately; the reading occurs only when you actually access the data. So your cudaMemcpy call is essentially performing fread + cudaMemcpy, which is much slower than cudaMemcpy by itself. In particular, HDDs read data at around 100-200 MB/s.
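
To illustrate (a sketch with made-up file and variable names, not the original code): with mmap the disk read is deferred until the pages are first touched, so you can either hint the kernel to read ahead or touch the pages yourself before the copy if you want cudaMemcpy to measure only the PCIe transfer:

#include <cstdio>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>

int main()
{
    const char *path = "dataset.bin";                 // hypothetical input file
    int fd = open(path, O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) {
        fprintf(stderr, "cannot open %s\n", path);
        return 1;
    }
    size_t bytes = (size_t)st.st_size;

    // mmap does not read the file here; pages are faulted in lazily on first access.
    char *cpu_buffer = (char *)mmap(NULL, bytes, PROT_READ, MAP_PRIVATE, fd, 0);

    // Option 1: hint the kernel to start reading the file ahead of time.
    madvise(cpu_buffer, bytes, MADV_WILLNEED);

    // Option 2: touch every page so the disk read happens here, not inside cudaMemcpy.
    volatile char sink = 0;
    for (size_t off = 0; off < bytes; off += 4096)
        sink += cpu_buffer[off];

    void *gpu_buffer = NULL;
    cudaMalloc(&gpu_buffer, bytes);
    cudaMemcpy(gpu_buffer, cpu_buffer, bytes, cudaMemcpyHostToDevice);  // now only the PCIe transfer

    cudaFree(gpu_buffer);
    munmap(cpu_buffer, bytes);
    close(fd);
    return 0;
}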