cudaMemcpy2D hangs on Host -> Device copy (for GPU 0, but not for GPU 1)

I have a strange problem: my cudaMemcpy2D call hangs (never returns) when copying from host to device. I am quite sure that I got all the parameters for the routine right.

The really strange thing is that the routine works properly (does not hang) on GPU 1 (GTX 770, CC 3.0), whereas on GPU 0 (GTX 960, CC 5.X) it hangs.

Windows 64-bit, CUDA Toolkit 5.0, newest drivers (March 2015).

Any tips on how to debug such an issue?

And GPU 0 is otherwise fully functional? Can you successfully run the CUDA sample programs on it? As for the hanging cudaMemcpy2D(), is it the first such copy or the millionth such copy? Does the behavior change if you make the copy very small? Does nvidia-smi show anything unusual about GPU 0? What kind of system is this, and could a misconfigured BIOS be messing with the PCIe setup for the slots the GPUs are in? What happens if you physically swap the two GPUs?

I am just brainstorming here. I have never seen a cudaMemcpy() of any kind hang [never return]. My guess would be that if the issue isn’t hardware related, some relevant data structures are being corrupted; whether that would happen in your application or in the CUDA software stack is unclear. Is there an equivalent to valgrind on Windows? If so, I would suggest running the application with that to see whether there are out-of-bounds accesses or incorrect malloc/free scenarios.

Can you show the relevant part of the code?

How about:

  • cuda-memcheck --report-api-errors all
  • nvprof --print-gpu-trace


Is your host waiting on an event or callback that follows the memcpy, or on some other operation before or after the memcpy?

Are you using stream(s) other than the default?

The error code returned by a cudaStreamSynchronize() before and after the memcpy might reveal something.

The CUDA samples work fine on the GTX 960. The hang occurs on the first call to cudaMemcpy2D.

The program workflow is more or less like this:

  • allocate ‘device’ buffer for image A
  • allocate ‘host’ buffer for image A
  • copy from ‘host’ buffer for image A to ‘device’ buffer for image A.
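Since the copy in the last step is where things go wrong, one cheap thing to rule out is the pitch and width arguments. The following is only a host-side sketch: `checkMemcpy2DArgs` and `rowBytes` are hypothetical helpers and not part of CUDA. They encode two documented requirements of cudaMemcpy2D: the width is given in bytes, and it must not exceed either dpitch or spitch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Row width in bytes for a given pixel count. cudaMemcpy2D expects its
// width argument in bytes, not in pixels.
inline std::size_t rowBytes(std::size_t widthPixels, std::size_t bytesPerPixel) {
    assert(bytesPerPixel > 0);
    return widthPixels * bytesPerPixel;
}

// Hypothetical helper (not a CUDA function): sanity-checks the values that
// would be passed as dst, dpitch, src, spitch and width to cudaMemcpy2D.
// Returns an empty string if nothing looks wrong, else a short description.
inline std::string checkMemcpy2DArgs(const void* dst, std::size_t dpitch,
                                     const void* src, std::size_t spitch,
                                     std::size_t widthBytes) {
    if (dst == nullptr || src == nullptr) return "null pointer";
    if (widthBytes > dpitch) return "width exceeds dpitch";
    if (widthBytes > spitch) return "width exceeds spitch";
    return "";
}
```

For a float image that is 100 pixels wide, a source pitch of 400 bytes and a device pitch of 512 bytes pass the check, whereas a source pitch of 100 (a pixel count passed where bytes are expected) is flagged. A check like this cannot explain a hang by itself; it only removes one class of mistakes from the list.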

Unfortunately, I cannot give a self-contained code example here.
This is because we have a complex framework which does ‘lazy’ transfers under the hood. Image A actually has two buffers, device and host, plus some flags which tell which buffer is the valid one. The cudaMemcpy2D mentioned above is done automatically under the hood when image A is requested for reading on the device.
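To illustrate the idea (not our actual framework code), here is a minimal sketch of such a lazy image. The class name `LazyImage` and its methods are made up for this post, and the ‘device’ buffer is just a second host vector standing in for GPU memory, so the ‘upload’ is a plain copy where the real code would call cudaMemcpy2D.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of an image with a host and a "device" buffer and a
// flag that tracks whether the device copy is up to date.
class LazyImage {
public:
    explicit LazyImage(std::size_t numBytes)
        : host_(numBytes), device_(numBytes), deviceValid_(false), uploads_(0) {
        assert(numBytes > 0);
    }

    // Writable access on the host: the device copy becomes stale.
    std::vector<unsigned char>& hostWrite() {
        deviceValid_ = false;
        return host_;
    }

    // Read access on the device: upload first, but only if the copy is stale.
    const std::vector<unsigned char>& deviceRead() {
        if (!deviceValid_) {
            device_ = host_;   // stand-in for the host -> device cudaMemcpy2D
            ++uploads_;
            deviceValid_ = true;
        }
        return device_;
    }

    // Number of uploads performed so far (for illustration only).
    int uploadCount() const { return uploads_; }

private:
    std::vector<unsigned char> host_;
    std::vector<unsigned char> device_;  // assumption: simulates GPU memory
    bool deviceValid_;
    int uploads_;
};
```

Writing through `hostWrite()` and then calling `deviceRead()` twice triggers exactly one upload; a further host write makes the next device read upload again. The sketch leaves out device-side writes and the transfer back to the host.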

Furthermore, CUDA API functions (and our own functions which launch a GPU kernel) are never called directly in our framework. Instead, one ‘GPUWorker’ object is responsible for each GPU. This object receives ‘jobs’ to be done (via boost::bind) and executes them in a serialized way. This mechanism ensures CPU thread safety (typically, code that launches CUDA kernels is not thread-safe, as it usually uses constant memory, texture references or other non-thread-safe state).
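A rough sketch of how such a worker can look, again not our real code: the name `SerialWorker` is made up, and it uses std::function, std::thread and std::packaged_task from the standard library in place of boost::bind. One thread owns the queue and runs the jobs one after another in submission order.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical sketch: one worker thread that executes submitted jobs
// strictly one at a time, in the order they were submitted.
class SerialWorker {
public:
    SerialWorker() : stop_(false), thread_([this] { run(); }) {}

    ~SerialWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();  // remaining jobs are drained before the thread exits
    }

    // Enqueue a job. The returned future becomes ready once the job has run.
    std::future<void> submit(std::function<void()> job) {
        assert(job != nullptr);
        auto task = std::make_shared<std::packaged_task<void()>>(std::move(job));
        std::future<void> done = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return done;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop requested and queue is empty
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // runs outside the lock so that submit() is never blocked
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_;
    std::thread thread_;  // declared last so the other members exist before it starts
};
```

A caller would write something like `worker.submit([&] { /* CUDA calls for this GPU */ }).get();` to run a job and wait for it. Note that in a design like this, a job that waits on another job of the same worker would never return; I have not verified whether anything like that plays a role in the hang.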

I wrote a self-contained example where the CUDA API calls are made without a GPUWorker object, but ‘unfortunately’ that example works.

And the strangest thing is that the same test code actually WORKS (does not hang) with Visual Studio 2013 and CUDA Toolkit 7.0! I also checked the resulting image (after the host->device copy), and it looks fine. No error code is reported by cudaMemcpy2D either.

Since we are currently migrating everything to VS 2013 and CUDA Toolkit 7.0, I will not investigate this issue further and will simply consider it a strange ‘artifact’ of the combination of CUDA Toolkit 5.0, a Maxwell card and the newest drivers. We have seen this kind of thing before (strange errors on certain cards with certain drivers).