The CUDA samples work fine on the GTX 960. The hang occurs on the first call to cudaMemcpy2D.
The program workflow is more or less like this:
- allocate ‘device’ buffer for image A
- allocate ‘host’ buffer for image A
- copy from ‘host’ buffer for image A to ‘device’ buffer for image A.
Unfortunately, I cannot give self-contained code here.
This is because we have a complex framework which does ‘lazy’ transfers under the hood. Image A actually has two buffers - device and host - plus flags that tell which buffer is the valid one. The cudaMemcpy2D mentioned above is issued automatically under the hood when image A is requested for reading on the device.
Furthermore, in our framework no CUDA API functions (or our own functions which launch GPU kernels) are called directly. Instead, one ‘GPUWorker’ object is responsible for each GPU. This object receives ‘jobs’ (via boost::bind) and executes them in a serialized way. This mechanism ensures CPU-thread safety (CUDA kernels are typically not thread-safe, as they usually use constant memory, texture references, or other non-thread-safe state).
I wrote a self-contained example where the CUDA API calls are made without a GPUWorker object, but ‘unfortunately’ that example works.
And the strangest thing is: the same test code actually WORKS (does not hang) with Visual Studio 2013 and CUDA Toolkit 7.0! I also checked the resulting image (after the host->device copy), and it looks fine; no error code is reported by cudaMemcpy2D either.
So, as we are currently migrating everything to VS 2013 and CUDA Toolkit 7.0, I will not investigate this issue further and will simply consider it a strange ‘artifact’ of the combination of CUDA Toolkit 5.0, a Maxwell card, and the newest drivers. We have seen such things before (strange errors on certain cards with certain drivers).