Why can't I overlap asynchronous memcpy with kernel execution on Fermi on Windows 7 with CUDA 5.0?

lucv · May 29, 2013, 11:32pm

I cannot get memcpy and kernel execution to overlap even with the simpleStreams example in the CUDA SDK, let alone in my own programs. These threads argue that it is a problem with the WDDM driver on Windows:

and suggest the following:

Flush the WDDM queue with cudaEventQuery() or cudaStreamQuery(). (Does not work.)
Submit work to the streams in a breadth-first manner. (Does not work.)
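
For reference, the two suggestions above can be combined roughly as in the sketch below. This is only a minimal illustration, not the simpleStreams code: the kernel `scale`, the `CHECK` macro, the stream count, and the chunk size are all hypothetical names and values chosen for the example, and using cudaStreamQuery() as the non-blocking "flush" call is an assumption based on the advice in those threads. As noted above, this pattern did not produce overlap on my setup.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to give each stream some work to do.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: abort main() on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const int nStreams = 4;          // assumed value for illustration
    const int chunk = 1 << 20;       // elements per stream (assumed)
    const size_t bytes = chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    // Async copies can only overlap when the host buffer is pinned.
    CHECK(cudaMallocHost(&h, nStreams * bytes));
    CHECK(cudaMalloc(&d, nStreams * bytes));
    for (int i = 0; i < nStreams * chunk; ++i) h[i] = 1.0f;

    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; ++i) CHECK(cudaStreamCreate(&s[i]));

    // Breadth-first submission: all H2D copies, then all kernels,
    // then all D2H copies, instead of finishing one stream at a time.
    for (int i = 0; i < nStreams; ++i) {
        CHECK(cudaMemcpyAsync(d + i * chunk, h + i * chunk, bytes,
                              cudaMemcpyHostToDevice, s[i]));
        // Non-blocking query, intended to nudge the WDDM driver into
        // submitting its batched commands. The result is ignored because
        // cudaErrorNotReady is expected while work is still pending.
        cudaStreamQuery(s[i]);
    }
    for (int i = 0; i < nStreams; ++i) {
        scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + i * chunk, chunk, 2.0f);
        cudaStreamQuery(s[i]);
    }
    for (int i = 0; i < nStreams; ++i) {
        CHECK(cudaMemcpyAsync(h + i * chunk, d + i * chunk, bytes,
                              cudaMemcpyDeviceToHost, s[i]));
        cudaStreamQuery(s[i]);
    }

    CHECK(cudaDeviceSynchronize());
    printf("h[0] = %f\n", h[0]);

    for (int i = 0; i < nStreams; ++i) CHECK(cudaStreamDestroy(s[i]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

Whether the copies and kernels actually overlap has to be checked in a profiler timeline; the program produces the same results either way.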

This thread argues that it is a bug in Fermi:

http://stackoverflow.com/questions/14456236/how-can-i-overlap-memory-transfers-and-kernel-execution-in-a-cuda-application

While this thread:

http://blog.icare3d.org/2010/04/tesla-compute-drivers.html

proposes a way to mitigate the problems with WDDM on Windows. However, it only works for a Tesla card, and it requires an additional video card to drive the display, since the proposed drivers are compute-only.

None of these threads provides a real solution, however. I would appreciate it if NVIDIA could comment on this problem and offer a solution, since many people appear to be experiencing it.
