I cannot get memcpy and kernel execution to overlap even with the simpleStreams example in the CUDA SDK, let alone in my own programs. These threads argue it is a problem with the WDDM driver on Windows:
- http://stackoverflow.com/questions/12397798/why-it-is-not-possible-to-overlap-memhtod-with-gpu-kernel-with-gtx-590
- http://stackoverflow.com/questions/13568805/cuda-kernels-not-launching-before-cudadevicesynchronize/13570086#13570086
and suggest to:
- flush the WDDM queue with cudaEventQuery(). (Does not work.)
- submit streams in a breadth-first manner. (Does not work.)
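For reference, here is a minimal sketch of what those two suggestions amount to when combined. It is not the code from either thread: the kernel `scale`, the function `run_breadth_first_overlap`, and the sizes are made up for illustration. It issues all host-to-device copies, then all kernels, then all device-to-host copies (breadth-first), and calls cudaEventQuery() after each batch as the queue "flush". The code only checks that the results are correct; whether the copies and kernels actually overlap has to be checked in a profiler timeline.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel, used only to give each stream some work to do.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 0 if every element was processed correctly, 1 otherwise.
// Assumes num_streams >= 1 and chunk >= 1.
int run_breadth_first_overlap(int num_streams, int chunk)
{
    const size_t total = (size_t)num_streams * chunk;
    const size_t chunk_bytes = (size_t)chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    // Pinned host memory is required for cudaMemcpyAsync to be asynchronous.
    if (cudaMallocHost(&h, total * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d, total * sizeof(float)) != cudaSuccess) { cudaFreeHost(h); return 1; }
    for (size_t i = 0; i < total; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(num_streams);
    for (auto &s : streams) cudaStreamCreate(&s);
    cudaEvent_t flush;
    cudaEventCreate(&flush);

    // Breadth-first, step 1: all host-to-device copies.
    for (int s = 0; s < num_streams; ++s)
        cudaMemcpyAsync(d + (size_t)s * chunk, h + (size_t)s * chunk,
                        chunk_bytes, cudaMemcpyHostToDevice, streams[s]);
    // The suggested "flush": query an event so the driver submits queued work.
    // The return value is ignored on purpose; cudaErrorNotReady is expected.
    cudaEventRecord(flush, streams.back());
    cudaEventQuery(flush);

    // Step 2: all kernels.
    for (int s = 0; s < num_streams; ++s)
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            d + (size_t)s * chunk, chunk, 2.0f);
    cudaEventRecord(flush, streams.back());
    cudaEventQuery(flush);

    // Step 3: all device-to-host copies.
    for (int s = 0; s < num_streams; ++s)
        cudaMemcpyAsync(h + (size_t)s * chunk, d + (size_t)s * chunk,
                        chunk_bytes, cudaMemcpyDeviceToHost, streams[s]);

    int status = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : 1;
    for (size_t i = 0; i < total; ++i)
        if (h[i] != 2.0f) status = 1;

    cudaEventDestroy(flush);
    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return status;
}
```

Calling run_breadth_first_overlap(4, 1 << 20) from main and looking at the resulting timeline is enough to see whether the copies in one stream run alongside the kernels in another.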
This thread argues it is a bug in Fermi:
While this thread:
proposes a solution to mitigate the problems with WDDM on Windows. However, it only works for Tesla cards, and it requires an additional video card to drive the display, since the proposed drivers are compute-only.
However, none of these threads provides a real solution. I would appreciate it if NVIDIA could comment on this problem and come up with a solution, since apparently a lot of people are experiencing it.