Why can't I overlap asynchronous memcpy with kernel execution on Fermi under Windows 7 and CUDA 5.0?

I cannot get memcpy and kernel execution to overlap even with the simpleStreams example in the CUDA SDK, let alone in my own programs. These threads argue that the cause is the WDDM driver on Windows:

and suggest the following:

  • flush the WDDM queue with cudaEventQuery() or cudaStreamQuery(). (Does not work.)
  • submit work to the streams in breadth-first order. (Does not work.)
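
For reference, the two suggestions combined look roughly like the sketch below. It is a minimal illustration, not the simpleStreams code: scaleKernel, runBreadthFirst, and the buffer layout are hypothetical names made up for this example. Whether any overlap actually shows up in the profiler depends on the driver model and the hardware, and on my setup it does not.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical placeholder kernel: scales each element in place.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Issues H2D copies, kernels, and D2H copies breadth first across nStreams
// streams, calling cudaStreamQuery() after each batch as a queue-flush hint.
// Returns the first error encountered, or cudaSuccess.
cudaError_t runBreadthFirst(int nStreams, int chunk)
{
    const int kMaxStreams = 16;
    if (nStreams < 1 || nStreams > kMaxStreams || chunk < 1)
        return cudaErrorInvalidValue;

    size_t chunkBytes = (size_t)chunk * sizeof(float);
    size_t total = (size_t)nStreams * chunk;
    float *hostBuf = NULL, *devBuf = NULL;
    cudaStream_t streams[kMaxStreams];

    // Async copies only overlap when the host buffer is pinned.
    cudaError_t err = cudaMallocHost((void **)&hostBuf, total * sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void **)&devBuf, total * sizeof(float));
    if (err != cudaSuccess) { cudaFreeHost(hostBuf); return err; }

    for (size_t i = 0; i < total; ++i) hostBuf[i] = 1.0f;
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    int threads = 256;
    int blocks = (chunk + threads - 1) / threads;

    // Stage 1: all host-to-device copies, one per stream.
    for (int s = 0; s < nStreams; ++s)
        cudaMemcpyAsync(devBuf + (size_t)s * chunk, hostBuf + (size_t)s * chunk,
                        chunkBytes, cudaMemcpyHostToDevice, streams[s]);
    cudaStreamQuery(streams[0]); // flush hint; cudaErrorNotReady is expected

    // Stage 2: all kernels.
    for (int s = 0; s < nStreams; ++s)
        scaleKernel<<<blocks, threads, 0, streams[s]>>>(
            devBuf + (size_t)s * chunk, chunk, 2.0f);
    cudaStreamQuery(streams[0]);

    // Stage 3: all device-to-host copies.
    for (int s = 0; s < nStreams; ++s)
        cudaMemcpyAsync(hostBuf + (size_t)s * chunk, devBuf + (size_t)s * chunk,
                        chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    cudaStreamQuery(streams[0]);

    err = cudaDeviceSynchronize();
    // Sanity check on the result: every element should now be 2.0f.
    if (err == cudaSuccess && hostBuf[total - 1] != 2.0f) err = cudaErrorUnknown;

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return err;
}
```

Calling runBreadthFirst(4, 1 << 20) from main and inspecting the timeline in the Visual Profiler shows whether the copies and kernels of different streams overlap or are serialized.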

This thread argues it is a bug in Fermi:

While this thread:

proposes a way to mitigate the problems with WDDM on Windows. However, it only works for Tesla cards, and it requires an additional video card to drive the display, since the proposed drivers are compute-only.
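
To see which situation a given card is in, the device properties can be queried. This is a small sketch, assuming that the compute-only drivers mentioned above correspond to what the CUDA runtime reports as TCC mode in the tccDriver field; reportOverlapSupport is a hypothetical helper name.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints the device properties relevant to copy/kernel overlap.
cudaError_t reportOverlapSupport(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    // Number of copy engines; 0 means copies cannot overlap with kernels.
    printf("  asyncEngineCount: %d\n", prop.asyncEngineCount);
    // 1 if the device runs under the TCC driver, 0 otherwise (e.g. WDDM).
    printf("  tccDriver:        %d\n", prop.tccDriver);
    return cudaSuccess;
}
```

If asyncEngineCount is nonzero and tccDriver is 0, the hardware has the copy engines needed for overlap but is running under the display driver model that the threads above blame.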

However, none of these threads provides a real solution. I would appreciate it if NVIDIA could comment on this problem and come up with a solution, since apparently a lot of people are experiencing it.