Overlapping cudaMemcpyAsync and kernel execution

I have access to a GTX200-based card.

I am trying to use streams on it so that computation and communication overlap.
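
For context, the streamed pattern in question looks roughly like the sketch below. This is not the actual program, only a minimal illustration modeled on the way simpleStreams issues its work: all kernels first, then all asynchronous copies. The kernel `fill_value`, the chunk size, and the stream count are assumptions made for the example, and the timing relies on the legacy default-stream behaviour, where an event recorded in stream 0 waits for work in the other streams.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of one chunk.
__global__ void fill_value(int *chunk, int count, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) chunk[i] = value;
}

int main() {
    const int nstreams = 8;      // assumption: same stream count as the 8-stream run
    const int count = 1 << 20;   // assumption: elements handled per stream
    const size_t chunkBytes = count * sizeof(int);

    int *h_data = NULL, *d_data = NULL;
    // Page-locked host memory: with pageable memory the copy is not asynchronous.
    cudaMallocHost((void **)&h_data, nstreams * chunkBytes);
    cudaMalloc((void **)&d_data, nstreams * chunkBytes);

    cudaStream_t streams[nstreams];
    for (int s = 0; s < nstreams; ++s) cudaStreamCreate(&streams[s]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // Issue every kernel first, each in its own stream...
    for (int s = 0; s < nstreams; ++s)
        fill_value<<<(count + 255) / 256, 256, 0, streams[s]>>>(d_data + s * count, count, s);
    // ...then the copies, so the copy in stream s may overlap kernels of later streams.
    for (int s = 0; s < nstreams; ++s)
        cudaMemcpyAsync(h_data + s * count, d_data + s * count, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);

    // Recorded in stream 0, so it completes only after the streamed work above.
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    printf("%d streams: %.2f ms\n", nstreams, ms);

    for (int s = 0; s < nstreams; ++s) cudaStreamDestroy(streams[s]);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```

If the copies and kernels overlap, the elapsed time should be noticeably smaller than the sum of the separately measured copy and kernel times.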

Sadly, my program does not seem to overlap anything.

I then tried the simpleStreams example, and it does not show any overlap here either.

Is there something I need to do to make sure that communication and computation actually overlap?

The card is correctly reported as compute capability 1.1 by the device property example…
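
As far as I can tell, the compute capability alone does not settle the question: the runtime reports copy/kernel overlap support separately, in the `deviceOverlap` field of `cudaDeviceProp`. A minimal sketch of that check follows, assuming the card under test is device 0. Newer toolkits also report `asyncEngineCount`, which is left out here to stay compatible with old runtimes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the card under test is device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // Non-zero means the device can copy memory while a kernel is running.
    printf("deviceOverlap = %d\n", prop.deviceOverlap);
    return 0;
}
```

A value of 0 here would mean that no overlap can be expected on this device, whatever the code does with streams.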

Here is the summary of the output given by simpleStreams:

  • memcpy time: 26.81
  • kernel time: 24.55
  • non-streamed: 51.22 (expected 51.36 = 24.55 + 26.81)
  • 8 streams: 49.42 (expected 27.90 = 24.55 + 26.81 / 8)
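
To make the expected figures in this summary explicit, here is a small host-only sketch of the two timing models. The function names are mine, not part of simpleStreams, and the code builds as plain C++ or inside a .cu file. The streamed model assumes that only one chunk's copy (1/nstreams of the total) is left exposed once the rest is hidden behind kernel execution.

```cpp
#include <cassert>
#include <cmath>

// Serial model: the copy and the kernel run back to back.
double expected_serial(double kernel_time, double memcpy_time) {
    return kernel_time + memcpy_time;
}

// Overlap model: kernel time plus the copy time of a single chunk.
double expected_streamed(double kernel_time, double memcpy_time, int nstreams) {
    return kernel_time + memcpy_time / nstreams;
}
```

With the measured values, `expected_serial(24.55, 26.81)` gives 51.36 and `expected_streamed(24.55, 26.81, 8)` gives about 27.90, so the measured 49.42 is much closer to the serial model than to the overlapped one.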

Is there a way (something like a hardware counter) to check whether the overlap actually takes place?