Two (newbie?) questions: asynchronous host->device memcpy + events

According to my measurements, it indeed doesn't support overlapping, even though they claim it can run copies and kernels in parallel.

I declare 2 streams: one executes the kernel, the other does the memcopy. The results indicate that the two do not run in parallel; the total time is the same as the serialized result.

But the same code run on a GTX 280 (compute capability 1.3) overlaps very clearly: the total time is determined by whichever of the kernel and the memcopy takes longer.
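For anyone who wants to reproduce this, here is a minimal sketch of the kind of two-stream test described above. It is not the original poster's code: `busy_kernel`, the buffer size, and the iteration count are made-up stand-ins. It assumes the host buffer must be pinned (`cudaMallocHost`) for `cudaMemcpyAsync` to be truly asynchronous, and it assumes the legacy default stream, so that events recorded in stream 0 bracket the work issued to the other two streams.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            return 1;                                                     \
        }                                                                 \
    } while (0)

// Hypothetical stand-in kernel: enough arithmetic per element to be measurable.
__global__ void busy_kernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 24;                 // assumed problem size
    const int iters = 1000;                // assumed work per element
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *h_src, *d_copy, *d_work;
    CHECK(cudaMallocHost((void **)&h_src, bytes));   // pinned host memory
    memset(h_src, 0, bytes);
    CHECK(cudaMalloc((void **)&d_copy, bytes));      // target of the copy
    CHECK(cudaMalloc((void **)&d_work, bytes));      // buffer the kernel works on
    CHECK(cudaMemset(d_work, 0, bytes));

    cudaStream_t copy_stream, kernel_stream;
    CHECK(cudaStreamCreate(&copy_stream));
    CHECK(cudaStreamCreate(&kernel_stream));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    float t_copy, t_kernel, t_both;

    // 1) Copy alone.
    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpyAsync(d_copy, h_src, bytes, cudaMemcpyHostToDevice, copy_stream));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&t_copy, start, stop));

    // 2) Kernel alone.
    CHECK(cudaEventRecord(start, 0));
    busy_kernel<<<blocks, threads, 0, kernel_stream>>>(d_work, n, iters);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&t_kernel, start, stop));

    // 3) Copy and kernel issued to different streams, on independent buffers.
    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpyAsync(d_copy, h_src, bytes, cudaMemcpyHostToDevice, copy_stream));
    busy_kernel<<<blocks, threads, 0, kernel_stream>>>(d_work, n, iters);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&t_both, start, stop));

    printf("memcopy: %.2f ms\nkernel:  %.2f ms\nboth:    %.2f ms\n",
           t_copy, t_kernel, t_both);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(copy_stream));
    CHECK(cudaStreamDestroy(kernel_stream));
    CHECK(cudaFree(d_copy));
    CHECK(cudaFree(d_work));
    CHECK(cudaFreeHost(h_src));
    return 0;
}
```

If overlap works, "both" should land near the larger of the two individual times; if it comes out near their sum, the copy and the kernel were serialized. The first kernel launch can carry some one-time startup cost, so treat single-run numbers as rough.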

Then what am I doing wrong? The exe in .\SDK\bin\win64\ didn’t work, and recompiling with -arch=sm_13 didn’t help either. I get:

running on: GeForce GTX 260
memcopy:        11.70
kernel:         18.07
non-streamed:   30.63 (29.77 expected)
4 streams:      30.48 (21.00 expected with compute capability 1.1 or later)
-------------------------------
Test PASSED
Press ENTER to exit...
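As a sanity check on the "expected" columns: the numbers above are consistent with a simple model in which the non-streamed case is memcopy + kernel, and the streamed case is kernel + memcopy / number of streams (only the first chunk of the copy is not hidden behind a kernel). That model is my reading of the output, not something stated in it, and the function names below are made up for illustration.

```cuda
#include <cstdio>

// Assumed model: copy and kernel run back to back.
static double expected_serial(double t_memcpy, double t_kernel)
{
    return t_memcpy + t_kernel;
}

// Assumed model: the copy is split into nstreams chunks and all but the
// first chunk overlap with kernel execution.
static double expected_streamed(double t_memcpy, double t_kernel, int nstreams)
{
    return t_kernel + t_memcpy / nstreams;
}

int main()
{
    const double t_memcpy = 11.70;  // "memcopy" line from the output above
    const double t_kernel = 18.07;  // "kernel" line from the output above

    printf("non-streamed expected: %.3f\n", expected_serial(t_memcpy, t_kernel));
    printf("4 streams expected:    %.3f\n", expected_streamed(t_memcpy, t_kernel, 4));
    return 0;
}
```

By hand: 11.70 + 18.07 = 29.77, and 18.07 + 11.70 / 4 = 20.995, which the output reports as 21.00. The measured 30.48 for 4 streams is essentially the non-streamed 30.63, so nothing overlapped on this setup.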

Those of you who are getting results that seem to indicate no support for overlapping: what platform are you on?

On Vista, overlapping is not possible. It is documented in the known issues section of the release notes (which nobody ever reads, apparently).