cudaMemcpyAsync question: overlapping HostToDevice and DeviceToHost transfers

Can a DeviceToHost transfer on stream1 be overlapped with a HostToDevice transfer on stream2 when both are issued with cudaMemcpyAsync?

I ask because, as I understand it, PCIe bandwidth is available in each direction independently:
“PCIe 1.x is often quoted to support a data rate of 250 MB/s in each direction, per lane… This means a sixteen lane (x16) PCIe card would then be theoretically capable of 250 MB/s * 16 = 4 GB/s in each direction.”

I know that either of these cudaMemcpyAsync transfers can, individually, be overlapped with kernel execution on a third stream (stream3, say).
I tried modifying the simpleStreams sample code, but it serialized the DeviceToHost (for stream1) and HostToDevice (for stream2) transfers. I could be missing something.
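For reference, the experiment described above can be sketched roughly as follows. This is not the modified simpleStreams code, only a minimal stand-alone version of the same idea: time each copy alone, then time both issued back to back on two streams. The buffer size, the variable names, the `looksOverlapped` helper, and its 0.75 threshold are all assumptions made for illustration. Timing uses a host clock around cudaDeviceSynchronize, so the numbers are approximate.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "%s failed: %s\n", #call,                    \
                    cudaGetErrorString(err_));                           \
            exit(1);                                                     \
        }                                                                \
    } while (0)

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since t0, measured on the host.
static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

// Hypothetical heuristic: if the combined time is well below the sum of the
// individual times, the two copies probably overlapped. 0.75 is arbitrary.
static bool looksOverlapped(double d2hMs, double h2dMs, double bothMs) {
    return bothMs < 0.75 * (d2hMs + h2dMs);
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per transfer (arbitrary choice)

    // cudaMemcpyAsync is only asynchronous with page-locked host memory.
    float *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaMallocHost((void**)&hIn, bytes));
    CHECK(cudaMallocHost((void**)&hOut, bytes));
    CHECK(cudaMalloc((void**)&dIn, bytes));
    CHECK(cudaMalloc((void**)&dOut, bytes));
    memset(hIn, 0, bytes);
    CHECK(cudaMemset(dOut, 0, bytes));

    cudaStream_t stream1, stream2;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));

    // DeviceToHost alone on stream1.
    CHECK(cudaDeviceSynchronize());
    Clock::time_point t0 = Clock::now();
    CHECK(cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, stream1));
    CHECK(cudaDeviceSynchronize());
    const double d2hMs = msSince(t0);

    // HostToDevice alone on stream2.
    t0 = Clock::now();
    CHECK(cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, stream2));
    CHECK(cudaDeviceSynchronize());
    const double h2dMs = msSince(t0);

    // Both copies issued back to back on different streams.
    t0 = Clock::now();
    CHECK(cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, stream1));
    CHECK(cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, stream2));
    CHECK(cudaDeviceSynchronize());
    const double bothMs = msSince(t0);

    printf("D2H alone: %.2f ms, H2D alone: %.2f ms, both: %.2f ms\n",
           d2hMs, h2dMs, bothMs);
    printf("Transfers %s\n", looksOverlapped(d2hMs, h2dMs, bothMs)
                                 ? "appear to overlap"
                                 : "appear to be serialized");

    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return 0;
}
```

If the combined time is close to the sum of the two individual times, the copies were serialized, which matches what I observed.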
Thank you for any insights.


As I understand section 3.2.6 of the CUDA Programming Guide, you can only overlap kernel execution with memory copies, not two memory copies with each other.

You can only overlap one memcpy with one kernel; this is a hardware limitation.
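Since the limit comes from the hardware, it can be checked per device. The sketch below assumes a toolkit recent enough that cudaDeviceProp has the asyncEngineCount field, which may be newer than the toolkit discussed in this thread. As I understand the Programming Guide, a value of 1 means one copy can overlap with kernel execution, and a value of 2 means copies in both directions can also overlap with each other. The helper name `supportsBidirectionalCopy` is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: two or more copy engines are needed for a HostToDevice
// and a DeviceToHost copy to run at the same time.
static bool supportsBidirectionalCopy(int asyncEngineCount) {
    return asyncEngineCount >= 2;
}

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        // asyncEngineCount is the number of copy engines the device reports.
        printf("Device %d (%s): asyncEngineCount = %d, bidirectional copy overlap %s\n",
               dev, prop.name, prop.asyncEngineCount,
               supportsBidirectionalCopy(prop.asyncEngineCount)
                   ? "possible"
                   : "not possible");
    }
    return 0;
}
```

On a device that reports a single copy engine, the serialization described in the question is the expected result.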