Can a DeviceToHost transfer (on stream1) be overlapped with a HostToDevice transfer (on stream2) when both are issued with cudaMemcpyAsync on different streams?
I ask because, as I understand it, PCIe carries traffic in both directions at once:
“PCIe 1.x is often quoted to support a data rate of 250 MB/s in each direction, per lane… This means a sixteen lane (x16) PCIe card would then be theoretically capable of 250 MB/s * 16 = 4 GB/s in each direction.”
I know that either of the above cudaMemcpyAsync transfers, taken individually, can be overlapped with kernel execution on a third stream (say, stream3).
I tried modifying the simpleStreams sample code, but it serialized the DeviceToHost (for stream1) and HostToDevice (for stream2) transfers. I could be missing something.
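For reference, here is a minimal sketch of the kind of test I mean. It is not my actual simpleStreams modification. The helper names (canOverlapBothDirections, timeCopies, Buffers) and the 64 MiB buffer size are made up for illustration. It assumes pinned host buffers from cudaMallocHost and two non-default streams, and, as far as I understand, the device would also need to report an asyncEngineCount of at least 2 for copies in opposite directions to run concurrently.

```cuda
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Return 1 from main with a message if a CUDA runtime call fails.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s\n", #call,                 \
                         cudaGetErrorString(err_));                        \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Assumption: two or more copy engines are needed for bidirectional overlap.
bool canOverlapBothDirections(int asyncEngineCount) {
    return asyncEngineCount >= 2;
}

// Hypothetical bundle of the buffers and streams used by the test.
struct Buffers {
    void *hIn, *hOut;     // pinned host memory
    void *dSrc, *dDst;    // device memory
    size_t bytes;
    cudaStream_t s1, s2;  // s1 carries D2H, s2 carries H2D
};

// Enqueue the selected copies, wait for them, and report wall-clock time.
static cudaError_t timeCopies(const Buffers &b, bool d2h, bool h2d, double *ms) {
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) return err;
    auto t0 = std::chrono::steady_clock::now();
    if (d2h) {
        err = cudaMemcpyAsync(b.hOut, b.dSrc, b.bytes, cudaMemcpyDeviceToHost, b.s1);
        if (err != cudaSuccess) return err;
    }
    if (h2d) {
        err = cudaMemcpyAsync(b.dDst, b.hIn, b.bytes, cudaMemcpyHostToDevice, b.s2);
        if (err != cudaSuccess) return err;
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) return err;
    auto t1 = std::chrono::steady_clock::now();
    *ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return cudaSuccess;
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("asyncEngineCount = %d, bidirectional overlap expected: %s\n",
                prop.asyncEngineCount,
                canOverlapBothDirections(prop.asyncEngineCount) ? "yes" : "no");

    Buffers b{};
    b.bytes = 64u << 20;  // 64 MiB per direction (arbitrary size)
    CHECK(cudaMallocHost(&b.hIn, b.bytes));   // pinned, so the copy can be asynchronous
    CHECK(cudaMallocHost(&b.hOut, b.bytes));
    CHECK(cudaMalloc(&b.dSrc, b.bytes));
    CHECK(cudaMalloc(&b.dDst, b.bytes));
    std::memset(b.hIn, 0, b.bytes);
    CHECK(cudaMemset(b.dSrc, 0, b.bytes));
    CHECK(cudaStreamCreate(&b.s1));
    CHECK(cudaStreamCreate(&b.s2));

    double warm = 0, d2hMs = 0, h2dMs = 0, bothMs = 0;
    CHECK(timeCopies(b, true, true, &warm));     // warm-up run, not reported
    CHECK(timeCopies(b, true, false, &d2hMs));   // D2H alone
    CHECK(timeCopies(b, false, true, &h2dMs));   // H2D alone
    CHECK(timeCopies(b, true, true, &bothMs));   // both, on separate streams

    std::printf("D2H alone: %.2f ms, H2D alone: %.2f ms, both: %.2f ms\n",
                d2hMs, h2dMs, bothMs);

    CHECK(cudaStreamDestroy(b.s1));
    CHECK(cudaStreamDestroy(b.s2));
    CHECK(cudaFree(b.dSrc));
    CHECK(cudaFree(b.dDst));
    CHECK(cudaFreeHost(b.hIn));
    CHECK(cudaFreeHost(b.hOut));
    return 0;
}
```

If the "both" time comes out close to the sum of the two individual times, the copies were serialized. If it is close to the larger of the two, they overlapped.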
Thank you for any insights.
kpg