PCI-E lanes are full duplex, so copies can run in both directions simultaneously. I am copying some data from host to device using cudaMemcpyAsync on 5 streams, and then copying a similar amount of data back using cudaMemcpyAsync on the same 5 streams (no kernels are run). I expect 4 of the H2D and D2H copies to overlap, but the whole transfer takes about the same time as copying synchronously.
Does cudaMemcpyAsync allow D2H and H2D copies to run simultaneously, and if so, why am I getting these results?
The host input memory is pinned and write-combined; the host output memory is pinned.
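For concreteness, here is a minimal sketch of the copy pattern described above. It is not the original code: the helper names `copyBothWays` and `copyEngineCount`, the `CHECK` macro, the choice of device 0, and the buffer size are all assumptions made for illustration. It also reads the `asyncEngineCount` device property, since H2D and D2H copies can only run concurrently on a device that reports two or more copy engines.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical macro: return early on the first CUDA error.
// For brevity, resources are not released on the error path.
#define CHECK(call)                                   \
    do {                                              \
        cudaError_t err_ = (call);                    \
        if (err_ != cudaSuccess) return err_;         \
    } while (0)

// Hypothetical helper: number of copy engines on device 0 (assumed device),
// or -1 if the query fails. A value of 2 or more is needed for H2D and D2H
// copies to overlap with each other.
int copyEngineCount() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

// Hypothetical helper: issue all H2D copies first, then all D2H copies,
// one pair per stream, with `bytes` transferred in each direction per stream.
cudaError_t copyBothWays(size_t bytes) {
    const int kStreams = 5;
    cudaStream_t stream[kStreams];
    void *hIn[kStreams], *hOut[kStreams], *dBuf[kStreams];

    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaStreamCreate(&stream[i]));
        // Input: pinned and write-combined. Output: pinned.
        CHECK(cudaHostAlloc(&hIn[i], bytes, cudaHostAllocWriteCombined));
        CHECK(cudaHostAlloc(&hOut[i], bytes, cudaHostAllocDefault));
        CHECK(cudaMalloc(&dBuf[i], bytes));
        std::memset(hIn[i], 1, bytes);  // write-only access suits write-combined memory
    }

    // Host-to-device copies on all streams.
    for (int i = 0; i < kStreams; ++i)
        CHECK(cudaMemcpyAsync(dBuf[i], hIn[i], bytes,
                              cudaMemcpyHostToDevice, stream[i]));

    // Device-to-host copies on the same streams. Within one stream the D2H
    // copy waits for that stream's H2D copy; across streams they may overlap.
    for (int i = 0; i < kStreams; ++i)
        CHECK(cudaMemcpyAsync(hOut[i], dBuf[i], bytes,
                              cudaMemcpyDeviceToHost, stream[i]));

    CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaFree(dBuf[i]));
        CHECK(cudaFreeHost(hIn[i]));
        CHECK(cudaFreeHost(hOut[i]));
        CHECK(cudaStreamDestroy(stream[i]));
    }
    return cudaSuccess;
}
```

Calling `copyBothWays(1 << 24)` from `main` and timing it with CUDA events, alongside the value returned by `copyEngineCount()`, should show whether the device is able to overlap the two directions at all.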