Does anyone have speedup numbers for bidirectional cudaMemcpyAsync (i.e. concurrent H2D and D2H copies) on Fermi GPUs?
Bidirectional memcpys are only supported on GF100 Teslas.
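If you'd rather check a particular card than go by the model name, the CUDA runtime reports the number of copy engines in `cudaDeviceProp::asyncEngineCount`; a value of 2 means H2D and D2H copies can run concurrently. A minimal sketch, assuming CUDA 4.0 or newer (`copyEngineCount` is just a helper name made up for this example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of copy (DMA) engines the runtime
// reports for `device`, or -1 if the query fails (e.g. invalid device index).
int copyEngineCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::printf("no CUDA device found\n");
        return 0;
    }
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        // 1 = copy can overlap kernels, 2 = H2D and D2H copies can also overlap
        std::printf("device %d (%s): asyncEngineCount = %d\n",
                    d, prop.name, copyEngineCount(d));
    }
    return 0;
}
```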
Let’s say that we have a Tesla GPU with 2 DMA engines connected over a PCIe 2.0 bus.
Roughly what is the maximum memcpy bandwidth that could be achieved per direction?
a) unidirectional transfers with no overlap, e.g. only memcpyH2Ds
b) bidirectional, fully overlapping transfers, e.g. concurrent memcpyH2Ds and memcpyD2Hs
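For an upper bound: PCIe 2.0 runs at 5 GT/s per lane with 8b/10b encoding, which is 0.5 GB/s per lane per direction. Assuming an x16 link (the link width isn't stated above), that gives 8 GB/s per direction before packet overhead, and since the link is full duplex the same ceiling applies to both cases a) and b). What you actually get will be lower and depends on the chipset, so it's probably easiest to measure. A rough sketch of such a measurement, assuming pinned host buffers and one stream per direction; the function names, the 64 MB buffer size and the repeat count are arbitrary choices for this example, and cleanup on the error path is skipped for brevity:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the measuring function with -1.0 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1.0; } while (0)

// Theoretical PCIe 2.0 rate per direction in GB/s, before packet overhead:
// 5 GT/s per lane, 8 payload bits per 10 line bits, 8 bits per byte.
double pcie2PeakGBps(int lanes) { return lanes * 5.0 * 8.0 / 10.0 / 8.0; }

// Issues `reps` H2D copies of `bytes` each; if `bidirectional`, also issues a
// D2H copy in a second stream for each one. Returns GB/s achieved per direction.
double measureGBps(size_t bytes, int reps, bool bidirectional) {
    void *hostUp, *hostDown, *devUp, *devDown;
    cudaStream_t up, down;
    CHECK(cudaMallocHost(&hostUp, bytes));    // pinned memory, needed for async copies
    CHECK(cudaMallocHost(&hostDown, bytes));
    CHECK(cudaMalloc(&devUp, bytes));
    CHECK(cudaMalloc(&devDown, bytes));
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));
    CHECK(cudaDeviceSynchronize());

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpyAsync(devUp, hostUp, bytes, cudaMemcpyHostToDevice, up));
        if (bidirectional) {
            CHECK(cudaMemcpyAsync(hostDown, devDown, bytes, cudaMemcpyDeviceToHost, down));
        }
    }
    CHECK(cudaDeviceSynchronize());           // wait for both streams to drain
    auto t1 = std::chrono::steady_clock::now();

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFree(devUp);
    cudaFree(devDown);
    cudaFreeHost(hostUp);
    cudaFreeHost(hostDown);

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    // Each direction moved bytes * reps in total, so this is a per-direction rate.
    return static_cast<double>(bytes) * reps / seconds / 1e9;
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::printf("no CUDA device found\n");
        return 0;
    }
    const size_t bytes = 64u << 20;           // 64 MB per copy (arbitrary)
    const int reps = 20;                      // arbitrary repeat count
    std::printf("PCIe 2.0 x16 peak per direction: %.1f GB/s\n", pcie2PeakGBps(16));
    std::printf("a) H2D only:         %.2f GB/s\n", measureGBps(bytes, reps, false));
    std::printf("b) H2D with D2H too: %.2f GB/s per direction\n", measureGBps(bytes, reps, true));
    return 0;
}
```

If the card has only one copy engine, the copies in case b) get serialized and the per-direction figure should come out at roughly half of a).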