Bi-directional cudaMemcpy on Fermi

Does anyone have speedup numbers for bi-directional cudaMemcpyAsync (i.e. concurrent h->d and d->h) on Fermi cards?

Bidirectional memcpys are only supported on GF100 Teslas.

Let’s say that we have a Tesla GPU with 2 DMA engines connected over a PCIe 2.0 bus.

What’s approximately the max. memcpy bandwidth that could be achieved per direction?

a) unidirectional transfers with no overlap, e.g. only memcpyH2Ds

b) bi-directional, fully overlapping transfers, e.g. concurrent memcpyH2Ds and memcpyD2Hs
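
For a rough upper bound on both cases (a link-rate calculation, not a measurement), the PCIe 2.0 signalling rate of 5 GT/s per lane with 8b/10b encoding can be turned into a per-direction ceiling. The sketch below assumes an x16 link, which the question does not actually state, and the function name is just an illustration. Real cudaMemcpyAsync throughput from pinned host memory will come in below these figures because of packet and protocol overhead, and it also depends on the host chipset.

```python
def pcie_peak_gb_per_s(lanes, gt_per_s, payload_bits, line_bits):
    """Raw per-direction link bandwidth in GB/s (1 GB = 1e9 bytes)."""
    # Each transfer carries one bit per lane; the line encoding
    # (8b/10b on PCIe 2.0) leaves payload_bits out of every line_bits usable.
    usable_gbit_per_s = lanes * gt_per_s * payload_bits / line_bits
    return usable_gbit_per_s / 8.0  # bits -> bytes


# PCIe 2.0: 5 GT/s per lane, 8b/10b encoding.
# ASSUMPTION: the card sits in an x16 slot (not stated in the question).
per_direction = pcie_peak_gb_per_s(lanes=16, gt_per_s=5.0,
                                   payload_bits=8, line_bits=10)

# (a) Transfers in one direction only: one half of the link is in use.
one_way = per_direction

# (b) Fully overlapped h->d and d->h: PCIe is full duplex, so the
# per-direction ceiling is unchanged and the aggregate is twice that.
both_ways_total = 2 * per_direction

print(f"(a) ceiling per direction: {one_way:.1f} GB/s")
print(f"(b) ceiling per direction: {per_direction:.1f} GB/s, "
      f"aggregate: {both_ways_total:.1f} GB/s")
```

So in the ideal case the per-direction number is the same for a) and b); the benefit of the second DMA engine is that both directions can run at once instead of taking turns.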