Are full-duplex PCIe transfers exploitable from CUDA? Does GT200 support them?

Hi, I was reading “CUDA, Supercomputing for the Masses: Part 12” and it states the following about zero copy:
“At best, it will only deliver PCIe bandwidth performance, but this can be 2x faster than cudaMemcpy because mapped memory is able exploit the full duplex capability of the PCIe bus by reading and writing at the same time. A call to cudaMemcpy can only move data in one direction at a time (i.e., half duplex).”

To my knowledge, this could also be exploited with two streams, each issuing a cudaMemcpyAsync (one D2H and one H2D) on pinned host memory.
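
To make the two-stream idea concrete, here is a minimal sketch of what I mean. The buffer names, the float element type and the host-side timing are my own choices for illustration, and it uses cudaDeviceSynchronize, which is the name in newer toolkits (older ones such as 2.2 used cudaThreadSynchronize). Whether the two copies actually overlap depends on the hardware; the code only issues them so that they are allowed to.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

int main() {
    const size_t n = 8000000;                // assumed: 8,000,000 floats
    const size_t bytes = n * sizeof(float);

    float *h_src = nullptr, *h_dst = nullptr;   // pinned host buffers
    float *d_in = nullptr, *d_out = nullptr;    // device buffers
    CHECK(cudaMallocHost((void**)&h_src, bytes));
    CHECK(cudaMallocHost((void**)&h_dst, bytes));
    CHECK(cudaMalloc((void**)&d_in, bytes));
    CHECK(cudaMalloc((void**)&d_out, bytes));
    for (size_t i = 0; i < n; ++i) h_src[i] = 1.0f;
    CHECK(cudaMemset(d_out, 0, bytes));

    // One stream per direction so the copies are not ordered against each other.
    cudaStream_t up, down;
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));
    CHECK(cudaDeviceSynchronize());

    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpyAsync(d_in, h_src, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaMemcpyAsync(h_dst, d_out, bytes, cudaMemcpyDeviceToHost, down));
    CHECK(cudaStreamSynchronize(up));
    CHECK(cudaStreamSynchronize(down));
    auto t1 = std::chrono::steady_clock::now();

    // Aggregate bandwidth counts the bytes moved in both directions.
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("aggregate bandwidth: %.2f GB/s\n", 2.0 * bytes / seconds / 1e9);

    CHECK(cudaStreamDestroy(up));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_src));
    CHECK(cudaFreeHost(h_dst));
    return 0;
}
```

If the aggregate figure comes out close to the one-direction cudaMemcpy figure, the two copies were most likely serialized rather than overlapped.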

But my question is this: I think current hardware (GT200) doesn’t support simultaneous D2H and H2D transfers, so zero copy would not be able to exploit the 2x PCIe bandwidth either…

I have also “confirmed” this with a simple zero-copy sample using a long vector (length 8,000,000): the aggregate effective bandwidth I measured is around 5.4 GB/s, which is about what a cudaMemcpy is capable of…
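
For reference, a zero-copy test of this kind can be sketched roughly as follows. This is not my exact sample: the kernel name scale, the launch configuration and the float element type are assumptions for illustration, and cudaDeviceSynchronize is the newer name for what older toolkits call cudaThreadSynchronize. The kernel reads one mapped host buffer and writes another, so traffic flows in both directions during the same launch.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Reads from one mapped host buffer and writes to another (grid-stride loop).
__global__ void scale(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) out[i] = 2.0f * in[i];
}

int main() {
    // Must be set before the context is created to allow mapped host memory.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    if (!prop.canMapHostMemory) {
        std::fprintf(stderr, "device cannot map host memory\n");
        return EXIT_FAILURE;
    }

    const size_t n = 8000000;                // assumed: 8,000,000 floats
    const size_t bytes = n * sizeof(float);

    float *h_in = nullptr, *h_out = nullptr; // pinned, mapped host buffers
    float *d_in = nullptr, *d_out = nullptr; // device-side aliases of the above
    CHECK(cudaHostAlloc((void**)&h_in, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&h_out, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void**)&d_in, h_in, 0));
    CHECK(cudaHostGetDevicePointer((void**)&d_out, h_out, 0));
    for (size_t i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Warm-up launch so one-time startup cost is not included in the timing.
    scale<<<256, 256>>>(d_in, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    auto t0 = std::chrono::steady_clock::now();
    scale<<<256, 256>>>(d_in, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();

    // Aggregate bandwidth counts bytes read plus bytes written over the bus.
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("aggregate bandwidth: %.2f GB/s\n", 2.0 * bytes / seconds / 1e9);
    std::printf("result check: %s\n", h_out[n - 1] == 2.0f ? "ok" : "mismatch");

    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

The number to compare against is twice the one-direction cudaMemcpy bandwidth measured with pinned memory on the same machine.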

I would like to get this sorted out as well. Are full-duplex PCIe 2.0 transfers possible at all with the GT200 architecture and CUDA 2.2?
(Either through streams and async memcpy D2H/H2D, or via the new 2.2 mapped-memory primitives…)

Zero-copy can be full duplex, but cudaMemcpys are only half-duplex.