Concurrency among copies: is it possible?

Hi everyone,
I want to ask something about concurrency. In CUDA's examples and tutorials, concurrency is always achieved by splitting the arrays so that the sequence [H2D copy - kernel run - D2H copy] becomes several smaller sequences (H stands for host and D for device). My question is: is it possible to have concurrency between the copies themselves? For example, H2D_a goes through stream_a, H2D_b goes through stream_b, and so on. (H2D_a and H2D_b copy different arrays, not pieces of one split-up array.)

thank you

If they are on different streams, then theoretically yes.
Just remember that you will always be limited by bandwidth and warps (and some other stuff).
Would love to give you a better answer, but I'm new to CUDA myself…

Since PCIe is full duplex, simultaneous transfers from and to the device are possible, i.e. two concurrent transfers in opposite directions. In order to make use of this capability, you need a GPU with dual copy engines for concurrent DMA transfers. Dual copy engines are a feature found on Tesla cards:

Please note that the total bandwidth required by simultaneous uploads and downloads, about 12 GB/sec with a PCIe gen2 bus, can overwhelm the system memory throughput of older host platforms.

Thanks for the replies,
but I think for many GeForce devices (I don't know about the newest Tesla), the asyncEngineCount device property is only 1, which means only one copy can run at a time. I also found that a Tesla with CC 2.0 allows only one H2D and one D2H copy at a time (bidirectional concurrent transfer), NOT two H2D copies at once. I found it here
So I concluded that issuing H2D_a through stream_a and H2D_b through stream_b doesn't give any speedup through concurrency.
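
For anyone who wants to check their own card, here is a minimal sketch of how the asyncEngineCount property can be read with the CUDA runtime API. It only prints what the driver reports for each device; how to interpret the number (1 = one copy engine, 2 = one engine per direction) follows the discussion in this thread, and newer hardware may report other values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        // asyncEngineCount is the number of copy (DMA) engines the
        // runtime reports for this device.
        std::printf("Device %d (%s), CC %d.%d: asyncEngineCount = %d\n",
                    dev, prop.name, prop.major, prop.minor,
                    prop.asyncEngineCount);
    }
    return 0;
}
```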

However, I'm not sure whether my conclusion is correct. Can anyone verify this?
Thank you very much
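
One way to test this conclusion empirically is a small timing experiment like the sketch below. It copies two separate pinned host arrays to the device, first with both copies on one stream and then with one copy per stream, and prints the wall-clock time for each case. The 64 MiB array size, the CHECK macro and the timeCopiesMs helper are my own choices for illustration, and the numbers will depend on the GPU and host platform.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Issue two H2D copies (stream sA and stream sB) and return the
// elapsed host time in milliseconds until both have finished.
static double timeCopiesMs(float* dA, float* dB,
                           const float* hA, const float* hB,
                           size_t bytes,
                           cudaStream_t sA, cudaStream_t sB) {
    CHECK(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, sA));
    CHECK(cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, sB));
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per array (arbitrary)

    // Pinned host memory is required for truly asynchronous copies.
    float *hA = nullptr, *hB = nullptr, *dA = nullptr, *dB = nullptr;
    CHECK(cudaMallocHost((void**)&hA, bytes));
    CHECK(cudaMallocHost((void**)&hB, bytes));
    CHECK(cudaMalloc((void**)&dA, bytes));
    CHECK(cudaMalloc((void**)&dB, bytes));
    std::memset(hA, 0, bytes);
    std::memset(hB, 0, bytes);

    cudaStream_t s1, s2;
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));

    timeCopiesMs(dA, dB, hA, hB, bytes, s1, s1);  // warm-up run
    double oneStream  = timeCopiesMs(dA, dB, hA, hB, bytes, s1, s1);
    double twoStreams = timeCopiesMs(dA, dB, hA, hB, bytes, s1, s2);

    std::printf("both H2D copies on one stream : %.3f ms\n", oneStream);
    std::printf("one H2D copy per stream       : %.3f ms\n", twoStreams);

    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaFreeHost(hA));
    CHECK(cudaFreeHost(hB));
    return 0;
}
```

If the two times come out roughly equal, the same-direction copies are being serialized, which would match the conclusion above.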

Correct, there cannot be concurrent transfers in the same direction. I don’t think PCIe has any provisions for that. The two concurrent transfers would be in opposite directions, with one DMA engine copying down to the device, the other copying up to the host. In practice, this is all that is needed: results from a previous iteration are sent back to the host while new data is being transported to the device (the GPU) for processing. In many instances it is possible to completely, or at least mostly, have these copies run concurrently with kernel execution. Obviously the amount of overlap depends on kernel execution times and copy times, as well as data dependencies between kernels.
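
The pattern described above might be sketched roughly as follows. The data is processed in chunks that alternate between two streams, so the download of one chunk's results can overlap with the upload of the next chunk when the GPU has two copy engines. The scale kernel, the chunk sizes and the CHECK macro are placeholders I made up for illustration; whether any overlap actually happens depends on the hardware.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Placeholder kernel: multiply every element by a constant factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 2;
    const int kChunks = 4;
    const int kChunkElems = 1 << 20;
    const size_t chunkBytes = kChunkElems * sizeof(float);
    const size_t totalElems = (size_t)kChunks * kChunkElems;

    // Pinned host memory, needed for asynchronous copies.
    float* hData = nullptr;
    CHECK(cudaMallocHost((void**)&hData, kChunks * chunkBytes));
    for (size_t i = 0; i < totalElems; ++i) hData[i] = 1.0f;

    float* dBuf[kStreams];
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaMalloc((void**)&dBuf[s], chunkBytes));
        CHECK(cudaStreamCreate(&streams[s]));
    }

    // Each chunk runs H2D copy, kernel, D2H copy on its own stream.
    // Work within one stream stays ordered, so reusing dBuf[s] for a
    // later chunk on the same stream is safe.
    for (int c = 0; c < kChunks; ++c) {
        int s = c % kStreams;
        float* hChunk = hData + (size_t)c * kChunkElems;
        CHECK(cudaMemcpyAsync(dBuf[s], hChunk, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(kChunkElems + 255) / 256, 256, 0, streams[s]>>>(
            dBuf[s], kChunkElems, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hChunk, dBuf[s], chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());

    // Sanity check: every element should have been doubled.
    bool ok = true;
    for (size_t i = 0; i < totalElems; ++i) {
        if (hData[i] != 2.0f) { ok = false; break; }
    }
    std::printf("results correct: %s\n", ok ? "yes" : "no");

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamDestroy(streams[s]));
        CHECK(cudaFree(dBuf[s]));
    }
    CHECK(cudaFreeHost(hData));
    return 0;
}
```

A profiler timeline is the easiest way to see whether the D2H copy of one chunk really runs at the same time as the H2D copy of the next.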

When I mentioned dual copy engines, I also stated that this is a feature of Tesla GPUs. More specifically, this would be Tesla GPUs with Fermi or Kepler architecture, such as C2050, C2070, C2075, M2050, M2070, M2090, K10, K20, K20X. At the Tesla-specific link I posted above it says:
Faster PCIe communication
The only NVIDIA product with two DMA engines for bi-directional PCIe communication

thanks a lot!