Overlap Device2Host and Host2Device memcpy? How can we overlap two cudaMemcpy calls?

I have a CUDA program that performs the following steps:

  1. Load a chunk of data onto the GPU (cudaMemcpy, host to device).
  2. Launch a kernel to work on that chunk.
  3. Read back the output from the GPU (cudaMemcpy, device to host) and go to step 1 for the next chunk.
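
In code, the loop is roughly this (a sketch; `processChunk`, the buffer names, and the sizes are placeholders for my actual code, and error checking is omitted):

```cpp
// Sketch of the loop above; processChunk, buffer names, and sizes
// are placeholders, and error checking is omitted.
const int numChunks = 16, chunkElems = 1 << 20;
const size_t chunkBytes = chunkElems * sizeof(float);

float *h_in  = (float*)malloc(numChunks * chunkBytes);
float *h_out = (float*)malloc(numChunks * chunkBytes);
float *d_in, *d_out;                       // separate in/out areas on the GPU
cudaMalloc((void**)&d_in,  chunkBytes);
cudaMalloc((void**)&d_out, chunkBytes);

for (int i = 0; i < numChunks; ++i) {
    // Step 1: host-to-device copy of the next chunk (blocks the CPU).
    cudaMemcpy(d_in, h_in + i * chunkElems, chunkBytes, cudaMemcpyHostToDevice);
    // Step 2: launch the kernel on that chunk.
    processChunk<<<256, 256>>>(d_in, d_out, chunkElems);
    // Step 3: device-to-host copy of the result (waits for the kernel).
    cudaMemcpy(h_out + i * chunkElems, d_out, chunkBytes, cudaMemcpyDeviceToHost);
}
```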

I’m working on a Quadro FX 5600 (compute capability 1.0), a device that doesn’t support asynchronous concurrent execution of kernels and memcpys.
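
For reference, the deviceOverlap field of cudaDeviceProp reports whether a card can copy memory and run a kernel concurrently; a quick check looks like this:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // deviceOverlap is 1 if the card can copy memory and run a kernel
    // concurrently; it is 0 on compute 1.0 parts like the FX 5600.
    printf("%s: deviceOverlap = %d\n", prop.name, prop.deviceOverlap);
    return 0;
}
```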

Here are my questions:

  1. Can we overlap steps 3 and 1 in my program? That is, is it possible to overlap two cudaMemcpy calls, one host-to-device and the other device-to-host? I am working with different chunks of memory on the host, and I have separate areas for input and output on the GPU.

  2. Can we do this overlap on my device (G80, compute capability 1.0)?

  3. If yes, how do I do it?

Looking for any help in this regard…

Well, anyway, I tried every possible combination on CUDA 1.0 hardware and finally came up with the answer:

Calling async functions on 1.0 hardware does allow CPU/GPU concurrency. But the moment another GPU function is called on 1.0 hardware, the GPU operations are serialized with respect to each other.

Really? What test case do you have that shows this? My tests have always shown that there is a queue depth of 16 async calls on compute 1.0 hardware. This increases to 24 on compute 1.1 hardware.

I tried implementing a streamed version of my application (with two streams) and ran it on 1.0 hardware without any benefit.
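
The streamed version followed roughly this pattern (a sketch with placeholder names; error checking omitted, and the host buffers allocated with cudaMallocHost so the async copies can actually overlap):

```cpp
// Sketch of the two-stream version (placeholder names, error checking
// omitted). h_in/h_out must be page-locked (cudaMallocHost) for
// cudaMemcpyAsync to be truly asynchronous.
float *d_in[2], *d_out[2];                 // per-stream device buffers
cudaStream_t stream[2];
for (int s = 0; s < 2; ++s) {
    cudaMalloc((void**)&d_in[s],  chunkBytes);
    cudaMalloc((void**)&d_out[s], chunkBytes);
    cudaStreamCreate(&stream[s]);
}

for (int i = 0; i < numChunks; i += 2) {
    for (int s = 0; s < 2 && i + s < numChunks; ++s) {
        const int c = i + s;
        // Queue copy-in, kernel, and copy-out back-to-back in one stream;
        // on compute 1.1+ the copies of one stream can overlap the kernel
        // of the other.
        cudaMemcpyAsync(d_in[s], h_in + c * chunkElems, chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        processChunk<<<256, 256, 0, stream[s]>>>(d_in[s], d_out[s], chunkElems);
        cudaMemcpyAsync(h_out + c * chunkElems, d_out[s], chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
}
cudaThreadSynchronize();   // wait for both streams (pre-CUDA 4.0 API)
```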

The same application on 1.1 hardware (an 8600M GT) gave about a 1.8x speedup over the non-streamed version.

I then concluded that calls to async functions on 1.0 hardware were in effect being serialized. I was working with CUDA 2.0, and I just tested with the CUDA 1.1 toolkit as well, to no avail.

Let me know if there is any way I might be able to squeeze some more speed out of the memcpys. PCI-E is supposed to have decent bidirectional bandwidth, which is why I was wondering whether we can overlap the data send and data receive to/from the GPU…

Oh, I think I was confused about what you meant by async. On both compute 1.0 and 1.1 hardware, all kernel calls are asynchronous: they return immediately after the launch (provided you don’t overfill the queue). So you can overlap CPU and GPU computation no matter what hardware you have.
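
In other words, something like this works on any compute capability (a sketch; doCpuWork is a placeholder):

```cpp
// Kernel launches return immediately, so the CPU can do useful work
// while the GPU computes, even on compute 1.0 hardware.
processChunk<<<256, 256>>>(d_in, d_out, chunkElems);   // returns at once
doCpuWork();                              // placeholder; overlaps the kernel
cudaMemcpy(h_out, d_out, chunkBytes,
           cudaMemcpyDeviceToHost);       // first blocking call: waits here
```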

You seem to be referring to concurrent async memcpy/kernel execution in two streams, which is a different beast and only gives speedups on compute 1.1 hardware. Async memcpy calls on 1.0 hardware are serialized. Sorry for the confusion.

I think I remember NVIDIA posting somewhere that there is a hardware limitation in the GPU that prevents concurrent reads and writes over PCI-e.

Edit: How much one-way bandwidth are you getting? PCI-e gen 1 parts with pinned memory top out at 2.5-3.5 GiB/s depending on the motherboard. PCI-e gen 2 parts range from 4 GiB/s (780i chipset) to ~5 GiB/s (X38).
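
A quick way to measure it is a pinned-buffer copy timed with CUDA events, roughly like this (the SDK’s bandwidthTest sample does the same thing more thoroughly):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;            // 64 MiB test buffer
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);    // pinned, as in the numbers above
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a single host-to-device copy on the GPU's clock.
    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.2f GiB/s\n",
           ((double)bytes / (1 << 30)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```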