Overlap Device2Host and Host2Device memcpy? How can we overlap two cudaMemcpy calls?

suhailrehman · May 26, 2008, 8:24am

I have a CUDA program that performs the following steps:

1. Load a chunk of data onto the GPU (cudaMemcpy, host to device).
2. Launch a kernel to work on that chunk.
3. Read back the output from the GPU (cudaMemcpy, device to host), then go to step 1 for the next chunk.

I’m working on a Quadro FX5600, a compute capability 1.0 device that doesn’t support asynchronous concurrent execution of kernel and memcpy.

Here are my questions:

1. Can we overlap steps 3 and 1 in my program? I.e., is it possible to overlap two cudaMemcpy calls (one host-to-device, the other device-to-host)? I am working with different chunks of memory on the host, and I have separate areas for input and output on the GPU.
2. Can this be done on my device (G80, compute capability 1.0)?
3. If yes, how do I do it?

Looking for any help in this regard…

suhailrehman · June 2, 2008, 7:12am

Well, anyway, I tried every possible combination on compute 1.0 hardware and finally came up with the answer:

Calling async functions on 1.0 hardware allows CPU/GPU concurrency, but as soon as another GPU function is called, the GPU operations are serialized.

MisterAnderson42 · June 2, 2008, 1:49pm

Really? What test case do you have that shows this? My tests have always shown that there is a queue depth of 16 async calls on compute 1.0 hardware. This increases to 24 on compute 1.1 hardware.

suhailrehman · June 4, 2008, 8:16am

I tried implementing a streamed version of my application (with 2 streams) and ran on 1.0 hardware without any benefit.

The same application on 1.1 hardware (8600 M GT) gave about 1.8x speedup over non-streamed.

I then concluded that calls to async functions on 1.0 hardware were in effect being serialized. I was working with CUDA 2.0, and also tested with the CUDA 1.1 Toolkit, to no avail.

Let me know if there is any way I can squeeze some more speed out of the memcpys. PCI-E is supposed to have decent bidirectional bandwidth, which is why I was wondering if we can overlap the data send and data receive to/from the GPU…

MisterAnderson42 · June 4, 2008, 1:48pm

Oh, I think I was confused about what you meant by async. In both compute 1.0 and 1.1 hardware: all kernel calls are asynchronous, meaning that they return immediately after the call (if you don’t overfill the queue, that is). So you can do CPU and GPU computation overlap no matter what hardware you have.

You seem to be referring to the concurrent async memcpy/kernel execution in 2 streams which is a different beast and of course only allows speedups on compute 1.1 hardware. Async memcpy calls on 1.0 hardware are serialized. Sorry for the confusion.
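
For readers who have not seen the pattern, here is a minimal sketch of the two-stream approach being discussed. It is not the original poster's code: the `scale` kernel, the chunk sizes, and the `CHECK` macro are made up for illustration. It assumes pinned host memory (which `cudaMemcpyAsync` needs in order to be truly asynchronous) and a device that can overlap copies with kernel execution; on hardware without that capability the same code should still run correctly, just without any overlap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-check helper for this sketch only.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for the real per-chunk work.
__global__ void scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int kStreams = 2;            // two streams, as in the thread
    const int kChunks = 8;             // illustrative chunk count
    const int kChunkElems = 1 << 20;   // illustrative chunk size
    const size_t chunkBytes = kChunkElems * sizeof(float);

    // Pinned (page-locked) host buffers for input and output.
    float* hIn = NULL;
    float* hOut = NULL;
    CHECK(cudaMallocHost((void**)&hIn, kChunks * chunkBytes));
    CHECK(cudaMallocHost((void**)&hOut, kChunks * chunkBytes));
    for (size_t i = 0; i < (size_t)kChunks * kChunkElems; ++i) hIn[i] = 1.0f;

    // One stream and one pair of device buffers per stream.
    cudaStream_t stream[kStreams];
    float* dIn[kStreams];
    float* dOut[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        CHECK(cudaMalloc((void**)&dIn[s], chunkBytes));
        CHECK(cudaMalloc((void**)&dOut[s], chunkBytes));
    }

    // Issue copy-in, kernel, copy-out for each chunk, alternating streams.
    // Operations within one stream run in order, so reusing that stream's
    // device buffers for a later chunk is safe.
    for (int c = 0; c < kChunks; ++c) {
        int s = c % kStreams;
        const float* src = hIn + (size_t)c * kChunkElems;
        float* dst = hOut + (size_t)c * kChunkElems;
        CHECK(cudaMemcpyAsync(dIn[s], src, chunkBytes,
                              cudaMemcpyHostToDevice, stream[s]));
        scale<<<(kChunkElems + 255) / 256, 256, 0, stream[s]>>>(
            dIn[s], dOut[s], kChunkElems);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(dst, dOut[s], chunkBytes,
                              cudaMemcpyDeviceToHost, stream[s]));
    }
    CHECK(cudaDeviceSynchronize());

    // Spot-check the first and last output elements.
    size_t last = (size_t)kChunks * kChunkElems - 1;
    printf("first=%f last=%f\n", hOut[0], hOut[last]);

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaFree(dIn[s]));
        CHECK(cudaFree(dOut[s]));
        CHECK(cudaStreamDestroy(stream[s]));
    }
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return 0;
}
```

Whether the work in the two streams actually overlaps depends on the hardware, which is the point of this thread; timing the loop with and without streams is the simplest way to find out.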

I think I remember it being posted by NVIDIA somewhere that there is a hardware limitation in the GPU that prevents concurrent read/write over PCI-e.
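
As a side note for anyone reading this with a newer toolkit: CUDA releases later than the ones discussed in this thread let you query the number of asynchronous copy engines directly. The sketch below assumes such a toolkit (`cudaDeviceGetAttribute` and `cudaDevAttrAsyncEngineCount` did not exist in CUDA 1.x/2.0). Per the CUDA documentation, a value of 1 means the device can overlap a copy with kernel execution, and 2 means it can also copy in both directions at the same time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
        fprintf(stderr, "No usable CUDA device found\n");
        return 1;
    }

    int major = 0, minor = 0, engines = 0;
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
    cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device);

    printf("Compute capability: %d.%d\n", major, minor);
    printf("Async copy engines: %d\n", engines);

    if (engines >= 2) {
        printf("Copies in both directions can overlap each other and a kernel.\n");
    } else if (engines == 1) {
        printf("A copy can overlap a kernel, but not a copy in the other direction.\n");
    } else {
        printf("Copies and kernels are serialized on this device.\n");
    }
    return 0;
}
```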

Edit: How much one-way bandwidth are you getting? PCI-e gen 1 parts with pinned memory top out at 2.5 - 3.5 GiB/s depending on the motherboard. PCI-e gen 2 parts range from 4 GiB/s (780i chipset) to ~5 GiB/s (P38).
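
If you want to measure your own one-way number, a rough sketch along these lines should do. The 64 MiB buffer size and the repetition count are arbitrary choices for illustration, and the bandwidthTest sample shipped with the CUDA SDK does the same job more thoroughly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per copy (arbitrary)
    const int reps = 10;             // number of timed copies (arbitrary)

    void* hBuf = NULL;
    void* dBuf = NULL;
    // Pinned host memory; pageable memory typically gives lower numbers.
    if (cudaMallocHost(&hBuf, bytes) != cudaSuccess ||
        cudaMalloc(&dBuf, bytes) != cudaSuccess) {
        fprintf(stderr, "Allocation failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One untimed warm-up copy so setup costs are not measured.
    cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i) {
        cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    double gib = (double)bytes * reps / (1024.0 * 1024.0 * 1024.0);
    printf("Host-to-device: %.2f GiB/s\n", gib / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    return 0;
}
```

Swapping the copy direction to `cudaMemcpyDeviceToHost` (and the pointer order) gives the readback number.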
