Concurrent execution of kernels and GPU memory copies

Here goes…

I have a G92 (compute capability 1.1), which supports concurrent execution of a kernel and device memory copies, exposed via streams and the Async functions. I have two questions regarding this functionality…

The CUDA 1.1 release notes say:

" Current hardware limits the number of asynchronous memcopies that can
be overlapped with kernel execution. Overlap is also limited to kernels
executing for less than 1 second. These limitations are expected to
improve on future hardware. "

Can someone at NVIDIA say what this number is for G92 cards?
Greater than 1? Greater than 2?

This could be useful: create roughly as many streams as this number to maximize the use of this feature, or am I wrong?
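To make it concrete, something like this is what I have in mind. It's just a sketch: the kernel, the chunk size and the number of streams are placeholders, and async copies need page-locked host memory.

#include <cuda_runtime.h>

#define NSTREAMS 2          // would be set to whatever that hardware limit is
#define CHUNK    (1 << 20)  // elements per stream

__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;   // dummy work
}

int main(void)
{
    float *h_data, *d_data;
    cudaStream_t stream[NSTREAMS];

    // async copies require page-locked (pinned) host memory
    cudaMallocHost((void **)&h_data, NSTREAMS * CHUNK * sizeof(float));
    cudaMalloc((void **)&d_data, NSTREAMS * CHUNK * sizeof(float));

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < NSTREAMS; ++i) {
        int off = i * CHUNK;
        // the H2D copy for chunk i can overlap with the kernel of chunk i-1
        cudaMemcpyAsync(d_data + off, h_data + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        myKernel<<<CHUNK / 256, 256, 0, stream[i]>>>(d_data + off, CHUNK);
    }

    cudaThreadSynchronize();   // wait for all streams to finish

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(stream[i]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}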

Another question:
Assuming the number stated above is at least 2, is it possible to use this feature to run a memory copy from GPU to CPU and one in the opposite direction concurrently, so as to maximize the use of the available bidirectional bandwidth of the PCI Express bus?

I.e., can one stream be doing a D2H copy while another does an H2D copy simultaneously? And if the answer is no, is it possible to say whether it's a hardware limitation of the latest hardware (compute 1.1) or an API limitation (in case the aforementioned number is 1)?

I hope someone can give me a clear picture of these technicalities.
Thx in advance.

On G92, you can overlap a transfer in one direction (D2H or H2D) with compute.
It is a hardware limitation.

Sorry, I forgot to make the last question more general.

Is it possible to have this feature on compute 1.0 hardware, i.e. are simultaneous D2H and H2D copies possible on this hardware, using one stream for each and async memcpy?

I hope this can bring some speed improvements on compute 1.0 hardware (in some problems) by using two streams: one with a D2H memcpy to copy back the results of a kernel that has just been launched, and another stream with an H2D memcpy to copy the input data for the next kernel call.
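To be concrete, this is roughly the pattern I mean (processKernel, the buffers and the sizes are just placeholders, and the host buffers are assumed to be allocated with cudaMallocHost):

#include <cuda_runtime.h>

__global__ void processKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;   // dummy work
}

// One step of the pipeline: stream s0 runs the kernel on the current input
// and copies its results back (D2H); stream s1 meanwhile copies the input
// for the next call onto the device (H2D).
void step(const float *d_in, float *d_out, float *h_out,
          const float *h_in_next, float *d_in_next, int n,
          cudaStream_t s0, cudaStream_t s1)
{
    processKernel<<<(n + 255) / 256, 256, 0, s0>>>(d_out, d_in, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, s0);   // D2H in stream 0
    cudaMemcpyAsync(d_in_next, h_in_next, n * sizeof(float),
                    cudaMemcpyHostToDevice, s1);   // H2D in stream 1
}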

I suppose this model is possible in compute 1.1, or not?
Thx.

This is a compute capability 1.1 feature, so no, you cannot do it with 1.0 hardware.

First of all, thanks M. Fatica for your fast response.

As I understand it, compute 1.1 allows simultaneous execution of a kernel and memcopies, but async memcopies (leaving the CPU free for other work…) are supported on compute 1.0 hardware too.

My question is about concurrent memcopies to/from the device (I'm specifically interested in a concurrent D2H and H2D copy). From Fatica's answer I understand that this is not allowed because of a hardware limitation (neither on 1.0 nor on 1.1 hardware), but I realize that the streams abstraction could theoretically let suitably enabled hardware do this kind of operation (I speculate the limitation exists because the GPU has only "one" DMA engine, which must be programmed for either D2H or H2D copies).

Yes, the stream abstraction will allow this behavior.
If you use streams in your code now, once the proper hardware is released, it will overlap D2H, H2D and kernel execution.
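If you want to check at runtime whether a given card can overlap memcopies with kernel execution at all, you can look at the deviceOverlap field returned by cudaGetDeviceProperties(). A minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0;
    cudaDeviceProp prop;

    // query the device the runtime is currently using
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&prop, dev);

    // deviceOverlap is 1 if the device can copy memory between host and
    // device while a kernel is executing
    printf("Device %d (%s): compute %d.%d, deviceOverlap = %d\n",
           dev, prop.name, prop.major, prop.minor, prop.deviceOverlap);
    return 0;
}

On cards where deviceOverlap is 0, the same streamed code still runs correctly; the copies and kernels simply serialize.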