Concurrent execution of kernels and GPU memory copies

Here goes…

I have a G92 (compute capability 1.1), which supports concurrent execution of a kernel and device memory copies, exposed via streams and the Async functions. I have two questions regarding this functionality…

The CUDA 1.1 release notes say:

" Current hardware limits the number of asynchronous memcopies that can
be overlapped with kernel execution. Overlap is also limited to kernels
executing for less than 1 second. These limitations are expected to
improve on future hardware. "

Can someone at NVIDIA say what this number is for G92 cards?
Greater than 1? Greater than 2?

This could be useful: create roughly as many streams as this number to maximize the use of this feature, or am I wrong?
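To make it concrete, something like this is what I have in mind. It's just a sketch: the kernel, the chunk size and the number of streams are placeholders, and async copies need page-locked host memory.

#include <cuda_runtime.h>

#define NSTREAMS 2          // would be set to whatever that hardware limit is
#define CHUNK    (1 << 20)  // elements per stream

__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;   // dummy work
}

int main(void)
{
    float *h_data, *d_data;
    cudaStream_t stream[NSTREAMS];

    // async copies require page-locked (pinned) host memory
    cudaMallocHost((void **)&h_data, NSTREAMS * CHUNK * sizeof(float));
    cudaMalloc((void **)&d_data, NSTREAMS * CHUNK * sizeof(float));

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < NSTREAMS; ++i) {
        int off = i * CHUNK;
        // the H2D copy for chunk i can overlap with the kernel of chunk i-1
        cudaMemcpyAsync(d_data + off, h_data + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        myKernel<<<CHUNK / 256, 256, 0, stream[i]>>>(d_data + off, CHUNK);
    }

    cudaThreadSynchronize();   // wait for all streams to finish

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(stream[i]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}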

Another question:
Assuming the number stated above is at least 2, is it possible to use this feature to run a memory copy from GPU to CPU and one in the opposite direction concurrently, so as to maximize the use of the available bidirectional bandwidth of the PCI Express bus?

I.e., can one stream be doing a D2H copy while another does an H2D copy simultaneously? And if the answer is no, is it possible to say whether it's a hardware limitation of the latest hardware (compute 1.1) or an API limitation (in case the aforementioned number is 1)?

I hope someone can give me a clear picture of these technicalities.
Thx in advance.

On G92, you can overlap a transfer in one direction (D2H or H2D) with compute.
It is a hardware limitation.

Sorry, I forgot to make the last question more general.

Is it possible to have this feature on compute 1.0 hardware, i.e. are simultaneous D2H and H2D copies possible on this hardware, using one stream for each and async memcpy?

I hope this can bring some speed improvements on compute 1.0 hardware (in some problems) by using two streams: one with a D2H memcpy to copy back the results of a kernel that has just been launched, and another stream with an H2D memcpy to copy the input data for the next kernel call.
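To be concrete, this is roughly the pattern I mean (processKernel, the buffers and the sizes are just placeholders, and the host buffers are assumed to be allocated with cudaMallocHost):

#include <cuda_runtime.h>

__global__ void processKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;   // dummy work
}

// One step of the pipeline: stream s0 runs the kernel on the current input
// and copies its results back (D2H); stream s1 meanwhile copies the input
// for the next call onto the device (H2D).
void step(const float *d_in, float *d_out, float *h_out,
          const float *h_in_next, float *d_in_next, int n,
          cudaStream_t s0, cudaStream_t s1)
{
    processKernel<<<(n + 255) / 256, 256, 0, s0>>>(d_out, d_in, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, s0);   // D2H in stream 0
    cudaMemcpyAsync(d_in_next, h_in_next, n * sizeof(float),
                    cudaMemcpyHostToDevice, s1);   // H2D in stream 1
}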

I suppose this model is possible in compute 1.1, or not?
Thx.

This is a compute capability 1.1 feature, so no, you cannot do it with 1.0 hardware.

First of all, thanks M. Fatica for your fast response.

As I understand it, compute 1.1 allows simultaneous execution of a kernel and memcopies, but async memcopies (leaving the CPU free for other work…) are supported on compute 1.0 hardware too.

My question is about concurrent memcopies to/from the device (I'm specifically interested in a concurrent D2H and H2D copy). From Fatica's answer I understand that this is not allowed because of a hardware limitation (neither on 1.0 nor on 1.1 hardware), but I realize that the streams abstraction could theoretically let suitably enabled hardware do this kind of operation (I speculate the limitation exists because the GPU has only "one" DMA engine, which must be programmed for either D2H or H2D copies).

Yes, the stream abstraction will allow this behavior.
If you use streams in your code now, once the proper hardware is released, it will overlap D2H, H2D and kernel execution.
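If you want to check at runtime whether a given card can overlap memcopies with kernel execution at all, you can look at the deviceOverlap field returned by cudaGetDeviceProperties(). A minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0;
    cudaDeviceProp prop;

    // query the device the runtime is currently using
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&prop, dev);

    // deviceOverlap is 1 if the device can copy memory between host and
    // device while a kernel is executing
    printf("Device %d (%s): compute %d.%d, deviceOverlap = %d\n",
           dev, prop.name, prop.major, prop.minor, prop.deviceOverlap);
    return 0;
}

On cards where deviceOverlap is 0, the same streamed code still runs correctly; the copies and kernels simply serialize.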