I have a G92 card (compute capability 1.1), which supports concurrent execution of kernels and device memory copies, exposed via streams and the Async functions. I have two questions regarding this functionality.
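For reference, here is a minimal sketch of how the overlap capability can be queried from the runtime. It assumes a toolkit newer than the 1.1 release discussed here, one that provides `cudaDeviceGetAttribute` with `cudaDevAttrGpuOverlap` and `cudaDevAttrAsyncEngineCount`, and it assumes the card is device 0. It only reports what the driver exposes; it does not tell me what the G92 limit in the release notes actually is.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the card of interest is device 0
    int overlap = 0;       // 1 if copies can overlap kernel execution
    int engines = 0;       // number of asynchronous copy engines

    cudaError_t err =
        cudaDeviceGetAttribute(&overlap, cudaDevAttrGpuOverlap, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Per the runtime documentation: 1 means copy + kernel overlap,
    // 2 means copies in both directions can also overlap each other.
    err = cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("copy/kernel overlap supported: %d\n", overlap);
    printf("async copy engines:            %d\n", engines);
    return 0;
}
```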
The CUDA 1.1 release notes say:
" Current hardware limits the number of asynchronous memcopies that can
be overlapped with kernel execution. Overlap is also limited to kernels
executing for less than 1 second. These limitations are expected to
improve on future hardware. "
Can someone at NVIDIA say what this number is for G92 cards?
Is it greater than 1? Greater than 2?
Knowing it would be useful for creating roughly as many streams as this number,
to make the most of this feature. Or am I wrong?
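To make the idea concrete, this is roughly the pattern I have in mind: split the data into one chunk per stream so that the copy for one chunk can overlap the kernel working on another. It is only a sketch under assumptions. The stream count `kStreams`, the chunk size `kChunk`, and the `scale` kernel are made up for illustration, and it uses `cudaDeviceSynchronize` from later toolkits. Whether anything actually overlaps depends on the hardware limit I am asking about.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: multiplies each element by a constant factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 2;       // assumption: the unknown hardware limit
    const int kChunk = 1 << 20;   // elements handled by each stream
    const int n = kStreams * kChunk;
    const size_t chunkBytes = kChunk * sizeof(float);

    float *h = NULL, *d = NULL;
    // Async copies need page-locked (pinned) host memory.
    CHECK(cudaMallocHost((void **)&h, n * sizeof(float)));
    CHECK(cudaMalloc((void **)&d, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Each stream uploads its chunk, processes it, and downloads the result.
    for (int s = 0; s < kStreams; ++s) {
        const int off = s * kChunk;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(kChunk + 255) / 256, 256, 0, streams[s]>>>(d + off, kChunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    printf("first = %f, last = %f\n", h[0], h[n - 1]);

    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```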
Is it possible, using this feature and assuming the number mentioned above is 2 or greater, to run a memory copy from GPU to CPU concurrently with one from CPU to GPU, so as to make full use of the available bidirectional bandwidth of the PCI Express bus?
I.e., can one stream be doing a D2H copy while another does an H2D copy at the same time? And if the answer is no, is it possible to say whether it is a hardware limitation
of the latest hardware (compute capability 1.1) or an API limitation (in case the aforementioned number is 1)?
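A rough way to check this empirically would be to time one H2D and one D2H copy issued in the same stream against the same two copies issued in separate streams. The sketch below does that, under assumptions: the 64 MiB transfer size is arbitrary, the stream names `up` and `down` are mine, and it uses `cudaDeviceSynchronize` and C++11 `std::chrono`, which need a newer toolkit than 1.1. If the two-stream time comes out clearly lower, the copies presumably overlap; if both times are about equal, they are probably serialized, although this test alone cannot tell whether the hardware or the API is the cause.

```cuda
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Wall-clock time in seconds from a monotonic clock.
static double now_seconds() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(
               clock::now().time_since_epoch()).count();
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per direction (arbitrary)
    void *h_in = NULL, *h_out = NULL, *d_in = NULL, *d_out = NULL;

    // Pinned host buffers are required for truly asynchronous copies.
    CHECK(cudaMallocHost(&h_in, bytes));
    CHECK(cudaMallocHost(&h_out, bytes));
    CHECK(cudaMalloc(&d_in, bytes));
    CHECK(cudaMalloc(&d_out, bytes));
    memset(h_in, 0, bytes);
    CHECK(cudaMemset(d_out, 0, bytes));

    cudaStream_t up, down;
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));

    // Warm-up transfer so one-time setup costs are not measured.
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaDeviceSynchronize());

    // Case 1: both copies in one stream, so they must run back to back.
    double t0 = now_seconds();
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, up));
    CHECK(cudaDeviceSynchronize());
    double serial = now_seconds() - t0;

    // Case 2: one copy per stream, so the device is free to overlap them.
    t0 = now_seconds();
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down));
    CHECK(cudaDeviceSynchronize());
    double split = now_seconds() - t0;

    printf("same stream: %.3f ms\n", serial * 1e3);
    printf("two streams: %.3f ms\n", split * 1e3);

    CHECK(cudaStreamDestroy(up));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```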
I hope someone can give me a clear picture of these technicalities.
Thanks in advance.