CUDA concurrent D2H, H2D

With CUDA, which card series (GeForce/RTX/Tesla) and which microarchitectures support concurrent H2D and D2H memory transfers?

I remember that, in the past, Tesla cards supported these and GeForce cards did not. Has this changed? And what about Quadro cards?

Any GPU with 2 or more copy engines (discoverable via deviceQuery) should support simultaneous H2D and D2H transfers.

I think there are recent GPUs in any category that support this. I’m not aware of a table or list anywhere. I also think if you search hard enough, you can find (for example) Quadro GPUs with only 1 copy engine.

GeForce RTX 2080 Ti, for example, seems to report 3 copy engines.

Quadro 2000 seems to report 1 copy engine.

@Robert_Crovella: Other than querying the card when you have it, is there a listing of the number of copy engines for different cards?

I am not aware of any such list. Maybe we need to crowdsource one? Here is one more data point:

Device 0: "Quadro K420"
  CUDA Driver Version / Runtime Version          10.2 / 9.2
  CUDA Capability Major/Minor version number:    3.0
[...]
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)

Some more data points:

Quadro P620 has 2 copy engines.
Quadro RTX 6000 has 3 copy engines.
GTX 1050 Ti has 2 copy engines.

Here are some more data points:

Device 0: "Quadro RTX 4000"
  CUDA Driver Version / Runtime Version          11.0 / 9.2
  CUDA Capability Major/Minor version number:    7.5
[...]
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)

Device 1: "Quadro P2000"
  CUDA Driver Version / Runtime Version          11.0 / 9.2
  CUDA Capability Major/Minor version number:    6.1
[...]
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)

Hi,
Can anybody tell me whether something has changed in the Ampere architecture? My 3080 reports only 1 copy engine, while the 2080 SUPER (which is supposed to be inferior to the 3080) reports 3 copy engines.

@njuffa
@Robert_Crovella

I have zero experience with Ampere-based GPUs. I would suggest filing a bug report with NVIDIA.

Maybe (speculation!) NVIDIA has changed the definition of “copy engine” with Ampere and the new GPUs sport a new all-singing-all-dancing copy engine with multiple channels. If that were the case, reporting that as one engine would not be helpful to programmers who try to assess whether simultaneous bi-directional DMA transfers are possible.

I don’t happen to know the answer there. I also suggest filing a bug report.

Yes, I have filed a bug report for Windows 10.

The result is fine on Ubuntu 18.04 with CUDA 11.1, where 2 async engines are reported. That is understandable, since my variant (MSI Ventus 3X OC) does not have any NVLink or SLI connectors.

Great, I hope you included all that info (that it appears to be correct on Linux) in the bug report. Thanks for filing the bug, especially since it seems like it might be a driver issue.