About Async(Copy) Engines

I want to know the number of asynchronous engines of the graphics cards I’m asking about.
And if the number of asynchronous engines is 1 instead of 2 or more, I’d like to know how to perform HtoD / DtoH simultaneously.
Additionally, if the original number of asynchronous engines of these graphics cards is 2 or more, but cudaGetDeviceProperties(&deviceProp, deviceI) returns 1, I’d like to know how to restore the original number of asynchronous engines.

Here is the list of graphics cards I own:

  1. RTX 3090 Ti
  2. RTX 3070
  3. RTX 3080
  4. RTX 5090

Consumer graphics cards have historically enabled only 1 copy engine.
If an older driver exposed more copy engines because of a driver bug, you may be able to see them again by reverting to a much older driver. There is no guarantee that the additional copy engines were ever usable.
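
To check what the installed driver currently reports on each card, a minimal sketch along these lines can help. It only reads the `asyncEngineCount` field that `cudaGetDeviceProperties` fills in; it does not change anything, and the printed value is whatever the driver exposes, not necessarily what the silicon contains.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print the number of copy engines the driver reports for every visible GPU.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                         dev, cudaGetErrorString(err));
            return 1;
        }
        std::printf("Device %d (%s): asyncEngineCount = %d\n",
                    dev, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```

Running this before and after a driver change is one way to see whether the reported count differs between driver versions.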

A single copy engine cannot perform H2D and D2H transfers at the same time.
Small H2D copies generally do not use the copy engine.
If the host allocation is pinned system memory, a copy kernel (a kernel that reads or writes the host buffer directly) is an efficient way to achieve parallel copies.
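
As a rough illustration of the copy-kernel idea, the sketch below issues an H2D `cudaMemcpyAsync` in one stream while a kernel in a second stream writes a device buffer into mapped pinned host memory, which acts as the D2H direction. The names `copyKernel`, `hIn`, `hOut`, `dIn`, `dResult`, the buffer size, and the launch configuration are all illustrative assumptions, and whether the two transfers actually overlap depends on the GPU, driver, and platform; a profiler timeline is the way to confirm it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical copy kernel: grid-stride loop copying src to dst.
__global__ void copyKernel(const float* src, float* dst, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

int main() {
    const size_t n = (size_t)1 << 24;        // illustrative size
    const size_t bytes = n * sizeof(float);

    // Pinned host buffers: hIn feeds the H2D copy, hOut is mapped so the
    // kernel can write to it directly over the bus.
    float *hIn = nullptr, *hOut = nullptr;
    CHECK(cudaHostAlloc((void**)&hIn, bytes, cudaHostAllocDefault));
    CHECK(cudaHostAlloc((void**)&hOut, bytes, cudaHostAllocMapped));
    for (size_t i = 0; i < n; ++i) hIn[i] = 1.0f;

    float *dIn = nullptr, *dResult = nullptr, *dOutMapped = nullptr;
    CHECK(cudaMalloc((void**)&dIn, bytes));
    CHECK(cudaMalloc((void**)&dResult, bytes));
    CHECK(cudaMemset(dResult, 0, bytes));    // stand-in for earlier results
    // Device-side alias of the mapped host buffer.
    CHECK(cudaHostGetDevicePointer((void**)&dOutMapped, hOut, 0));

    cudaStream_t h2dStream, d2hStream;
    CHECK(cudaStreamCreate(&h2dStream));
    CHECK(cudaStreamCreate(&d2hStream));

    // H2D goes through the copy engine; D2H is done by SMs via the kernel.
    CHECK(cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, h2dStream));
    copyKernel<<<256, 256, 0, d2hStream>>>(dResult, dOutMapped, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());          // hOut is valid only after this

    std::printf("hOut[0] = %f\n", hOut[0]);

    CHECK(cudaStreamDestroy(h2dStream));
    CHECK(cudaStreamDestroy(d2hStream));
    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dResult));
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return 0;
}
```

The trade-off of this approach is that the copy kernel occupies SMs that could otherwise run compute work, so it is worth measuring against a plain `cudaMemcpyAsync` on the target card.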

Thank you for your answer. I have a couple of follow-up questions.

  1. Is there any official documentation that specifies the number of copy engines for specific GPU models? I couldn’t find this information in the Blackwell architecture diagrams.

  2. If so, does that mean I need to use a professional-grade GPU, like the NVIDIA RTX 6000, to utilize two or more asynchronous copy engines for simultaneous HtoD and DtoH transfers? I’m looking for the most cost-effective option with at least two copy engines.