Three concurrent cudaMemcpyAsync(HtoD) calls still serialized on H100/H20 even when asyncEngineCount == 3

Hi all,

I am trying to understand the real meaning of asyncEngineCount == 3 on Hopper GPUs such as H100 and H20.

In my test, I launch three host threads, and each thread issues cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream) on its own non-default stream. The host buffers are pinned, and the three threads submit as close to simultaneously as possible.
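The setup described above can be sketched roughly as follows. This is a minimal illustration, not the exact test code: the transfer size `kBytes`, the repeat count `kCopies`, the `CHECK` macro, and the `worker` function are all assumptions introduced for the example, and the threads are not barrier-synchronized, so their start times will differ slightly.

```cuda
// Sketch of the three-thread HtoD test. Sizes, counts, and names are assumed.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

constexpr int    kThreads = 3;
constexpr size_t kBytes   = 64 * 1024;  // assumed "small IO" transfer size
constexpr int    kCopies  = 1000;       // assumed number of copies per thread

// Each host thread owns a pinned buffer, a device buffer, and its own stream.
void worker(int device) {
    CHECK(cudaSetDevice(device));
    void* host = nullptr;
    void* dev  = nullptr;
    cudaStream_t stream;
    CHECK(cudaMallocHost(&host, kBytes));  // pinned host memory
    CHECK(cudaMalloc(&dev, kBytes));
    CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    for (int i = 0; i < kCopies; ++i) {
        CHECK(cudaMemcpyAsync(dev, host, kBytes,
                              cudaMemcpyHostToDevice, stream));
    }
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t) threads.emplace_back(worker, 0);
    for (auto& th : threads) th.join();
    return 0;
}
```

Running this under Nsight Systems should give one Memcpy HtoD row per stream, which makes it possible to check whether the copies overlap.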

However, when I inspect the timeline in Nsight Systems, I observe the following:

  • The red region shows that the three CPU threads are indeed issuing cudaMemcpyAsync concurrently.

  • The green region shows the actual Memcpy HtoD operations in the three CUDA streams.

  • The three Memcpy HtoD operations appear to be serialized, rather than executing simultaneously.

So although asyncEngineCount reports 3, I do not see three HtoD copies running in parallel.

My questions are:

  1. On Hopper, what exactly does asyncEngineCount == 3 mean?

  2. Does it mean that three copy engines can execute transfers in parallel?

This question is important for a small-IO scenario I am studying.

In the small-IO case, the effective bandwidth is dominated by submission latency. My expectation was that if the GPU has multiple copy engines, then issuing cudaMemcpyAsync concurrently from multiple CPU threads might improve throughput by overlapping several HtoD transfers.

But in practice, increasing CPU-side concurrency does not scale bandwidth proportionally. That seems to imply that even if the GPU has multiple copy engines, small HtoD transfers still cannot effectively run in parallel to improve performance.
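A rough way to express this expectation, assuming a simple latency-plus-bandwidth cost model (the symbols are introduced here for illustration: $s$ is the transfer size, $t_0$ the fixed per-copy latency, and $B$ the peak link bandwidth):

$$
T(s) = t_0 + \frac{s}{B}, \qquad
B_{\text{eff}}(s) = \frac{s}{T(s)} = \frac{B}{1 + t_0 B / s}
$$

When $s \ll t_0 B$, this reduces to $B_{\text{eff}} \approx s / t_0$, so throughput is set by latency rather than by the link. Under this model, $N$ copies that truly overlap their fixed latency could approach $N \cdot s / t_0$ (capped at $B$), whereas copies that serialize on one engine stay near $s / t_0$, which would be consistent with the observation above.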

Thanks in advance.

Copy engines are designed to saturate PCIe/C2C. If the device has more than one asynchronous copy engine, then by default the CUDA driver assigns the copy engines to different directions (H2D, D2H, and P2P), so asyncEngineCount == 3 does not by itself mean that three H2D copies will run in parallel. Copy engines are not designed to scale to device memory bandwidth, so kernels are used for D2D.

For small copies, the CUDA driver uses the front-end inline memory copy to avoid read latency. With multiple streams and a mix of small and large sizes, it is possible to have multiple H2D copies in flight at the same time.
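A mixed-size experiment along these lines could look like the sketch below. The sizes (one 256 MiB copy plus repeated 4 KiB copies), the repeat count, and the `CHECK` macro are assumptions chosen for illustration. Which path the driver picks for each copy is not controlled by this code, so whether the timeline shows overlap has to be checked in the profiler.

```cuda
// Sketch: one large and two small HtoD streams issued from a single thread.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

int main() {
    // Assumed sizes: stream 0 carries a large copy, streams 1 and 2 small ones.
    const size_t sizes[3] = {size_t{256} << 20, size_t{4} << 10,
                             size_t{4} << 10};
    void* host[3];
    void* dev[3];
    cudaStream_t stream[3];

    for (int i = 0; i < 3; ++i) {
        CHECK(cudaMallocHost(&host[i], sizes[i]));  // pinned host memory
        CHECK(cudaMalloc(&dev[i], sizes[i]));
        CHECK(cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking));
    }

    // Issue the large copy first, then keep submitting small copies on the
    // other two streams while the large one may still be in flight.
    CHECK(cudaMemcpyAsync(dev[0], host[0], sizes[0],
                          cudaMemcpyHostToDevice, stream[0]));
    for (int rep = 0; rep < 100; ++rep) {  // assumed repeat count
        for (int i = 1; i < 3; ++i) {
            CHECK(cudaMemcpyAsync(dev[i], host[i], sizes[i],
                                  cudaMemcpyHostToDevice, stream[i]));
        }
    }
    CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < 3; ++i) {
        CHECK(cudaStreamDestroy(stream[i]));
        CHECK(cudaFree(dev[i]));
        CHECK(cudaFreeHost(host[i]));
    }
    return 0;
}
```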