Hi all,
I am trying to understand the real meaning of asyncEngineCount == 3 on Hopper GPUs such as H100 and H20.
In my test, I launch three host threads, and each thread asynchronously submits cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream) to its own non-default stream. The host buffers are pinned, and the three threads submit as close to simultaneously as possible.
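For reference, a minimal sketch of the kind of test described above might look like the following. The 4 KiB buffer size, the iteration count, the use of device 0, and the cudaStreamNonBlocking flag are assumptions for illustration, not the exact code behind the timeline.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

int main() {
    // Report the value the question is about.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    const int kThreads = 3;        // one host thread per stream
    const size_t kBytes = 4096;    // assumed "small IO" size
    const int kIters = 1000;       // assumed number of copies per thread

    std::vector<cudaStream_t> streams(kThreads);
    std::vector<void*> host(kThreads), dev(kThreads);
    for (int i = 0; i < kThreads; ++i) {
        // Non-default stream; the non-blocking flag is an assumption.
        CHECK(cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking));
        CHECK(cudaMallocHost(&host[i], kBytes));  // pinned host buffer
        CHECK(cudaMalloc(&dev[i], kBytes));
    }

    // Each thread submits HtoD copies to its own stream, then waits.
    std::vector<std::thread> workers;
    for (int i = 0; i < kThreads; ++i) {
        workers.emplace_back([&, i] {
            CHECK(cudaSetDevice(0));
            for (int it = 0; it < kIters; ++it) {
                CHECK(cudaMemcpyAsync(dev[i], host[i], kBytes,
                                      cudaMemcpyHostToDevice, streams[i]));
            }
            CHECK(cudaStreamSynchronize(streams[i]));
        });
    }
    for (auto& w : workers) w.join();

    for (int i = 0; i < kThreads; ++i) {
        CHECK(cudaFree(dev[i]));
        CHECK(cudaFreeHost(host[i]));
        CHECK(cudaStreamDestroy(streams[i]));
    }
    return 0;
}
```

Something like this can be built with nvcc and profiled under Nsight Systems (for example, nsys profile ./repro) to check whether the Memcpy HtoD rows of the three streams overlap.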
However, when I inspect the timeline in Nsight Systems, I observe the following:
- The red region shows that the three CPU threads are indeed issuing cudaMemcpyAsync concurrently.
- The green region shows the actual Memcpy HtoD operations in the three CUDA streams.
- The three Memcpy HtoD operations appear to be serialized, rather than executing simultaneously.
So although asyncEngineCount reports 3, I do not see three HtoD copies running in parallel.
My questions are:
- On Hopper, what exactly does asyncEngineCount == 3 mean?
- Does it mean that three copy engines can execute transfers in parallel?
This question is important for a small-IO scenario I am studying.
In the small-IO case, the effective bandwidth is heavily determined by submission latency. My expectation was that if the GPU has multiple copy engines, then using multiple CPU threads to concurrently issue cudaMemcpyAsync might improve throughput by overlapping several HtoD transfers.
But in practice, increasing CPU-side concurrency does not increase bandwidth proportionally. That seems to imply that, even if the GPU has multiple copy engines, small HtoD transfers still cannot effectively run in parallel to improve performance.
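To make the latency-bound reasoning concrete, here is a rough sketch of the model I have in mind, assuming each copy costs a fixed per-transfer overhead plus the payload size divided by the peak link bandwidth. The function name and the numbers in the note below are hypothetical, not measurements.

```cpp
#include <cstddef>

// Simple latency-plus-bandwidth model (an assumption, not a measured law):
//   time per copy  = overhead_s + bytes / peak_bytes_per_s
//   effective rate = bytes / time per copy
// Returns the effective bandwidth in bytes per second.
double effective_bandwidth(std::size_t bytes, double overhead_s,
                           double peak_bytes_per_s) {
    const double payload = static_cast<double>(bytes);
    const double time_s = overhead_s + payload / peak_bytes_per_s;
    return payload / time_s;
}
```

With an assumed 10 µs overhead and 25 GB/s peak, a 4 KiB copy comes out at roughly 0.4 GB/s under this model, so the fixed overhead dominates. If the copies from different streams are serialized, adding threads would not be expected to change that figure much.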
Thanks in advance.
