Queueing device-to-device/peer memcpy stalls concurrent copy operations

fknorr · June 10, 2024, 9:45pm

On our multi-GPU setup, we observe that submitting an async device-to-peer memcpy after a (long) kernel on stream A will stall other memcpys later submitted on stream B, which could otherwise execute concurrently with the first kernel.

// these should run in sequence (stream A)
long_running_kernel<<<..., streamA>>>();
cudaMemcpyAsync(memory1_on_dev1, memory1_on_dev0, size, cudaMemcpyDeviceToDevice, streamA);
// this one should start immediately and run concurrently with the kernel above, but is delayed until after the above copy finishes
cudaMemcpyAsync(memory2_on_dev1, memory2_on_dev0, size, cudaMemcpyDeviceToDevice, streamB);

From the host trace, it appears that neither of the d2d memcpys executes asynchronously at all. Replacing the first d2d memcpy with a d2h followed by an h2d allows the second copy to start immediately.

Full reproducer: d2d.cu.txt (3.3 KB)

The setup is 4x RTX 3090 on Linux 5.15 with CUDA 12.2. The NSys profile suggests that the memcpys are host-staged, which might play a role in the observed behavior (?).

Questions:

1. Is this expected behavior, and if so, what are the exact preconditions to observe it? Does it only trigger when “eagerly” submitting async peer-to-peer copies onto a stream that already has work on it? It appears that we can prevent this from happening by ensuring that we only ever submit d2d copies to a stream that has no pending work.
2. Is it possible to have our memcpys behave as “real” peer-to-peer copies without host staging, maybe circumventing this problem? According to the CUDA C Programming Guide, it appears that we would have to disable the IOMMU or move to a virtual machine to make this work. Attempting to call cudaDeviceEnablePeerAccess crashes the Linux kernel on our machine, which suggests that maybe the hardware isn’t configured right. Ideally there would be a machine-independent solution to this.

Robert_Crovella · June 10, 2024, 11:16pm

That doesn’t look like an appropriate call to cudaDeviceEnablePeerAccess(). Have you read the documentation? Perhaps you don’t understand how to enable peer access. The simpleP2P sample code provides an example.

Without peer enablement, yes, the device-to-device copies will be host staged. And if peer access is not enabled, speaking for myself, I wouldn’t call it a “peer memcpy”.

fknorr · June 11, 2024, 5:14am

Apologies, I was paraphrasing that code in the OP. The call in question was:

int canAccess = -1;
cudaDeviceCanAccessPeer(&canAccess, 0, 1); // => canAccess = 1
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1 /* peer */, 0 /* flags */); // => OS kernel lockup

That is certainly a bug worth fixing, but regardless there are probably systems where canAccess should be 0 above (maybe our system because of IOMMU?) and we need an understanding of how we can achieve async memcpy between device memories, even if CUDA stages it through the host. With host staging CUDA does some overlapping which is much faster than doing the full d2h → h2d in sequence manually.
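For reference, a minimal sketch of the query-then-enable sequence with error checking, so that a program can fall back to another copy strategy when peer access is unavailable. The helper name `try_enable_peer` is hypothetical, and the sketch assumes a system where the enable call returns normally; it does not work around the OS lockup described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if device `dev` can directly access
// memory on device `peer` after this call.
static bool try_enable_peer(int dev, int peer) {
    int can_access = 0;
    if (cudaDeviceCanAccessPeer(&can_access, dev, peer) != cudaSuccess || !can_access) {
        return false;
    }
    if (cudaSetDevice(dev) != cudaSuccess) {
        return false;
    }
    cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0 /* flags, must be 0 */);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();  // clear the non-fatal error state
        return true;
    }
    return err == cudaSuccess;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        std::printf("need at least two CUDA devices\n");
        return 0;
    }
    // Peer access is granted per direction, so each direction is enabled separately.
    bool p01 = try_enable_peer(0, 1);
    bool p10 = try_enable_peer(1, 0);
    std::printf("peer access 0->1: %s, 1->0: %s\n",
                p01 ? "enabled" : "unavailable",
                p10 ? "enabled" : "unavailable");
    return 0;
}
```

If either call reports "unavailable", copies between the two devices are host staged and the behavior discussed in this thread applies.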

fknorr · June 11, 2024, 6:00am

Investigating this some more, host-staged memcpy between devices also appears to host-synchronize with the stream and then cause the following operation on the same stream or device to host-synchronize as well.

cudaMemcpyAsync(memory_on_dev0, memory_on_dev1, size, cudaMemcpyDeviceToDevice, stream);
kernel<<<..., stream>>>();

gives

[Profiler screenshot: kernelAfterD2d]

and

kernel<<<..., stream>>>();
cudaMemcpyAsync(memory_on_dev0, memory_on_dev1, size, cudaMemcpyDeviceToDevice, stream);

produces

[Profiler screenshot: d2dAfterKernel]

I would expect both submissions to be async, with control returning to the calling thread immediately rather than only after the first operation has completed (streams are created non-blocking).

Robert_Crovella · June 11, 2024, 1:16pm

It’s been a while since I took a close look at the staging behavior. However I believe that the transfer proceeds in chunks, so that individual chunk transfers can overlap (i.e. the D2H and H2D portions of the staged transfer through host memory can enjoy some level of overlap with each other). This necessitates that the transfer proceed in stages, which necessitates CPU intervention. Such a transfer cannot be done directly by programming a DMA engine one time, the way it happens with typical async transfers. I’m fairly confident this mechanism is what gives rise to the host blocking behavior.

If you don’t like this, there are two options I can think of:

1. See if your platform is capable of P2P from a topology perspective. If so, work with the platform provider to fix the P2P issue you are witnessing. And it might be the case that RTX 3090 simply doesn’t support P2P via PCIe, in which case this item really refers to the idea of switching to a different platform that is P2P capable, whatever that may entail.
2. Take control of the transfer process yourself: create a host allocation of sufficient size that is pinned, and then do a D2H transfer to that allocation followed by H2D. This isn’t hugely worse than a staged transfer, which also has a D2H and H2D component. However, you will no longer get overlap and so the transfer duration may be longer. On the flip side, all of the requested operations can then become fully async, which seems to be your main goal.
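The second option could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the buffer size, the names `staging`, `s0`, `s1` and `d2h_done`, and the use of devices 0 and 1 are all made up for the example. The D2H copy runs on a stream of the source device, and an event makes the H2D copy on the destination device's stream wait for it without blocking the host.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e_ = (call);                                         \
        if (e_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s\n", #call,               \
                         cudaGetErrorString(e_));                        \
            std::exit(1);                                                \
        }                                                                \
    } while (0)

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) { std::printf("need at least two CUDA devices\n"); return 0; }

    const size_t size = 64u << 20;  // 64 MiB, arbitrary example size
    void *src = nullptr, *dst = nullptr, *staging = nullptr;
    cudaStream_t s0, s1;
    cudaEvent_t d2h_done;

    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, size));
    CHECK(cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking));
    CHECK(cudaEventCreateWithFlags(&d2h_done, cudaEventDisableTiming));

    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, size));
    CHECK(cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking));

    // Pinned host buffer of the full transfer size, usable from both devices.
    CHECK(cudaHostAlloc(&staging, size, cudaHostAllocPortable));

    // Device 0 -> pinned host memory, asynchronous with respect to the host.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMemcpyAsync(staging, src, size, cudaMemcpyDeviceToHost, s0));
    CHECK(cudaEventRecord(d2h_done, s0));

    // Pinned host memory -> device 1, ordered after the D2H copy by the event.
    CHECK(cudaSetDevice(1));
    CHECK(cudaStreamWaitEvent(s1, d2h_done, 0));
    CHECK(cudaMemcpyAsync(dst, staging, size, cudaMemcpyHostToDevice, s1));

    CHECK(cudaStreamSynchronize(s1));  // only to know when it is safe to clean up
    std::printf("staged copy finished\n");

    CHECK(cudaFreeHost(staging));
    CHECK(cudaFree(dst));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    CHECK(cudaEventDestroy(d2h_done));
    CHECK(cudaStreamDestroy(s0));
    return 0;
}
```

The final `cudaStreamSynchronize` is there only so the example can free its buffers; in a real pipeline, later work would be enqueued on `s1` or ordered behind another event instead.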

No, I don’t have suggestions to get all the benefits of a P2P transfer when P2P support is not available at the platform level.

fknorr · June 11, 2024, 4:29pm

Thanks Robert for the detailed answer. It appears then that the best solution after fixing the system setup to enable P2P would be to do the D2H → H2D copy with some form of user-space chunking for larger transfers, tied together by events to keep things asynchronous.
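A rough sketch of what such user-space chunking might look like, using two pinned staging buffers so that the H2D copy of one chunk can overlap the D2H copy of the next. Chunk size, buffer count, and all names are assumptions chosen for illustration; whether this matches the overlap CUDA achieves internally would need to be measured.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e_ = (call);                                         \
        if (e_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s\n", #call,               \
                         cudaGetErrorString(e_));                        \
            std::exit(1);                                                \
        }                                                                \
    } while (0)

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) { std::printf("need at least two CUDA devices\n"); return 0; }

    const size_t size = 256u << 20;  // total transfer, arbitrary
    const size_t chunk = 8u << 20;   // chunk size, a tuning assumption
    char *src = nullptr, *dst = nullptr, *staging[2] = {nullptr, nullptr};
    cudaStream_t s0, s1;
    cudaEvent_t d2h_done[2], h2d_done[2];

    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc((void**)&src, size));
    CHECK(cudaMemset(src, 0xAB, size));  // recognizable pattern; completes before the sync below
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking));
    for (auto& e : d2h_done) CHECK(cudaEventCreateWithFlags(&e, cudaEventDisableTiming));

    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc((void**)&dst, size));
    CHECK(cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking));
    for (auto& e : h2d_done) CHECK(cudaEventCreateWithFlags(&e, cudaEventDisableTiming));

    for (auto& p : staging) CHECK(cudaHostAlloc((void**)&p, chunk, cudaHostAllocPortable));

    int i = 0;
    for (size_t off = 0; off < size; off += chunk, ++i) {
        const size_t n = std::min(chunk, size - off);
        const int slot = i % 2;

        CHECK(cudaSetDevice(0));
        // Do not overwrite a staging buffer before its previous H2D copy has drained it.
        if (i >= 2) CHECK(cudaStreamWaitEvent(s0, h2d_done[slot], 0));
        CHECK(cudaMemcpyAsync(staging[slot], src + off, n, cudaMemcpyDeviceToHost, s0));
        CHECK(cudaEventRecord(d2h_done[slot], s0));

        CHECK(cudaSetDevice(1));
        // The H2D copy of this chunk starts once its D2H copy has finished.
        CHECK(cudaStreamWaitEvent(s1, d2h_done[slot], 0));
        CHECK(cudaMemcpyAsync(dst + off, staging[slot], n, cudaMemcpyHostToDevice, s1));
        CHECK(cudaEventRecord(h2d_done[slot], s1));
    }

    CHECK(cudaStreamSynchronize(s1));

    // Spot-check the last byte on the destination device.
    unsigned char last = 0;
    CHECK(cudaMemcpy(&last, dst + size - 1, 1, cudaMemcpyDeviceToHost));
    std::printf("last byte on device 1: 0x%02X\n", last);

    for (auto& p : staging) CHECK(cudaFreeHost(p));
    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    return 0;
}
```

Each `cudaStreamWaitEvent` refers to the most recent record of its event at the time of the call, which is what allows the two events per direction to be reused across chunks. Stream and event destruction is omitted for brevity.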

