Queueing device-to-device/peer memcpy stalls concurrent copy operations

On our multi-GPU setup, we observe that submitting an async device-to-peer memcpy after a (long) kernel on stream A will stall other memcpys later submitted on stream B, which could otherwise execute concurrently with the first kernel.

// these should run in sequence (stream A)
long_running_kernel<<<..., streamA>>>();
cudaMemcpyAsync(memory1_on_dev1, memory1_on_dev0,  cudaMemcpyDeviceToDevice, streamA);
// this one should start immediately and run concurrently to the kernel above, but is delayed until after the above copy finishes
cudaMemcpyAsync(memory2_on_dev1, memory2_on_dev0,  cudaMemcpyDeviceToDevice, streamB);

From the host trace, it appears that neither of the d2d memcpys execute asynchronously at all. Replacing the first d2d memcpy with a d2h followed by a h2d allows the second copy to start immediately.

Full reproducer: d2d.cu.txt (3.3 KB)

The setup is 4x RTX 3090 on Linux 5.15 with CUDA 12.2. The NSys profile suggests that the memcpys are host-staged, which might play a role in the observed behavior (?).

Questions:

  1. Is this expected behavior, and if so, what are the exact preconditions to observe it? Does it only trigger when “eagerly” submitting async peer-to-peer copies onto a stream that has already work on it? It appears that we can prevent this from happening by ensuring that we only ever submit d2d copies to a stream that has no pending work.
  2. Is it possible to have our memcpys behave as “real” peer-to-peer copies without host staging, maybe circumventing this problem? According to 1. Introduction — CUDA C Programming Guide , it appears that we would have to disable the IOMMU or move to a virtual machine to make this work. Attempting to call cudaDeviceEnablePeerAccess crashes the LInux kernel on our machine, which suggests that maybe the hardware isn’t configured right. Ideally there would be a machine-independent solution to this.

That doesn’t look like an appropriate call to cudaDeviceEnablePeerAccess(). Have you read the documentation? Perhaps you don’t understand how to enable peer access. The simpleP2P sample code provides an example.

Without peer enablement, yes, the device-to-device copies will be host staged. And if peer access is not enabled, speaking for myself, I wouldn’t call it a “peer memcpy”

Apologies, I was paraphrasing that code in the OP. The call in question was:

int canAccess = -1;
cudaDeviceCanAccessPeer(&canAccess, 0, 1); // => canAccess = 1
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1 /* peer */, 0 /* flags */); // => OS kernel lockup

That is certainly a bug worth fixing, but regardless there are probably systems where canAccess should be 0 above (maybe our system because of IOMMU?) and we need an understanding of how we can achieve async memcpy between device memories, even if CUDA stages it through the host. With host staging CUDA does some overlapping which is much faster than doing the full d2h → h2d in sequence manually.

Investigating this some more, host-staged memcpy between devices also appears to host-synchronize with the stream and then cause the following operation on the same stream or device to host-synchronize as well.

cudaMemcpyAsync(memory_on_dev0, memory_on_dev1, size, cudaMemcpyDeviceToDevice, stream);
kernel<<<..., stream>>>();

gives

kernelAfterD2d

and

kernel<<<..., stream>>>();
cudaMemcpyAsync(memory_on_dev0, memory_on_dev1, size, cudaMemcpyDeviceToDevice, stream);

produces

d2dAfterKernel

I would expect both submissions to be async and control returning to the calling thread immediately instead of after the first operation having completed (streams are created non-blocking).

It’s been a while since I took a close look at the staging behavior. However I believe that the transfer proceeds in chunks, so that individual chunk transfers can overlap (i.e. the D2H and H2D portions of the staged transfer through host memory can enjoy some level of overlap with each other). This necessitates that the transfer proceed in stages, which necessitates CPU intervention. Such a transfer cannot be done directly by programming a DMA engine one time, the way it happens with typical async transfers. I’m fairly confident this mechanism is what gives rise to the host blocking behavior.

If you don’t like this, there are two options I can think of:

  1. See if your platform is capable of P2P from a topology perspective. If so, work with the platform provider to fix the P2P issue you are witnessing. And it might be the case that RTX 3090 simply doesn’t support P2P via PCIE, in which case this item really refers to the idea of switching to a different platform that is P2P capable, whatever that may entail.

  2. Take control of the transfer process yourself: create a host allocation of sufficient size that is pinned, and then do a D2H transfer to that allocation followed by H2D. This isn’t hugely worse than a staged transfer, which also has a D2H and H2D component. However, you will no longer get overlap and so the transfer duration may be longer. On the flip side, all of the requested operations can then become fully async, which seems to be your main goal.

No, I don’t have suggestions to get all the benefits of a P2P transfer when P2P support is not available at the platform level.

Thanks Robert for the detailed answer. It appears then that the best solution after fixing the system setup to enable P2P would be to do the D2H → H2D copy with some form of user-space chunking for larger transfers, tied together by events to keep things asynchronous.