Why does cudaMemcpyAsync appear in DToD?

I profiled a CUDA workload using Nsight Systems (nsys) and noticed that during the DToD phase there are calls to cudaMemcpyAsync.

This confused me a bit, because my understanding was that cudaMemcpyAsync usually appears in HToD or DToH transfers.

Is it normal to see cudaMemcpyAsync in DToD during kernel execution or framework-level operations (e.g., PyTorch / CUDA runtime)?

What are the typical reasons for this behavior? For example:

  • Internal tensor reordering or layout conversion?
  • Temporary buffers or implicit copies introduced by the framework?
  • Use of cudaMemcpyAsync to implement device-side copies instead of a kernel?

Any clarification on when and why DToD cudaMemcpyAsync appears would be greatly appreciated.
Thanks in advance!

There is nothing special about it. cudaMemcpy(Async) is the standard way to copy data between two buffers using the CUDA runtime API, and each buffer is allowed to live in either CPU (host) memory or GPU (device) memory.

So in total there are four possible copy directions: host to host (HToH), host to device (HToD), device to host (DToH), and device to device (DToD).
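To make the DToD case concrete, here is a minimal sketch (buffer size and names are illustrative) of a device-to-device copy issued with cudaMemcpyAsync; a call like this is exactly what nsys reports as a DToD memcpy:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;  // 1M floats, an arbitrary size for illustration
    float *src = nullptr, *dst = nullptr;

    // Both buffers live in GPU memory.
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Because both pointers are device allocations, this enqueues a
    // device-to-device copy; profilers classify it as DToD.
    cudaMemcpyAsync(dst, src, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaFree(dst);
    printf("DToD copy completed\n");
    return 0;
}
```

Frameworks like PyTorch issue calls of this form internally, e.g. when cloning a tensor or making a non-contiguous tensor contiguous, which is why DToD copies show up in a profile even if your own code never calls cudaMemcpyAsync.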