I profiled a CUDA application using Nsight Systems (nsys) and noticed that during the DtoD phase of the timeline, there are calls to cudaMemcpyAsync.
This confused me a bit, because my understanding was that cudaMemcpyAsync usually appears in HToD or DToH transfers.
Is it normal to see cudaMemcpyAsync in DToD during kernel execution or framework-level operations (e.g., PyTorch / CUDA runtime)?
What are the typical reasons for this behavior? For example:
- Internal tensor reordering or layout conversion?
- Temporary buffers or implicit copies introduced by the framework?
- Use of `cudaMemcpyAsync` to implement device-side copies instead of a kernel?
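For reference, here is a minimal standalone sketch (hypothetical, not taken from my actual workload) of the last case: when both pointers passed to cudaMemcpyAsync are device allocations, the runtime records the transfer as DtoD, so this is what I'd expect such a trace entry to correspond to:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both src and dst are device pointers, so nsys shows this
    // as a DtoD memcpy -- the API call is the same cudaMemcpyAsync
    // used for HtoD/DtoH, only the kind/direction differs.
    cudaMemcpyAsync(dst, src, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaFree(src);
    cudaFree(dst);
    cudaStreamDestroy(stream);
    return 0;
}
```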
Any clarification on when and why DtoD cudaMemcpyAsync calls appear would be greatly appreciated.
Thanks in advance!