I profiled a CUDA application using Nsight Systems (nsys) and noticed that during the DtoD phase of the timeline, there are calls to cudaMemcpyAsync.
This confused me a bit, because my understanding was that cudaMemcpyAsync usually appears in HToD or DToH transfers.
Is it normal to see cudaMemcpyAsync in DToD during kernel execution or framework-level operations (e.g., PyTorch / CUDA runtime)?
What are the typical reasons for this behavior? For example:
- Internal tensor reordering or layout conversion?
- Temporary buffers or implicit copies introduced by the framework?
- Use of `cudaMemcpyAsync` to implement device-side copies instead of a kernel?
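For reference, here is a minimal standalone sketch (hypothetical, not taken from my actual workload) of the last case: when both pointers passed to cudaMemcpyAsync are device allocations, the runtime records the transfer as DtoD, so this is what I'd expect such a trace entry to correspond to:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both src and dst are device pointers, so nsys shows this
    // as a DtoD memcpy -- the API call is the same cudaMemcpyAsync
    // used for HtoD/DtoH, only the kind/direction differs.
    cudaMemcpyAsync(dst, src, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaFree(src);
    cudaFree(dst);
    cudaStreamDestroy(stream);
    return 0;
}
```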
Any clarification on when and why DtoD cudaMemcpyAsync calls appear would be greatly appreciated.
Thanks in advance!