Why D2D transfers for a single GPU?

I am running a PyTorch application on an Ubuntu machine with an NVIDIA GTX 1650. When I profile the application using NVIDIA Nsight Systems, I see an entry in the CUDA GPU trace labelled as a Device-to-Device transfer. Looking at the CUDA API call responsible for this entry, I saw that it is cudaMemcpyAsync. However, I don’t understand this behaviour, because I have only one GPU in my workstation. Under what circumstances does cudaMemcpyAsync cause device-to-device transfers when there is a single GPU?
Here is the output of nvidia-smi

Wed Dec 14 19:44:32 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 24%   37C    P5    N/A /  75W |    485MiB /  3908MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Thanks

One possibility is that your async memory copy is using pageable memory, and the device-to-device entry you see is actually the async copy being degraded to a synchronous one.
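
For reference, in a PyTorch application this situation typically arises from a host-to-device copy whose source is an ordinary (pageable) CPU tensor. A minimal sketch of the pattern, with illustrative tensor names (not taken from your code):

import torch

x_cpu = torch.randn(4, 1024, 1024)            # regular CPU tensor, backed by pageable memory
x_gpu = x_cpu.to('cuda', non_blocking=True)   # PyTorch issues cudaMemcpyAsync, but with a
                                              # pageable source the driver has to stage the
                                              # copy, so it effectively runs synchronously
torch.cuda.synchronize()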

Can you run the expert system rule cuda-async-memcopy, either from the CLI (User Guide :: Nsight Systems Documentation (direct pointer, no matter what the link text says)) or from the GUI (go to the events pane drop-down in the bottom part of the screen, select the expert system, and then this rule)?

Here is the output of the command:

Duration	Start	Src Kind	Dst Kind	Bytes	PID	Device ID	Context ID	Stream ID	API Name
28.353 ms	15.1726s	Pageable	Device	250.05 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.858 ms	15.2648s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.848 ms	15.2899s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.793 ms	15.2785s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.790 ms	15.3207s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.783 ms	15.2362s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.773 ms	15.2674s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.773 ms	15.2121s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.771 ms	15.2387s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.765 ms	15.2526s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.755 ms	15.2442s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.755 ms	15.2467s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.739 ms	15.2146s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.732 ms	15.2989s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.730 ms	15.2282s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.726 ms	15.2551s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.722 ms	15.2227s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.721 ms	15.2306s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.718 ms	15.2203s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.715 ms	15.2761s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.713 ms	15.3233s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.710 ms	15.3123s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.709 ms	15.3098s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.700 ms	15.3015s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
1.698 ms	15.2874s	Pageable	Device	16.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
538.300 μs	15.2948s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
506.331 μs	15.2821s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
497.498 μs	15.2501s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
475.609 μs	15.2638s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
470.840 μs	15.2172s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
458.296 μs	15.263s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
456.985 μs	15.2941s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
453.304 μs	15.298s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
449.303 μs	15.2813s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
445.687 μs	15.2957s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
443.608 μs	15.2622s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
437.015 μs	15.2509s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
433.334 μs	15.2933s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
433.175 μs	15.218s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
433.111 μs	15.2418s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
430.135 μs	15.209s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
430.071 μs	15.2752s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
428.150 μs	15.2266s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
427.639 μs	15.2615s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
427.639 μs	15.3061s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
427.446 μs	15.2353s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
426.679 μs	15.2492s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
426.646 μs	15.3054s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
426.422 μs	15.2729s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020
426.007 μs	15.2259s	Pageable	Device	4.00 MiB	45493	0	1	7	cudaMemcpyAsync_v3020

There is also a suggestion pointing in the same direction as your answer:

The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy operations to block and be executed synchronously. This leads to low GPU utilization.
Suggestion: If applicable, use PINNED memory instead.
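
For later readers: in PyTorch this suggestion maps to pinning the host tensor (or using pin_memory=True in a DataLoader) before the transfer. A minimal sketch, with illustrative names:

import torch

x_cpu = torch.randn(4, 1024, 1024).pin_memory()  # host tensor now lives in pinned (page-locked) memory
x_gpu = x_cpu.to('cuda', non_blocking=True)       # this cudaMemcpyAsync can now run truly asynchronously
torch.cuda.synchronize()

For data loading, torch.utils.data.DataLoader(..., pin_memory=True) achieves the same thing for each batch.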

Thank you for the insight. However, is there any source that logically justifies this behaviour? Why should there be a device-to-device transfer when the memory copy is synchronous and from pageable memory?

I don’t know that there is a DtoD transfer going on, but we are being told by CUPTI that there is.

@jyi or @liuyis thoughts?

Hi @puneethnaik, according to the documentation a DtoD copy is expected on a single GPU: it means the source and destination are both in GPU memory, i.e. it is neither HtoD nor DtoH.

Memory copies across multiple GPUs are reported as PtoP (peer-to-peer) copies.
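
For illustration, a copy whose source and destination are both tensors on the same (single) GPU is what shows up as a DtoD memcpy. A minimal PyTorch sketch, with illustrative names:

import torch

src = torch.randn(4, 1024, 1024, device='cuda')  # both tensors live on the one GPU
dst = torch.empty_like(src)
dst.copy_(src)            # a contiguous, same-dtype copy is typically dispatched to
                          # cudaMemcpyAsync with a Device-to-Device kind
torch.cuda.synchronize()

With two GPUs, a copy such as src.to('cuda:1') would instead be reported as a PtoP copy (when peer access is available).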