I am running a PyTorch application on an Ubuntu machine with an NVIDIA GTX 1650. When I profile the application with NVIDIA Nsight Systems, I see an entry in the CUDA GPU trace for a Device-to-Device transfer. Looking at the CUDA API call responsible for this entry, I found that it is a cudaMemcpyAsync. I don't understand this behaviour, since I have only one GPU in my workstation. Under what circumstances does a cudaMemcpyAsync cause a device-to-device transfer when there is only a single GPU?
Here is the output of nvidia-smi:
Wed Dec 14 19:44:32 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 24%   37C    P5    N/A /  75W |    485MiB /  3908MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
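For reference, a minimal sketch of the kind of PyTorch call that is backed by cudaMemcpyAsync under the hood (placeholder shapes, not my actual code):

import torch

device = torch.device("cuda:0")  # the single GTX 1650

# An ordinary (pageable) host tensor copied to the GPU asynchronously.
host_tensor = torch.randn(1024, 1024)
gpu_tensor = host_tensor.to(device, non_blocking=True)

torch.cuda.synchronize()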
One possibility is that your async memory copy is trying to use pageable memory, and the device-to-device transfer you see is actually the async copy being degraded to a synchronous one.
Can you run the expert systems rule cuda-async-memcopy, either from the CLI (see User Guide :: Nsight Systems Documentation; that link is a direct pointer, no matter what the text says) or from the GUI (go to the events pane drop-down in the bottom part of the screen, select the expert system, and then this rule)?
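If pageable memory does turn out to be the cause, here is a rough sketch of the pinned-memory fix in PyTorch (assuming the copies come from host tensors; your allocation pattern may differ):

import torch

device = torch.device("cuda:0")

# Pin the host allocation so cudaMemcpyAsync can stay truly asynchronous
# instead of silently falling back to a synchronous copy.
host_tensor = torch.randn(1024, 1024).pin_memory()
gpu_tensor = host_tensor.to(device, non_blocking=True)

# For input pipelines, DataLoader can pin the batches for you:
# loader = torch.utils.data.DataLoader(dataset, pin_memory=True)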
There is also a suggestion from the rule pointing in the same direction as the one you gave:
The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy operations to block and be executed synchronously. This leads to low GPU utilization.
Suggestion: If applicable, use PINNED memory instead.
Thank you for the insight. However, is there any source that logically justifies this behavior? Why should there be a device-to-device transfer when the memory copy is synchronous and from pageable memory?
Hi @puneethnaik, according to the documentation a DtoD copy is expected on a single GPU: it means the source and destination are both in GPU memory, i.e. the copy is not HtoD or DtoH.
A memory copy across multiple GPUs is reported as a PtoP (Peer-to-Peer) copy instead.
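For example, in a hypothetical PyTorch snippet like the one below (placeholder names, not taken from the original application), both copies have their source and destination on the single GPU; for contiguous tensors PyTorch typically lowers them to cudaMemcpyAsync, which the trace then reports as DtoD:

import torch

device = torch.device("cuda:0")
a = torch.randn(1024, 1024, device=device)

# Source and destination both live in the memory of GPU 0,
# so these copies show up as Device-to-Device transfers.
b = a.clone()
c = torch.empty_like(a)
c.copy_(a)

torch.cuda.synchronize()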