Can nvprof profile inter-process peer-to-peer communication?

I’m trying to profile the amount of data transferred by a Horovod data-parallel training program with nvprof. The program spawns multiple processes during training, one process per GPU. I’m wondering whether nvprof can profile all inter-process GPU peer-to-peer communication.

If nvprof can profile all inter-process GPU peer-to-peer communication, I have a follow-up question: the output of nvprof only contains cudaMemcpy DtoD records, never cudaMemcpy PtoP. If the answer to the previous question is yes, that would mean all device-to-device transfers are recorded as cudaMemcpy DtoD, whether they are peer-to-peer or not.

Is there any way to tell whether a given device-to-device transfer is peer-to-peer or not?

Hi,

Nvprof can profile multiple processes.

You should use the “--profile-child-processes” option if the application spawns multiple processes.
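For example, an invocation might look like the following (the log-file name and the launch command are placeholders for your own; %p is expanded to the process ID so each child process gets its own file):

nvprof --profile-child-processes --log-file profile.%p.out <your application launch command>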

I have a few questions -

Are you using CUDA MPS to achieve inter-process GPU peer-to-peer communication?

Have you enabled bidirectional peer-to-peer access, using cudaDeviceEnablePeerAccess(), for all GPU combinations? (A minimal sketch follows these questions.)

Are all of the GPUs identical? Which GPU and toolkit are you using?
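For reference, enabling bidirectional peer access for every GPU pair typically looks something like the sketch below. This is only an illustration (error checking omitted), not code taken from Horovod or your application:

#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    // For every ordered pair (i, j), let device i access device j's memory
    // if the hardware topology allows it.
    for (int i = 0; i < deviceCount; ++i) {
        cudaSetDevice(i);                          // subsequent calls apply to device i
        for (int j = 0; j < deviceCount; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            if (canAccess) {
                cudaDeviceEnablePeerAccess(j, 0);  // flags must be 0
            }
        }
    }
    return 0;
}

Enabling access for both orderings of each pair (i to j and j to i) is what makes the access bidirectional.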

Hi, thanks for the reply

I did use --profile-child-processes when profiling; there are multiple output files after profiling.

Are you using CUDA MPS to achieve inter-process GPU peer-to-peer communication?
No

Have you enabled bidirectional peer-to-peer access, using cudaDeviceEnablePeerAccess(), for all GPU combinations?
No, I didn’t write any CUDA code myself

All of my GPUs are Tesla P40

I just ran the following command in a Docker container:
nvprof --print-gpu-trace --print-api-trace --profile-child-processes --normalized-time-unit s --csv --log-file nvprof.%p.out horovodrun -np 4 --log-level INFO python hvd.py 0

During execution, the following messages are displayed:
[1,1]:9bd03eed07df:28626:28761 [1] NCCL INFO Ring 00 : 1[b6000] -> 2[b7000] via P2P/IPC
[1,2]:9bd03eed07df:28627:28762 [2] NCCL INFO Ring 00 : 2[b7000] -> 3[b8000] via P2P/IPC
[1,3]:9bd03eed07df:28628:28763 [3] NCCL INFO Ring 00 : 3[b8000] -> 0[b5000] via P2P/IPC
[1,0]:9bd03eed07df:28625:28769 [0] NCCL INFO Ring 00 : 0[b5000] -> 1[b6000] via P2P/IPC
[1,3]:9bd03eed07df:28628:28763 [3] NCCL INFO Ring 00 : 3[b8000] -> 2[b7000] via P2P/IPC
[1,1]:9bd03eed07df:28626:28761 [1] NCCL INFO Ring 00 : 1[b6000] -> 0[b5000] via P2P/IPC
[1,2]:9bd03eed07df:28627:28762 [2] NCCL INFO Ring 00 : 2[b7000] -> 1[b6000] via P2P/IPC
[1,3]:9bd03eed07df:28628:28763 [3] NCCL INFO Ring 01 : 3[b8000] -> 0[b5000] via P2P/IPC
[1,0]:9bd03eed07df:28625:28769 [0] NCCL INFO Ring 01 : 0[b5000] -> 1[b6000] via P2P/IPC
[1,2]:9bd03eed07df:28627:28762 [2] NCCL INFO Ring 01 : 2[b7000] -> 3[b8000] via P2P/IPC
[1,1]:9bd03eed07df:28626:28761 [1] NCCL INFO Ring 01 : 1[b6000] -> 2[b7000] via P2P/IPC
[1,3]:9bd03eed07df:28628:28763 [3] NCCL INFO Ring 01 : 3[b8000] -> 2[b7000] via P2P/IPC
[1,2]:9bd03eed07df:28627:28762 [2] NCCL INFO Ring 01 : 2[b7000] -> 1[b6000] via P2P/IPC
[1,1]:9bd03eed07df:28626:28761 [1] NCCL INFO Ring 01 : 1[b6000] -> 0[b5000] via P2P/IPC
According to the above messages, I think P2P is enabled by Horovod, but I can only find three types of data transfer records in the output of nvprof:
[CUDA memcpy DtoD]
[CUDA memcpy DtoH]
[CUDA memcpy HtoD]

So I’m wondering whether peer-to-peer data transfers are really recorded, and whether there is a way to distinguish GPU peer-to-peer communication from device-to-device transfers that are not peer-to-peer.

Some P2Ps may be incorrectly classified as D2D.
We are looking into reproducing this issue and will try to provide a fix if it is confirmed.

Hi, is there any update?
Thanks.

Hi,

Since Horovod uses NCCL, nvprof will not be able to profile the inter-GPU communication. This is because NCCL does not use cudaMemcpy operations; it performs inter-GPU communication by launching CUDA kernels that access remote buffers. Those operations will not be shown as copies but as normal compute kernels (with names starting with “nccl”).
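To confirm this in the traces you already collected, you could, for example, search the per-process log files (the nvprof.%p.out files produced by your command) for kernel names that start with “nccl”:

grep -i nccl nvprof.*.out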

The cudaMemcpy DtoD operations are likely coming from the DL framework, i.e. TensorFlow, and they are local to the GPU.

Thanks a lot.
Is there any way to find out the amount of data transferred by NCCL (for example, using other profiling tools provided by NVIDIA)?