GPU-GPU Communication with nvprof

Hi everyone,
I’m trying to profile a distributed data parallel training of a deep learning model running on 2 GPUs. The model is implemented in PyTorch, and uses PyTorch’s NCCL back end for GPU communication.
I am using NVIDIA’s PyProf from a Linux terminal (it “aggregates kernel performance from Nsight Systems or nvprof”), and the kernel times it reports for computation look reasonable.
But for communication, I am getting strange results.
First of all, PyProf treats PyTorch’s NCCL all_reduce/broadcast calls as kernels (namely ncclAllReduceRingLLKernel_sum_f32).

  1. Since I’m profiling both GPUs, I can compare the time each one spends on communication. The NCCL kernel times differ hugely between the two GPUs (10x to 1000x). When I measure the NCCL time with cuda.Event’s elapsed_time, the two GPUs seem to spend the same time, and the bandwidth calculated from that measurement is reasonable.
  • Which measurement is the correct one?
  • Why can’t PyProf measure communication time?
  • Is cuda.Event good for profiling?
  2. Moreover, the GPU with the larger time changes between training iterations. For example, if GPU1 spends 1 sec on allreduce and GPU2 spends 0.1 sec, then after 3 iterations GPU2 starts spending 1 sec while GPU1 spends 0.1 sec.
  3. Last question: when I change the back end to gloo, PyProf gives no output for communication. I wonder why.
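
For anyone who wants to reproduce the cuda.Event measurement mentioned in point 1, here is a minimal sketch of how it could be done. It is not the exact code from my training script. The helper names time_all_reduce_ms and all_reduce_bandwidth are made up for illustration, and the sketch assumes torch.distributed has already been initialized with the NCCL back end and that the tensor lives on the current process’s GPU. The bus-bandwidth factor 2*(n-1)/n is the convention used for all_reduce in NVIDIA’s nccl-tests, not something PyProf reports.

```python
def time_all_reduce_ms(tensor, iters=10):
    """Average time of one all_reduce in milliseconds, measured with CUDA events.

    Assumes torch.distributed is already initialized with the NCCL back end
    and that `tensor` lives on this process's GPU.
    """
    # Imported lazily so the bandwidth helper below also works without PyTorch.
    import torch
    import torch.distributed as dist

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    dist.barrier()            # line the ranks up so waiting is not charged to one side
    torch.cuda.synchronize()  # drain earlier kernels before timing starts
    start.record()
    for _ in range(iters):
        dist.all_reduce(tensor)  # in-place SUM, so the tensor values grow each pass
    end.record()
    torch.cuda.synchronize()  # elapsed_time is only valid once both events completed
    return start.elapsed_time(end) / iters


def all_reduce_bandwidth(num_bytes, seconds, world_size):
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for one all_reduce."""
    alg_bw = num_bytes / seconds / 1e9
    # nccl-tests convention: all_reduce moves 2*(n-1)/n of the data per rank.
    bus_bw = alg_bw * 2 * (world_size - 1) / world_size
    return alg_bw, bus_bw
```

With 2 GPUs the factor is 1, so the algorithm and bus bandwidth are the same number; for example, all_reduce_bandwidth(tensor.numel() * tensor.element_size(), ms / 1000, 2).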

Which versions of Nsight Systems and nvprof are you using?
(You can find the nvprof version with “nvprof --version”.)

The output for nvprof --version:
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2018 NVIDIA Corporation
Release version 10.0.130 (21)

Nvidia Driver Version: 410.78
CUDA Version: 10.0

I have 2 GeForce GTX Titan X GPUs

Can you elaborate on what you mean by “communication time”? Do you want a sum of all DtoH, HtoD, or DtoD memory operations on the GPU?

We are currently working on integrating PyProf into our DLProf tool, which will be fully available in the 20.06 PyTorch container released at the end of June. This will provide a much better analysis of the results. Unfortunately, none of our profiling tools (PyProf, DLProf) can handle multi-GPU profiling correctly at this time. It is on our near-term roadmap.

If you have PyProf-specific questions, you can always ask them on the PyProf GitHub page. The questions will be seen by the PyProf developers and experts.

Dear @dzier, thank you for your answer.
By communication time, I mean the communication between GPUs, so it’s device-to-device. In particular, the deep learning model runs on both GPUs with different data, and the outputs are combined with an all_reduce operation.
For communication, the model uses PyTorch’s NCCL back end, as I mentioned above. I am measuring the time spent in the all_reduce operations.
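
To make clear what I mean by all_reduce here, this is a toy, CPU-only sketch of what the operation computes. It does not use NCCL or PyTorch; the function name simulated_all_reduce_sum and the gradient values are made up for illustration.

```python
def simulated_all_reduce_sum(per_rank_values):
    """Return what each rank holds after a SUM all_reduce.

    per_rank_values[r] is the list of numbers that rank r contributes.
    """
    # Elementwise sum across ranks.
    total = [sum(column) for column in zip(*per_rank_values)]
    # Every rank ends up with its own copy of the same reduced result.
    return [list(total) for _ in per_rank_values]


grads = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical gradients on GPU 0 and GPU 1
print(simulated_all_reduce_sum(grads))  # prints [[4.0, 6.0], [4.0, 6.0]]
```

Both ranks take part in the same collective and both end up with the same result.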

“none of our profiling tools (PyProf, DLProf) can handle multi-GPU profiling correctly”: When I ran with --profile-child-processes, the other kernel times seemed correct to me (e.g. matrix multiplication, addition, softmax, etc.). The suspicious part was the NCCL kernels. Should I not trust the other outputs in the multi-GPU case either?

In your documentation, you describe a profiling option with “torch.distributed.launch”, which means multiple GPUs/processes will be used. So I assumed its output would be correct.