I’m trying to profile distributed data parallel training of a deep learning model running on 2 GPUs. The model is implemented in PyTorch and uses PyTorch’s NCCL backend for GPU communication.
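For context, the setup is standard DistributedDataParallel, roughly like this (a simplified sketch; the `Linear` layer is just a stand-in for my actual model, and I launch one process per GPU with torchrun):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, launched with torchrun / torch.distributed.launch.
dist.init_process_group(backend="nccl")        # NCCL backend for GPU communication
local_rank = dist.get_rank()                   # 2 GPUs on a single node, so rank == local rank
torch.cuda.set_device(local_rank)

# Placeholder layer standing in for my actual network.
model = torch.nn.Linear(1024, 1024).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across the 2 GPUs
```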
I am using NVIDIA’s PyProf from the Linux terminal (it “aggregates kernel performance from Nsight Systems or nvprof”), and the kernel times it reports for computation look reasonable.
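My profiling workflow roughly follows the PyProf README (simplified here; file names are placeholders and the exact command-line flags may differ in my runs):

```python
import torch
import torch.cuda.profiler as profiler
import pyprof

pyprof.init()  # enable PyProf's NVTX annotations for PyTorch ops

# Profiled region of the training loop:
with torch.autograd.profiler.emit_nvtx():
    profiler.start()
    # ... a few training iterations ...
    profiler.stop()

# From the terminal, something like:
#   nvprof -f -o train.sql --profile-from-start off python train.py
#   python -m pyprof.parse train.sql > train.dict
#   python -m pyprof.prof --csv train.dict
# (or the equivalent Nsight Systems flow with `nsys profile`)
```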
But for communication, I am getting strange results:
First of all, PyProf reports PyTorch’s NCCL all_reduce/broadcast calls as kernels (namely ncclAllReduceRingLLKernel_sum_f32).
- Since I’m profiling both GPUs, I can compare the time each one spends on communication. There is a big difference between the two GPUs for the NCCL kernels (10x to 1000x). When I instead measure the NCCL time with cuda.Event’s elapsed_time (see the timing sketch at the end of this post), the two GPUs seem to spend the same time, and the bandwidth calculated from that measurement is reasonable.
- Which measurement is the correct one?
- Why can’t PyProf measure the communication time correctly?
- Is cuda.Event a reliable way to profile this?
- Moreover, which GPU shows the larger time changes between training iterations. For example, if GPU1 spends 1 s on allreduce and GPU2 spends 0.1 s, then after 3 iterations GPU2 starts spending 1 s while GPU1 spends 0.1 s.
- Last question: when I change the backend to gloo, PyProf doesn’t give any output for communication at all. I’d like to understand why.
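For reference, this is roughly how I time the all_reduce with cuda.Event (a simplified sketch; it assumes the process group from the setup above is already initialized, and the bandwidth formula is the nccl-tests-style bus bandwidth, which may not match exactly what I use):

```python
import torch
import torch.distributed as dist

def time_allreduce(tensor, iters=10):
    # CUDA events record timestamps on the GPU stream, not host wall-clock time.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        dist.all_reduce(tensor)
    end.record()
    torch.cuda.synchronize()                 # wait for the NCCL kernels to finish

    ms = start.elapsed_time(end) / iters     # milliseconds per all_reduce
    # Ring all-reduce moves about 2 * (N - 1) / N of the buffer per GPU;
    # with N = 2 GPUs that is roughly 1x the buffer size.
    n = dist.get_world_size()
    gb = tensor.numel() * tensor.element_size() * 2 * (n - 1) / n / 1e9
    print(f"all_reduce: {ms:.3f} ms, ~{gb / (ms / 1e3):.2f} GB/s bus bandwidth")
```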