I am profiling a deep learning model; the framework is TensorFlow with NCCL.
Judging by the output of nvidia-smi, I am sure there is a lot of traffic on NVLink.
The ncclAllReduce calls should generate a lot of traffic.
However, I cannot see any traffic in NVVP, and the NVLink analysis is almost empty (screen capture attached).
Should transfers over NVLink show up in the timeline as memcpy[D2D]?
I profile the model with the following command:
mpiexec --allow-run-as-root --bind-to socket -np 2 -x CUDA_VISIBLE_DEVICES=0,1 numactl -N 0 -m 0 nvprof -f -o /dev/shm/lennox/timeline.%q{OMPI_COMM_WORLD_RANK}.nvprof python vgg.py --layers 16 -b 32 -u batch -i 200 --log_dir=/data/learning/tmp/ --data_dir=/data/learning/tf/models/research/inception/inception/data/ILSVRC2012/
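To make the output file names clear: nvprof replaces %q{OMPI_COMM_WORLD_RANK} with the value of that environment variable, so with -np 2 each rank should write its own timeline file. Below is a rough Python sketch of that expansion. The helper name expand_nvprof_template is hypothetical (it is not part of nvprof), and treating an unset variable as an empty string is my assumption, which may not match what nvprof does.

```python
import os
import re


def expand_nvprof_template(template, env=None):
    """Replace each %q{NAME} with the value of environment variable NAME.

    Hypothetical helper that only mimics nvprof's %q{...} substitution.
    Assumption: an unset variable expands to an empty string.
    """
    env = os.environ if env is None else env
    return re.sub(r"%q\{(\w+)\}", lambda m: env.get(m.group(1), ""), template)


template = "/dev/shm/lennox/timeline.%q{OMPI_COMM_WORLD_RANK}.nvprof"

# -np 2 launches ranks 0 and 1, each with its own OMPI_COMM_WORLD_RANK.
for rank in range(2):
    print(expand_nvprof_template(template, {"OMPI_COMM_WORLD_RANK": str(rank)}))
```

So I expect one file per rank, timeline.0.nvprof and timeline.1.nvprof, both of which I import into NVVP.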