Environment:
- Framework: TensorFlow
- Framework version: TF 1.4
- Horovod version: 0.18.2 (Horovod-in-Docker image)
- MPI version: 4.0.0
- CUDA version: 10.0
- NCCL version: 2.4.7-1
- Python version: 2.7
- OS and version: Ubuntu 18.04
- GCC version: 4.8
- Mellanox OFED version: 4.7.1
- GPUDirect RDMA: nvidia-peer-memory_1.0-8
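For reference, here is roughly how the GPUDirect RDMA module from the nvidia-peer-memory package can be verified on both nodes (illustrative commands; the module has to be loaded on the hosts themselves, since containers share the host kernel):

    # confirm the GPUDirect RDMA kernel module is loaded on each host
    lsmod | grep nv_peer_mem
    # load it if it is missing
    sudo modprobe nv_peer_mem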
Your question:
I am running the TF benchmarks in multi-node mode with the latest version of Horovod via Docker, but I am not seeing connections reported via NET/IB/0/GDRDMA (which would indicate that GPUDirect RDMA is enabled); see the trace log below.
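For context, the job is launched roughly like the following (hostnames, slot counts, and benchmark flags here are illustrative placeholders, not the exact command):

    # illustrative multi-node launch; NCCL_DEBUG=INFO produces the trace below
    mpirun -np 8 -H master_node:4,secondary_node:4 \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 \
        -x LD_LIBRARY_PATH -x PATH \
        -mca pml ob1 -mca btl ^openib \
        python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update horovod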
Trace log:
master_node:20:289 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:20:289 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:20:289 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:20:289 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
NCCL version 2.4.7+cuda10.0
master_node:22:295 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:22:295 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:21:290 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:23:288 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:22:295 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:21:290 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:23:288 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:43:309 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:44:311 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:41:312 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:42:310 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:41:312 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:22:295 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
secondary_node:43:309 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:44:311 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
master_node:20:289 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
master_node:23:288 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
master_node:21:290 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
master_node:22:295 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:44:311 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:43:309 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:41:312 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
secondary_node:42:310 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
secondary_node:41:312 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
secondary_node:44:311 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
secondary_node:42:310 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
secondary_node:43:309 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:22:295 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:23:288 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
master_node:21:290 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO Channel 00 : 0 1 3 6 4 5 7 2
master_node:20:289 [0] NCCL INFO Channel 01 : 0 1 3 6 4 5 7 2
master_node:22:295 [2] NCCL INFO Ring 00 : 7 → 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2[2] → 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 3 → 6 [receive] via NET/IB/0
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] → 3[3] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] → 1[1] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 00 : 3 → 6 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 00 : 3[3] → 1[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6[2] → 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] → 0[0] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] → 2[2] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] → 3[3] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 01 : 3 → 6 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] → 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] → 5[1] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7 → 2 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 6 → 2 [receive] via NET/IB/0
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] → 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] → 0[0] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7[3] → 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6 → 2 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] → 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] → 6[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 2 → 6 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2 → 6 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 7 → 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 2[2] → 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 3 → 6 [receive] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 01 : 3[3] → 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Trees [0] 0->1->3/-1/-1 [1] 0->1->3/-1/-1
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7 → 2 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Trees [0] 1->3->-1/-1/-1 [1] 1->3->-1/-1/-1
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] → 2[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6[2] → 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO comm 0x7f4d6839f060 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
master_node:23:288 [3] NCCL INFO comm 0x7f48503a3650 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Trees [0] 2->0->1/-1/-1 [1] 2->0->1/-1/-1
master_node:20:289 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] → 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] → 5[1] via P2P/IPC
master_node:20:289 [0] NCCL INFO comm 0x7f5450362840 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Ring 01 : 2 → 6 [send] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7[3] → 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 2 → 6 [receive] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Trees [0] 5->7->-1/-1/-1 [1] 5->7->-1/-1/-1
master_node:22:295 [2] NCCL INFO Ring 01 : 6 → 2 [receive] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] → 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] → 6[2] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO comm 0x7ff2c43f7c00 rank 7 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE