Issues with Horovod multi-node connection over InfiniBand + GPUDirect RDMA

Environment:

  1. Framework: TensorFlow
  2. Framework version: TF 1.14
  3. Horovod version: 0.18.2 via Horovod in docker
  4. MPI version: 4.0.0
  5. CUDA version: 10.0
  6. NCCL version: 2.4.7-1
  7. Python version: 2.7
  8. OS and version: Ubuntu 18.04
  9. GCC version: 4.8
  10. Mellanox OFED 4.7.1
  11. GPUDirect RDMA - nvidia-peer-memory_1.0-8 (see the verification sketch after this list)
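
For reference, this is roughly how I verified the GPUDirect RDMA stack on both hosts before launching. It is only a minimal sketch assuming the stock nvidia-peer-memory and Mellanox OFED packages, so the module and device names (nv_peer_mem, mlx5_0) may differ on other systems:

# Kernel module installed by the nvidia-peer-memory package (GPUDirect RDMA)
lsmod | grep nv_peer_mem

# Mellanox OFED version and IB device/port state
ofed_info -s
ibv_devinfo -d mlx5_0

# GPU <-> NIC topology; NCCL normally only enables GDRDMA when a GPU and the
# HCA are close (e.g. PIX/PXB/PHB), not at NODE or SYS distance
nvidia-smi topo -m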

Your question:

I am running the TF benchmarks in multi-node mode with the latest version of Horovod via Docker, but I am not seeing the connections reported via NET/IB/0/GDRDMA (which would indicate that GPUDirect RDMA is enabled). See the trace log below.
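
For context, the job is launched with the standard Horovod mpirun recipe, roughly as sketched below. Host names, slot counts, paths and benchmark options are placeholders rather than the literal command; only NCCL_DEBUG=INFO and NCCL_IB_DISABLE=0 correspond to what the log shows being set.

# Sketch of the 2-node / 8-GPU launch (placeholders, not the exact command)
mpirun -np 8 -H master_node:4,secondary_node:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 \
    -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 \
        --variable_update horovod
# (optionally the HCA can be pinned with -x NCCL_IB_HCA=mlx5_0)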

Trace log

master_node:20:289 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>

master_node:20:289 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

master_node:20:289 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.

master_node:20:289 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>

NCCL version 2.4.7+cuda10.0

master_node:22:295 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>

master_node:22:295 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

master_node:21:290 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>

master_node:23:288 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>

master_node:21:290 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

master_node:23:288 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

master_node:22:295 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.

master_node:21:290 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.

master_node:23:288 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.

secondary_node:44:311 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>

secondary_node:41:312 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>

secondary_node:42:310 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>

secondary_node:43:309 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>

secondary_node:42:310 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

secondary_node:43:309 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

secondary_node:44:311 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

secondary_node:41:312 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

secondary_node:43:309 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.

secondary_node:44:311 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.

secondary_node:42:310 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.

secondary_node:41:312 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.

master_node:22:295 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>

master_node:23:288 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>

master_node:21:290 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>

secondary_node:43:309 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>

secondary_node:44:311 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>

secondary_node:41:312 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>

secondary_node:42:310 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>

master_node:20:289 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555

master_node:23:288 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa

master_node:21:290 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555

master_node:22:295 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa

secondary_node:44:311 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa

secondary_node:43:309 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa

secondary_node:41:312 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555

secondary_node:42:310 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555

secondary_node:41:312 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS

secondary_node:44:311 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE

secondary_node:42:310 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS

secondary_node:43:309 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE

master_node:22:295 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE

master_node:23:288 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE

master_node:21:290 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS

master_node:20:289 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS

master_node:20:289 [0] NCCL INFO Channel 00 : 0 1 3 6 4 5 7 2

master_node:20:289 [0] NCCL INFO Channel 01 : 0 1 3 6 4 5 7 2

master_node:22:295 [2] NCCL INFO Ring 00 : 7 → 2 [receive] via NET/IB/0

master_node:22:295 [2] NCCL INFO Ring 00 : 2[2] → 0[0] via P2P/IPC

secondary_node:43:309 [2] NCCL INFO Ring 00 : 3 → 6 [receive] via NET/IB/0

master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] → 3[3] via P2P/IPC

master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] → 1[1] via P2P/IPC

master_node:23:288 [3] NCCL INFO Ring 00 : 3 → 6 [send] via NET/IB/0

master_node:23:288 [3] NCCL INFO Ring 00 : 3[3] → 1[1] via P2P/IPC

secondary_node:43:309 [2] NCCL INFO Ring 00 : 6[2] → 4[0] via P2P/IPC

master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] → 0[0] via P2P/IPC

master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] → 2[2] via P2P/IPC

master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] → 3[3] via P2P/IPC

master_node:23:288 [3] NCCL INFO Ring 01 : 3 → 6 [send] via NET/IB/0

secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] → 7[3] via P2P/IPC

secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] → 5[1] via P2P/IPC

secondary_node:44:311 [3] NCCL INFO Ring 00 : 7 → 2 [send] via NET/IB/0

master_node:22:295 [2] NCCL INFO Ring 00 : 6 → 2 [receive] via NET/IB/0

master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] → 1[1] via P2P/IPC

master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] → 0[0] via P2P/IPC

secondary_node:44:311 [3] NCCL INFO Ring 00 : 7[3] → 5[1] via P2P/IPC

secondary_node:43:309 [2] NCCL INFO Ring 00 : 6 → 2 [send] via NET/IB/0

secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] → 4[0] via P2P/IPC

secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] → 6[2] via P2P/IPC

secondary_node:43:309 [2] NCCL INFO Ring 00 : 2 → 6 [receive] via NET/IB/0

master_node:22:295 [2] NCCL INFO Ring 00 : 2 → 6 [send] via NET/IB/0

master_node:22:295 [2] NCCL INFO Ring 01 : 7 → 2 [receive] via NET/IB/0

master_node:22:295 [2] NCCL INFO Ring 01 : 2[2] → 0[0] via P2P/IPC

secondary_node:43:309 [2] NCCL INFO Ring 01 : 3 → 6 [receive] via NET/IB/0

master_node:23:288 [3] NCCL INFO Ring 01 : 3[3] → 1[1] via P2P/IPC

master_node:21:290 [1] NCCL INFO Trees [0] 0->1->3/-1/-1 [1] 0->1->3/-1/-1

secondary_node:44:311 [3] NCCL INFO Ring 01 : 7 → 2 [send] via NET/IB/0

master_node:23:288 [3] NCCL INFO Trees [0] 1->3->-1/-1/-1 [1] 1->3->-1/-1/-1

master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] → 2[2] via P2P/IPC

secondary_node:43:309 [2] NCCL INFO Ring 01 : 6[2] → 4[0] via P2P/IPC

master_node:21:290 [1] NCCL INFO comm 0x7f4d6839f060 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE

master_node:23:288 [3] NCCL INFO comm 0x7f48503a3650 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE

master_node:20:289 [0] NCCL INFO Trees [0] 2->0->1/-1/-1 [1] 2->0->1/-1/-1

master_node:20:289 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes

secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] → 7[3] via P2P/IPC

secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] → 5[1] via P2P/IPC

master_node:20:289 [0] NCCL INFO comm 0x7f5450362840 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE

master_node:22:295 [2] NCCL INFO Ring 01 : 2 → 6 [send] via NET/IB/0

secondary_node:44:311 [3] NCCL INFO Ring 01 : 7[3] → 5[1] via P2P/IPC

secondary_node:43:309 [2] NCCL INFO Ring 01 : 2 → 6 [receive] via NET/IB/0

secondary_node:44:311 [3] NCCL INFO Trees [0] 5->7->-1/-1/-1 [1] 5->7->-1/-1/-1

master_node:22:295 [2] NCCL INFO Ring 01 : 6 → 2 [receive] via NET/IB/0

secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] → 4[0] via P2P/IPC

secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] → 6[2] via P2P/IPC

secondary_node:44:311 [3] NCCL INFO comm 0x7ff2c43f7c00 rank 7 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE

This is a complex setup. Please open a support ticket by e-mailing support@mellanox.com. Thanks!