I built NVIDIA/nccl-tests against NCCL 2.1.15 and ran the tests on two 8x 1080 Ti nodes connected by InfiniBand.
The processes would all spin up, set the correct CUDA device, and then sit spinning at 100% GPU usage. With warnings turned on, I see this message:
<hostname>:<time> [0] misc/ibvwrap.cu:241 WARN Call to ibv_reg_mr failed
When I set NCCL_IB_CUDA_SUPPORT=0, the tests complete fine. The documentation for NCCL_IB_CUDA_SUPPORT says that NCCL enables GPU Direct RDMA if the topology permits it,
but NCCL does not detect that the 1080 Ti cards do not support GPU Direct RDMA, which does not seem right.
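
For anyone who wants to reproduce the workaround, here is a rough sketch of how the launch could be scripted. It only builds the environment and the command line and does not run anything. Everything except NCCL_IB_CUDA_SUPPORT is an assumption on my part: Open MPI's mpirun with its -x flag, NCCL_DEBUG=WARN as the way warnings were turned on, one process per GPU (16 total), and the ./build/all_reduce_perf binary path with typical nccl-tests size arguments. The helper names nccl_env and mpirun_command are hypothetical.

```python
import os
import shlex


def nccl_env(disable_gdr=True, debug="WARN", base=None):
    """Return a copy of the environment with the NCCL settings applied."""
    env = dict(os.environ if base is None else base)
    # Assumption: NCCL_DEBUG=WARN is how the ibv_reg_mr warning was surfaced.
    env["NCCL_DEBUG"] = debug
    if disable_gdr:
        # The workaround from the post: turn off GPU Direct RDMA over IB.
        env["NCCL_IB_CUDA_SUPPORT"] = "0"
    return env


def mpirun_command(binary, nprocs, env_names):
    """Build an Open MPI command line that forwards the named variables."""
    cmd = ["mpirun", "-np", str(nprocs)]
    for name in env_names:
        # Assumption: Open MPI, where -x exports a variable to every rank.
        cmd += ["-x", name]
    # Assumption: sweep 8 bytes to 128 MB, doubling each step, 1 GPU per process.
    cmd += [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", "1"]
    return cmd


if __name__ == "__main__":
    env = nccl_env()
    cmd = mpirun_command(
        "./build/all_reduce_perf",  # hypothetical build output path
        16,                         # assumed: 2 nodes x 8 GPUs, one rank each
        ["NCCL_DEBUG", "NCCL_IB_CUDA_SUPPORT"],
    )
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually launch: subprocess.run(cmd, env=env, check=True)
```

A hostfile or host list is left out on purpose, since it depends on the cluster.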