Install and Use vLLM for Inference on two Sparks does not work

What do you mean by setting interfaces to IB mode? For NCCL, you just need to set the correct interface in NCCL_SOCKET_IFNAME and the proper RoCE interface in NCCL_IB_HCA (and have NCCL_IB_DISABLE=0).
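As a rough sketch of those three settings, here is how I would build the environment in Python before launching a worker. The interface and device names (`enp1s0f0np0`, `rocep1s0f0`) are placeholders I made up, so substitute whatever your own system reports. `NCCL_DEBUG=INFO` is my own addition and not required; it just makes NCCL log which transport it picked.

```python
import os


def nccl_roce_env(socket_ifname: str, ib_hca: str, debug: bool = True) -> dict:
    """Return the NCCL environment variables discussed above.

    socket_ifname: network interface NCCL uses for its socket traffic.
    ib_hca:        RDMA (RoCE) device NCCL should use for the IB transport.
    """
    env = {
        "NCCL_SOCKET_IFNAME": socket_ifname,
        "NCCL_IB_HCA": ib_hca,
        "NCCL_IB_DISABLE": "0",  # 0 = leave the IB/RoCE transport enabled
    }
    if debug:
        # Optional (assumption: you want verbose logs to see the chosen transport).
        env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    # Placeholder names; replace with the ones on your machines.
    nccl_env = nccl_roce_env("enp1s0f0np0", "rocep1s0f0")
    launch_env = {**os.environ, **nccl_env}  # pass this as env= to subprocess.run
    for key in sorted(nccl_env):
        print(f"{key}={launch_env[key]}")
```

The same variables need to be set on both nodes, with the names that are valid on each node.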

The rest is making sure the necessary libraries are present. For example, NCCL might be installed but rdma-core not, in which case it won’t use RDMA/IB and will fall back to Ethernet, which is what was happening in my case inside the Docker container.
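A quick way to check for this from inside the container is something like the sketch below. It assumes that the library NCCL needs from rdma-core is libibverbs and that the kernel lists RDMA devices under `/sys/class/infiniband`; the function name is just something I picked for illustration.

```python
import ctypes.util
from pathlib import Path


def rdma_userspace_status(sysfs_root: str = "/sys/class/infiniband") -> dict:
    """Report whether libibverbs can be found and which RDMA devices exist."""
    # find_library returns None when the shared library is not installed.
    libibverbs = ctypes.util.find_library("ibverbs")
    root = Path(sysfs_root)
    # Each RDMA device known to the kernel shows up as an entry here.
    devices = sorted(p.name for p in root.iterdir()) if root.is_dir() else []
    return {"libibverbs": libibverbs, "devices": devices}


if __name__ == "__main__":
    status = rdma_userspace_status()
    print("libibverbs:", status["libibverbs"] or "NOT FOUND (install rdma-core)")
    print("RDMA devices:", status["devices"] or "none visible")
```

If the library is missing or no devices are listed, that would be consistent with the fallback I saw.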

Another thing is to make sure that the InfiniBand devices are visible inside Docker: you can pass /dev/infiniband (which he does) or just use a privileged container (which I did, because I wasn’t sure whether I needed to pass anything else in addition).
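The two options map to Docker flags roughly as in this sketch. The helper names are hypothetical, and I am only showing the flags relevant to device visibility, not a complete `docker run` command line. The second helper can be run inside the container to confirm that the device nodes actually made it in.

```python
from pathlib import Path


def docker_rdma_flags(privileged: bool = False) -> list:
    """Return the docker run flag that exposes the InfiniBand devices."""
    if privileged:
        # Broad option: the container sees all host devices.
        return ["--privileged"]
    # Narrow option: pass only the InfiniBand device directory.
    return ["--device=/dev/infiniband"]


def visible_ib_device_nodes(dev_dir: str = "/dev/infiniband") -> list:
    """List device nodes under dev_dir; empty if the directory is absent."""
    d = Path(dev_dir)
    return sorted(p.name for p in d.iterdir()) if d.is_dir() else []


if __name__ == "__main__":
    print("docker run", " ".join(docker_rdma_flags()), "...")
    print("device nodes here:", visible_ib_device_nodes() or "none")
```

The privileged route is simpler but grants far more than NCCL needs, so passing the device directory is the tighter choice if it turns out to be enough.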

If all this is taken care of, NCCL will be able to use IB.