NCCL2 across multiple nodes without MPI?

Is it possible to use NCCL2 for e.g., allreduce across multiple nodes over TCP/IP, without using MPI?

I’ve seen this capability mentioned, but can’t find any way to specify the address of other nodes in the NCCL docs, and the only examples (and Horovod) seem to be using MPI.

For our application we need to set up an NCCL communicator across multiple processes on separate EC2 machines, but we are not using MPI.

NCCL indeed does not seem to require/use MPI.

Abstracting this example out

Initialization of NCCL needs the following steps:

if (myRank == 0) ncclGetUniqueId(&id);          // rank 0 creates the id
<broadcast id among all the processes/threads>
ncclCommInitRank(&comm, nRanks, id, myRank);    // every rank passes the same id

MPI is just used as a way to broadcast the handle and to provide nRanks and myRank.

You could replace MPI here with whichever mechanism you wish.
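As a rough illustration of one such mechanism (a sketch under assumptions, not something NCCL itself provides), rank 0 could hand the raw ncclUniqueId bytes to every other rank over a plain TCP socket. The helper names serve_unique_id, fetch_unique_id and demo below are hypothetical, the id is a stand-in byte string rather than one produced by ncclGetUniqueId, and the 128-byte size is taken from NCCL_UNIQUE_ID_BYTES in nccl.h. It is written in Python with only the standard library so it can be run without NCCL installed; the same exchange could be done in C with ordinary sockets.

```python
import socket
import threading

# Size of the opaque ncclUniqueId struct (NCCL_UNIQUE_ID_BYTES in nccl.h).
NCCL_UNIQUE_ID_BYTES = 128


def recv_exact(conn, n):
    """Read exactly n bytes from a TCP connection (recv may return less)."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before sending the full id")
        buf.extend(chunk)
    return bytes(buf)


def serve_unique_id(listener, unique_id, n_ranks):
    """Rank 0 (hypothetical helper): send the id to each of the other ranks."""
    for _ in range(n_ranks - 1):
        conn, _addr = listener.accept()
        with conn:
            conn.sendall(unique_id)


def fetch_unique_id(host, port):
    """Ranks 1..n-1 (hypothetical helper): connect to rank 0 and read the id."""
    with socket.create_connection((host, port)) as conn:
        return recv_exact(conn, NCCL_UNIQUE_ID_BYTES)


def demo(n_ranks=3):
    """Run rank 0 and the other ranks on loopback to show the exchange."""
    # Stand-in for the bytes that ncclGetUniqueId would fill in on rank 0.
    unique_id = bytes(range(NCCL_UNIQUE_ID_BYTES))

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(n_ranks)
    port = listener.getsockname()[1]

    server = threading.Thread(
        target=serve_unique_id, args=(listener, unique_id, n_ranks)
    )
    server.start()
    # In a real job each of these calls would run in a separate process,
    # followed by ncclCommInitRank(&comm, nRanks, id, myRank) on every rank.
    received = [fetch_unique_id("127.0.0.1", port) for _ in range(n_ranks - 1)]
    server.join()
    listener.close()
    return unique_id, received
```

In an actual deployment the host and port of rank 0, as well as nRanks and myRank, would have to come from somewhere outside NCCL, for example command-line arguments or environment variables set by whatever launches the processes.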

Just curious, what mechanism do you intend to replace MPI with?

Hm, how does NCCL2 figure out the IP address of the other nodes? I am interested in using NCCL2 for cross-machine allreduce. The handle is just an opaque identifier, right?

We were thinking of using to coordinate the processes.

I tested broadcasting the ncclUniqueId by another mechanism, and it worked well without MPI. Thanks!

I have a similar problem to the one above: I do not want to use MPI to broadcast the ncclUniqueId. My context is that I am working on multi-node, multi-GPU deep learning, using NCCL2 to all-reduce the gradients without MPI.

I have three questions:

  1. Is broadcasting the ncclUniqueId over a UDP socket the best or most convenient way to do this?

  2. For multi-node NCCL, is it correct that we cannot use ncclCommInitAll instead of ncclCommInitRank?

  3. Instead of broadcasting the ncclUniqueId, can we initialize all the communicators on one node and send them to the other nodes?

Thanks a lot!

Yes, it is possible. For example, PyTorch does it with its torch.distributed TCPStore functionality.