My situation is that I am working on multi-node, multi-GPU deep learning, using NCCL2 to all-reduce the gradients without MPI.
Assuming I want to get rid of MPI entirely, I have three questions:
- Is broadcasting the ncclUniqueId over a UDP socket the best or most convenient way to do this?
- For multi-node NCCL, is it true that we cannot use ncclCommInitAll instead of ncclCommInitRank?
- Instead of broadcasting the ncclUniqueId, can we initialize all the communicators on one node and then send them to the different nodes?
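To make the first question concrete, here is a minimal sketch of what the bootstrap could look like without MPI. It simulates, in one process with threads, rank 0 serving an opaque 128-byte blob (standing in for the output of ncclGetUniqueId) to the other ranks over a socket. Note this sketch uses TCP rather than UDP, since a reliable stream avoids dropped or truncated datagrams; the host, port, and byte size are illustrative assumptions, not NCCL requirements.

```python
import socket
import threading

# ncclUniqueId is an opaque blob; 128 bytes here is a stand-in, not a spec value.
UNIQUE_ID_BYTES = 128

def serve_unique_id(unique_id, host, port, n_peers, ready):
    """Rank 0: accept one connection per peer and send the id blob."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(n_peers)
    ready.set()  # signal that peers may connect now
    for _ in range(n_peers):
        conn, _ = srv.accept()
        conn.sendall(unique_id)
        conn.close()
    srv.close()

def fetch_unique_id(host, port):
    """Non-root rank: connect to rank 0 and read the full id blob."""
    with socket.create_connection((host, port)) as s:
        buf = b""
        while len(buf) < UNIQUE_ID_BYTES:
            chunk = s.recv(UNIQUE_ID_BYTES - len(buf))
            if not chunk:
                raise ConnectionError("short read of unique id")
            buf += chunk
    return buf

if __name__ == "__main__":
    uid = bytes(range(128))  # stand-in for the bytes of ncclGetUniqueId()
    ready = threading.Event()
    t = threading.Thread(target=serve_unique_id,
                         args=(uid, "127.0.0.1", 50007, 1, ready))
    t.start()
    ready.wait()
    received = fetch_unique_id("127.0.0.1", 50007)
    t.join()
    assert received == uid
    print("unique id transferred:", len(received), "bytes")
```

In a real deployment, each rank would then pass the received bytes into its own ncclCommInitRank call; only the id travels over the socket, never the communicator itself.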
Thanks a lot!