I am working on multi-node, multi-GPU deep learning and would like to use NCCL 2 to all-reduce the gradients without MPI.
Assuming I want to drop MPI entirely, I have three questions:
Is broadcasting the ncclUniqueId over a UDP socket the best (or most convenient) way to bootstrap the communicators?
For multi-node NCCL, is it correct that ncclCommInitAll cannot be used and ncclCommInitRank is required instead?
Instead of broadcasting the ncclUniqueId, can we initialize all the communicators on one node and then send them to the other nodes?
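
To make the first question concrete, here is a rough sketch of what I mean by sending the ID over a socket. It is only an illustration under a few assumptions: the socket is an already-connected stream socket (TCP rather than UDP, so I do not have to handle lost datagrams), `unique_id_blob` is a stand-in I made up so the snippet compiles without nccl.h (the real ncclUniqueId is an opaque struct of NCCL_UNIQUE_ID_BYTES, i.e. 128, chars), and `send_all` / `recv_all` are hypothetical helper names, not NCCL functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Stand-in for ncclUniqueId (assumption: same layout, 128 opaque bytes). */
#define ID_BYTES 128
typedef struct { char internal[ID_BYTES]; } unique_id_blob;

/* Send exactly len bytes on a connected stream socket; 0 on success, -1 on error. */
int send_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted, retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes; 0 on success, -1 on error or early close. */
int recv_all(int fd, void *buf, size_t len) {
    char *p = buf;
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n == 0) return -1;             /* peer closed before the full ID arrived */
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted, retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The idea would be that rank 0 calls ncclGetUniqueId(&id) and then send_all(fd, &id, sizeof id) once per peer, every other rank calls recv_all(fd, &id, sizeof id), and then all ranks call ncclCommInitRank(&comm, nranks, id, rank) with the same id.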
Thanks a lot!