NCCL2 across multiple nodes without MPI?

ericliang · April 14, 2018, 4:32pm

Is it possible to use NCCL2 for e.g., allreduce across multiple nodes over TCP/IP, without using MPI?

I’ve seen this capability mentioned, but can’t find any way to specify the addresses of the other nodes in the NCCL docs, and the only examples (and Horovod) seem to use MPI.

For our application we need to set up an NCCL communicator across multiple processes on separate EC2 machines, but we are not using MPI.

gobucks · May 10, 2018, 5:17pm

NCCL indeed does not seem to require/use MPI.

Abstracting from this example in the NCCL developer guide:
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#onedevprothrd

Initialization of NCCL needs the following steps:

if (myRank == 0) ncclGetUniqueId(&id);
/* broadcast id among all the processes/threads */
ncclCommInitRank(&comm, nRanks, id, myRank);

MPI is just used as a way to broadcast the handle and to provide nRanks and myRank.

You could replace MPI here with whichever mechanism you wish.
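As an illustration of "whichever mechanism", here is a minimal sketch that hands the ID from rank 0 to the other ranks over plain TCP sockets. It is not taken from the NCCL docs. The helper names (listen_on, send_id_to_peers, connect_to, fetch_id) and the unique_id_t type are hypothetical; unique_id_t is only a stand-in for ncclUniqueId so the sketch compiles without NCCL. It assumes IPv4, a POSIX system, and that every rank already knows rank 0's address and port, plus its own myRank and nRanks, from whatever launches the processes.

```c
/* Sketch: pass a 128-byte opaque ID from rank 0 to the other ranks over TCP. */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for ncclUniqueId (128 opaque bytes in nccl.h). */
#define ID_BYTES 128
typedef struct { char internal[ID_BYTES]; } unique_id_t;

/* Send exactly len bytes; returns 0 on success, -1 on error. */
int send_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes; returns 0 on success, -1 on error or EOF. */
int recv_all(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Rank 0: open a listening socket (port 0 lets the OS pick one). */
int listen_on(uint16_t port, int backlog) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Rank 0: accept n_peers connections and send the ID to each of them. */
int send_id_to_peers(int listen_fd, int n_peers, const unique_id_t *id) {
    assert(id != NULL);
    for (int i = 0; i < n_peers; i++) {
        int c = accept(listen_fd, NULL, NULL);
        if (c < 0) return -1;
        int rc = send_all(c, id, sizeof *id);
        close(c);
        if (rc < 0) return -1;
    }
    return 0;
}

/* Other ranks: connect to rank 0 at a dotted-quad IPv4 address. */
int connect_to(const char *ip, uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Other ranks: connect to rank 0 and read the ID it sends. */
int fetch_id(const char *ip, uint16_t port, unique_id_t *id) {
    assert(id != NULL);
    int fd = connect_to(ip, port);
    if (fd < 0) return -1;
    int rc = recv_all(fd, id, sizeof *id);
    close(fd);
    return rc;
}
```

In a real program you would swap unique_id_t for ncclUniqueId. Rank 0 would call ncclGetUniqueId(&id) and then send_id_to_peers(listen_fd, nRanks - 1, &id); every other rank would call fetch_id with rank 0's address. After that, all ranks call ncclCommInitRank(&comm, nRanks, id, myRank) as in the steps above.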

Just curious, what mechanism do you intend to replace MPI with?

ericliang · May 11, 2018, 4:04am

Hm, how does NCCL2 figure out the IP addresses of the other nodes? I am interested in using NCCL2 for cross-machine allreduce. The handle is just an opaque identifier, right?

We were thinking of using Ray (ray-project/ray on GitHub) to coordinate the processes.

cloudict · September 17, 2018, 7:21am

I tested broadcasting the ncclUniqueId using another mechanism, and it worked well without MPI. Thanks!

chrishkchris · June 18, 2019, 3:30am

I have a similar problem to the one above: I do not want to use MPI to broadcast the ncclUniqueId. My context is that I am working on multi-node, multi-GPU deep learning and using NCCL2 to all-reduce the gradients without MPI.

I have three questions:

1. Is broadcasting the ncclUniqueId over a UDP socket the best or most convenient way to do this?
2. For multi-node NCCL, is it correct that we cannot use ncclCommInitAll instead of ncclCommInitRank?
3. Instead of broadcasting the ncclUniqueId, can we initialize all the communicators on one node and send them to the other nodes?

Thanks a lot!

asp77 · May 17, 2021, 1:55pm

Yes, it is possible. For example, PyTorch does it with its torch.distributed TCPStore functionality.

palok · January 27, 2025, 8:35am

@cloudict what was the other method? Some socket program?
