Can NCCL be used in a distributed environment, across machines?

I’m trying to use NCCL to accelerate inter-GPU reductions across machines.
I noticed that NCCL 2.x supports internode communication, as described below:

Key Features

Multi-gpu and multi-node communication collectives such as all-gather, all-reduce, broadcast, reduce, reduce-scatter
Automatic topology detection to determine optimal communication path
Optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnect
Support multi-threaded and multiprocess applications
Multiple ring formations for high bus utilization
Support for InfiniBand verbs, RoCE and IP Socket internode communication

But the API requires source and destination addresses (sendbuff and recvbuff):

ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
                           ncclDataType_t datatype, ncclRedOp_t op,
                           ncclComm_t comm, cudaStream_t stream);

2.3. Data Pointers
In general NCCL will accept any CUDA pointers that are accessible from the CUDA device associated to the communicator object. This includes:
device memory local to the CUDA device
host memory registered using CUDA SDK APIs cudaHostRegister or cudaGetDevicePointer
managed and unified memory
The only exception is device memory located on another device but accessible from the current device using peer access. NCCL will return an error in that case to avoid programming errors (only when NCCL_CHECK_POINTERS=1 since 2.2.12).
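For reference, here is a minimal sketch of the single-node pattern as I understand it: one process drives all local GPUs via ncclCommInitAll, so every sendbuff/recvbuff is a device pointer that this one process allocated itself. (The two-GPU count, element count, and missing error checks are just placeholders, not anything from the docs.)

#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  const int nDev = 2;              /* placeholder: two GPUs on one host */
  const size_t count = 1 << 20;    /* placeholder element count */
  int devs[2] = {0, 1};

  ncclComm_t comms[2];
  float* sendbuff[2];
  float* recvbuff[2];
  cudaStream_t streams[2];

  /* Allocate buffers on each local GPU; every pointer here belongs to
     this one process. Error checking omitted for brevity. */
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* One communicator per local GPU, all created by this single process. */
  ncclCommInitAll(comms, nDev, devs);

  /* All-reduce across the local GPUs; each call only sees local pointers. */
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < nDev; ++i) {
    cudaFree(sendbuff[i]);
    cudaFree(recvbuff[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}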

On a single node that’s easy: every sendbuff and recvbuff is device memory (or pinned host memory) that my own process allocated, so I just pass those pointers.
But how do I obtain sendbuff and recvbuff in a distributed environment, e.g. when the send buffer is on GPU1 of host1 and the receive buffer on GPU2 of host2?
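My best guess from the documentation is that every rank still only passes pointers to its own local GPU memory, and NCCL moves the data between hosts internally once the communicator is bootstrapped from a ncclUniqueId shared out of band. Below is a rough sketch of what I imagine that looks like, one process per GPU; using MPI to broadcast the id, the device selection, and the buffer size are all my assumptions, not something I found in the docs.

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* One MPI rank per GPU per host; NCCL's unique id is broadcast out of
   band with MPI (MPI here is my assumption -- any bootstrap channel
   would do). Each rank only passes pointers to its own local GPU
   memory. */
int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  cudaSetDevice(0);  /* placeholder: select this rank's local GPU here */

  /* Rank 0 creates the id; everyone else receives it over MPI. */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  /* Local device buffers only; no remote addresses anywhere. */
  const size_t count = 1 << 20;    /* placeholder element count */
  float *sendbuff, *recvbuff;
  cudaMalloc((void**)&sendbuff, count * sizeof(float));
  cudaMalloc((void**)&recvbuff, count * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  cudaFree(sendbuff);
  cudaFree(recvbuff);
  cudaStreamDestroy(stream);
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}

Is that the intended pattern, or is there some way to hand NCCL a sendbuff/recvbuff that physically lives on another host's GPU?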

Can anyone help?

Thanks.