Can NCCL be used in a distributed environment, across machines?

ycx · August 10, 2018, 5:58am

I’m trying to use NCCL to accelerate inter-GPU reduce operations across machines.
I noticed NCCL 2.x supports internode communication, like below:

Key Features

Multi-gpu and multi-node communication collectives such as all-gather, all-reduce, broadcast, reduce, reduce-scatter
Automatic topology detection to determine optimal communication path
Optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnect
Support multi-threaded and multiprocess applications
Multiple ring formations for high bus utilization
Support for InfiniBand verbs, RoCE and IP Socket internode communication

But the API requires source and destination buffer addresses:

ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);

2.3. Data Pointers
In general NCCL will accept any CUDA pointers that are accessible from the CUDA device associated to the communicator object. This includes:
device memory local to the CUDA device
host memory registered using CUDA SDK APIs cudaHostRegister or cudaGetDevicePointer
managed and unified memory
The only exception is device memory located on another device but accessible from the current device using peer access. NCCL will return an error in that case to avoid programming errors (only when NCCL_CHECK_POINTERS=1 since 2.2.12).

On a single machine it’s easy to get the memory pinned and use the pinned memory as the sendbuff and recvbuff.
But how do I obtain the sendbuff and recvbuff in a distributed environment? For example, when sendbuff is on GPU1 on host1 and recvbuff is on GPU2 on host2.

Can anyone help?

Thanks.
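
For what it's worth, ncclAllReduce is a collective call: every rank (one per GPU, possibly on different hosts) calls it with a sendbuff and recvbuff that are both local to its own GPU, so no rank ever passes a pointer that lives on another machine. As far as I understand, the ranks are tied together when the communicator is created: one rank calls ncclGetUniqueId, the id is shared out of band (MPI, sockets, a file), and each rank then calls ncclCommInitRank with it. Below is a minimal host-only sketch of the buffer semantics. It is not NCCL code; SimRank and sim_all_reduce_sum are hypothetical names made up for illustration, and the "ranks" are just entries in an array rather than separate processes.

```c
#include <assert.h>
#include <stddef.h>

/* One simulated rank. In real NCCL each rank would be a separate process
   (or thread) driving its own GPU, and both pointers would be device
   pointers local to that GPU. Here they are plain host arrays. */
typedef struct {
    const float *sendbuff; /* input, local to this rank */
    float *recvbuff;       /* output, local to this rank */
} SimRank;

/* Host-only model of an all-reduce with a sum operation: after the call,
   every rank's recvbuff holds the element-wise sum of all sendbuffs.
   This only illustrates the semantics, not how NCCL moves the data. */
void sim_all_reduce_sum(SimRank *ranks, int nranks, size_t count)
{
    assert(ranks != NULL && nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        float sum = 0.0f;
        /* Gather contribution i from every rank's local send buffer. */
        for (int r = 0; r < nranks; ++r)
            sum += ranks[r].sendbuff[i];
        /* Deliver the reduced value to every rank's local receive buffer. */
        for (int r = 0; r < nranks; ++r)
            ranks[r].recvbuff[i] = sum;
    }
}
```

With two simulated ranks whose send buffers are {1, 2} and {10, 20}, both receive buffers end up as {11, 22}. In the real multi-node case each process would only ever see its own pair of buffers, and the communicator (comm) is what tells NCCL which other ranks take part.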
