I’m trying to use NCCL to accelerate inter-GPU reductions across machines.
I noticed that NCCL 2.x supports inter-node communication, as its feature list says:
Multi-gpu and multi-node communication collectives such as all-gather, all-reduce, broadcast, reduce, reduce-scatter
Automatic topology detection to determine optimal communication path
Optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnect
Support multi-threaded and multiprocess applications
Multiple ring formations for high bus utilization
Support for InfiniBand verbs, RoCE and IP Socket internode communication
But the API requires source and destination addresses:
ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);
2.3. Data Pointers
In general NCCL will accept any CUDA pointers that are accessible from the CUDA device associated to the communicator object. This includes:
device memory local to the CUDA device
host memory registered using CUDA SDK APIs cudaHostRegister or cudaGetDevicePointer
managed and unified memory
The only exception is device memory located on another device but accessible from the current device using peer access. NCCL will return an error in that case to avoid programming errors (only when NCCL_CHECK_POINTERS=1 since 2.2.12).
On a single machine it’s easy to pin memory for a GPU and pass that pinned memory as sendbuff and recvbuff.
But how do I obtain sendbuff and recvbuff in a distributed environment, e.g. when sendbuff is on GPU1 on host1 and recvbuff is on GPU2 on host2?
Can anyone help?
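
A minimal sketch of one reading of the documentation quoted above, not a confirmed answer: in the usual one-process-per-GPU setup, every rank calls ncclAllReduce with pointers to its own local memory, and NCCL moves the data between nodes itself once the communicator has been created (ncclGetUniqueId on one rank, the id shared out of band, then ncclCommInitRank on every rank). Under that reading no rank ever passes a pointer that lives on another host. The plain C model below does not call NCCL or CUDA; RankBuffers and model_allreduce_sum are hypothetical names that only illustrate what each rank supplies and what it gets back.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_RANKS 2  /* assumption: one rank per GPU, e.g. GPU1 on host1 and GPU2 on host2 */
#define COUNT 4      /* assumption: number of elements per buffer */

/* Hypothetical stand-in for what one rank owns: only its local buffers.
 * With real NCCL these would be allocations local to that rank's GPU. */
typedef struct {
    float sendbuff[COUNT]; /* local input of this rank */
    float recvbuff[COUNT]; /* local output of this rank */
} RankBuffers;

/* Models the result of a sum all-reduce: afterwards every rank's recvbuff
 * holds the elementwise sum of all ranks' sendbuff. In real NCCL the
 * library performs this exchange over its transport; here it is a loop. */
void model_allreduce_sum(RankBuffers ranks[], size_t num_ranks, size_t count)
{
    assert(ranks != NULL);
    assert(count <= COUNT);

    for (size_t i = 0; i < count; ++i) {
        float sum = 0.0f;
        /* Gather the i-th element from every rank's local sendbuff. */
        for (size_t r = 0; r < num_ranks; ++r) {
            sum += ranks[r].sendbuff[i];
        }
        /* Deliver the reduced value into every rank's local recvbuff. */
        for (size_t r = 0; r < num_ranks; ++r) {
            ranks[r].recvbuff[i] = sum;
        }
    }
}
```

For example, with two ranks whose sendbuff arrays hold {1, 2, 3, 4} and {10, 20, 30, 40}, both recvbuff arrays hold {11, 22, 33, 44} after the call, and neither rank had to know the other's addresses.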