Lack of ncclGroupStart / End in nccl examples does not lead to deadlock

dariusz.sciebura · January 9, 2024, 9:50am

Hi,
I’ve been experimenting with the NCCL examples published in the official NCCL documentation (NVIDIA Collective Communication Library (NCCL) Documentation, version 2.19.3).
The part I want to ask about is the following:

  //calling NCCL communication API. Group API is required when using
  //multiple devices per thread
  NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < nDev; ++i)
    NCCLCHECK(ncclAllReduce((const void*)sendbuff[i], (void*)recvbuff[i], size, ncclFloat, ncclSum,
        comms[i], s[i]));
  NCCLCHECK(ncclGroupEnd());

The comment says the Group API is required in this case, since the example uses 4 devices by default. Luckily, I have access to exactly such a setup and was able to run the example on it. It turns out that it works reliably whether I keep the ncclGroupStart / ncclGroupEnd pair in place or remove it completely. I tried to break it and provoke the expected deadlock by changing the number of devices used in examples #1 and #3, but the deadlock just does not appear.
Could you please comment on this? Does this experiment just show that a deadlock is possible but very rare in this case, or that it cannot occur here at all, in which case the comment is wrong?
