Hi,
I've been experimenting with the NCCL examples published in the official NCCL documentation (NVIDIA Collective Communication Library (NCCL) Documentation, NCCL 2.19.3).
The part I want to ask about is the following:
//calling NCCL communication API. Group API is required when using
//multiple devices per thread
NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; ++i)
  NCCLCHECK(ncclAllReduce((const void*)sendbuff[i], (void*)recvbuff[i], size, ncclFloat, ncclSum,
      comms[i], s[i]));
NCCLCHECK(ncclGroupEnd());
The comment says the Group API is required in this case, since the example drives multiple devices (4 by default) from a single thread. Luckily, I have access to exactly this kind of setup and was able to run the example on it. It turns out that it works reliably whether I keep the ncclGroupStart/ncclGroupEnd pair in place or remove it completely. I tried to provoke the expected deadlock by changing the number of devices used in examples #1 and #3, but it never appears.
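For reference, the experiment can be sketched as a single program with a compile-time switch. This is a minimal sketch modelled on the single-process, single-thread example from the documentation, not the exact code I ran: the `USE_GROUP` macro is a hypothetical name introduced here for illustration, and the sketch assumes that every visible GPU takes part and that the buffers are simply zero-filled.

```cuda
// Sketch of the experiment: one thread drives all visible GPUs.
// USE_GROUP is a hypothetical switch (not part of NCCL) that toggles
// the ncclGroupStart/ncclGroupEnd pair around the all-reduce calls.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#ifndef USE_GROUP
#define USE_GROUP 1
#endif

#define CUDACHECK(cmd) do {                                   \
  cudaError_t e = (cmd);                                      \
  if (e != cudaSuccess) {                                     \
    printf("CUDA error %s:%d '%s'\n", __FILE__, __LINE__,     \
           cudaGetErrorString(e));                            \
    exit(EXIT_FAILURE);                                       \
  }                                                           \
} while (0)

#define NCCLCHECK(cmd) do {                                   \
  ncclResult_t r = (cmd);                                     \
  if (r != ncclSuccess) {                                     \
    printf("NCCL error %s:%d '%s'\n", __FILE__, __LINE__,     \
           ncclGetErrorString(r));                            \
    exit(EXIT_FAILURE);                                       \
  }                                                           \
} while (0)

int main(void) {
  int nDev = 0;
  CUDACHECK(cudaGetDeviceCount(&nDev));
  if (nDev < 1) { printf("no CUDA devices found\n"); return EXIT_FAILURE; }

  const size_t size = 32 * 1024 * 1024;  // element count per buffer
  ncclComm_t* comms = (ncclComm_t*)malloc(nDev * sizeof(ncclComm_t));
  float** sendbuff = (float**)malloc(nDev * sizeof(float*));
  float** recvbuff = (float**)malloc(nDev * sizeof(float*));
  cudaStream_t* s = (cudaStream_t*)malloc(nDev * sizeof(cudaStream_t));

  // Allocate zero-filled buffers and one stream per device.
  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(i));
    CUDACHECK(cudaMalloc((void**)&sendbuff[i], size * sizeof(float)));
    CUDACHECK(cudaMalloc((void**)&recvbuff[i], size * sizeof(float)));
    CUDACHECK(cudaMemset(sendbuff[i], 0, size * sizeof(float)));
    CUDACHECK(cudaMemset(recvbuff[i], 0, size * sizeof(float)));
    CUDACHECK(cudaStreamCreate(&s[i]));
  }

  // A NULL device list selects the first nDev devices.
  NCCLCHECK(ncclCommInitAll(comms, nDev, NULL));

#if USE_GROUP
  NCCLCHECK(ncclGroupStart());
#endif
  for (int i = 0; i < nDev; ++i)
    NCCLCHECK(ncclAllReduce((const void*)sendbuff[i], (void*)recvbuff[i],
                            size, ncclFloat, ncclSum, comms[i], s[i]));
#if USE_GROUP
  NCCLCHECK(ncclGroupEnd());
#endif

  // Wait for the collectives to finish on every device.
  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(i));
    CUDACHECK(cudaStreamSynchronize(s[i]));
  }

  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(i));
    CUDACHECK(cudaFree(sendbuff[i]));
    CUDACHECK(cudaFree(recvbuff[i]));
    CUDACHECK(cudaStreamDestroy(s[i]));
    NCCLCHECK(ncclCommDestroy(comms[i]));
  }
  free(comms); free(sendbuff); free(recvbuff); free(s);

  printf("all-reduce finished on %d device(s), USE_GROUP=%d\n", nDev, USE_GROUP);
  return 0;
}
```

Building it once with `-DUSE_GROUP=1` and once with `-DUSE_GROUP=0` (for example with `nvcc`, linking against `-lnccl`) gives the two variants being compared; the file name and build flags are up to the reader.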
Could you please comment on this? Does this experiment just show that the deadlock is possible but very rare in this case, or can it not occur here at all, meaning the comment is wrong?