Omitting ncclGroupStart / ncclGroupEnd in NCCL examples does not lead to deadlock

Hi,
I’ve been experimenting with the NCCL examples published in the official NCCL documentation (NVIDIA Collective Communication Library (NCCL) Documentation, NCCL 2.19.3 documentation).
The part I want to ask about is the following:

  //calling NCCL communication API. Group API is required when using
  //multiple devices per thread
  NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < nDev; ++i)
    NCCLCHECK(ncclAllReduce((const void*)sendbuff[i], (void*)recvbuff[i], size, ncclFloat, ncclSum,
        comms[i], s[i]));
  NCCLCHECK(ncclGroupEnd());

The comment says the group API is required in this case, since 4 devices are used (by default). Luckily, I have access to exactly this setup and was able to run the example on it. It turns out that it works reliably whether I keep the ncclGroupStart / ncclGroupEnd pair in place or remove it completely. I tried to break it and provoke the expected deadlock by changing the number of devices used in examples #1 and #3, but the deadlock never appears.
Could you please comment on this? Does this experiment just show that the deadlock is possible but very rare in this case, or can it simply not occur here, which would mean the comment is wrong?