Ok, now it’s fine.
Meanwhile I started having a look at the NCCL2 documentation.
With NCCL1 (the GitHub version) I'm actually hitting the concurrency problem mentioned in that documentation (I've pasted the whole paragraph “Concurrency between NCCL and CUDA calls” below).
Do you have some sample code, or a more detailed post, where I can have a look at the proposed workaround?
The problem is really quite annoying and I need to find THE solution to it.
At the moment I'm using a CPU barrier (an MPI concept) to protect entry into the NCCL (“Nickel”) AllReduce.
But this workaround wastes time, since the CPU threads end up tightly synchronized as well.
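Concretely, my current workaround looks roughly like this (a minimal sketch; `comm`, `stream` and the buffer names stand in for my real ones):

```cpp
// Minimal sketch of the CPU-barrier workaround: every MPI rank blocks on
// the barrier, so all ranks enqueue the NCCL kernel together and no other
// CUDA call can slip in between the launches on different devices.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void all_reduce_with_barrier(const float* sendbuf, float* recvbuf,
                             size_t count, ncclComm_t comm,
                             cudaStream_t stream) {
  MPI_Barrier(MPI_COMM_WORLD);  // CPU-side synchronization point
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
}
```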
I'll wait for your comments on that.
Concurrency between NCCL and CUDA calls
NCCL uses CUDA kernels to perform inter-GPU communication. The NCCL kernels synchronize with each other; therefore, each kernel requires the kernels on the other GPUs to also be executed in order to complete. The application should therefore make sure that nothing prevents the NCCL kernels from being executed concurrently on the different devices of a NCCL communicator.
For example, let’s say you have a process that manages multiple CUDA devices and also features a thread which calls CUDA functions asynchronously. In this case, CUDA calls could be executed between the enqueuing of two NCCL kernels. The CUDA call may wait for the first NCCL kernel to complete and prevent the second one from being launched, causing a deadlock, since the first kernel will not complete until the second one is executed. To avoid this issue, one solution is to have a lock around the NCCL launch on multiple devices (around ncclGroupStart and ncclGroupEnd when using a single thread, or around the NCCL launch when using multiple threads, using thread synchronization if necessary) and to take this lock when calling CUDA from the asynchronous thread.
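If I read that correctly, the proposed lock would look something like the sketch below. This is only my interpretation, assuming the NCCL2 group API; `nccl_mutex`, `comms`, `streams` and the buffer names are hypothetical:

```cpp
// Sketch of the documented workaround as I understand it: one mutex
// serializes the grouped NCCL launch against any CUDA call coming from
// the asynchronous thread, so nothing can run between two NCCL launches.
#include <mutex>
#include <nccl.h>
#include <cuda_runtime.h>

std::mutex nccl_mutex;  // shared with the asynchronous thread

void launch_all_reduce(int nDev, float** sendbuf, float** recvbuf,
                       size_t count, ncclComm_t* comms,
                       cudaStream_t* streams) {
  std::lock_guard<std::mutex> guard(nccl_mutex);
  // All per-device NCCL kernels are enqueued back-to-back inside the group.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();
}

void async_thread_work(float* devPtr, size_t bytes) {
  // The asynchronous thread takes the same lock before any CUDA call,
  // so it can never be scheduled between two NCCL kernel launches.
  std::lock_guard<std::mutex> guard(nccl_mutex);
  cudaMemsetAsync(devPtr, 0, bytes, 0);
}
```

Is this the intended pattern, or is there a reference implementation somewhere I could compare against?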