Say we have 4 GPUs, each driven by its own process. I want to perform a two-step communication:
(1) reduce the data on GPUs 0, 1, and 2 into GPU 0;
(2) broadcast the data on GPU 0 to GPU 3.
I think I need to call ncclCommInitRank twice. The first time, GPUs 0, 1, and 2 form one group, and the ncclUniqueId of GPU 0 is broadcast to GPUs 1 and 2. The second time, GPU 0 and GPU 3 form another group, and the ncclUniqueId of GPU 0 is broadcast to GPU 3.
I wonder whether this can work. My questions are:
(1) Can we create two (or more) communicators with different ‘nranks’ and ‘rank’ values within a single process (and on a single GPU)?
(2) In the two subgroups (0, 1, 2) and (0, 3), all the GPUs know the same ncclUniqueId (GPU 0’s). So in the reduce step, will GPU 3 be wrongly pulled into the communication and cause a deadlock?
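
To make the setup concrete, here is a minimal sketch of the bookkeeping I have in mind: which (nranks, rank) pair each process would pass for each communicator. The helpers `local_rank` and `print_plan` and the names `idA`, `idB`, `commA`, and `commB` are hypothetical, made up for illustration. The NCCL calls appear only as comments, since running them needs real GPUs and a launcher. The sketch also assumes one ncclGetUniqueId call per communicator, which is an assumption on my part and not something I have verified.

```c
#include <assert.h>
#include <stdio.h>

/* Global ranks (one process per GPU) that take part in each communicator. */
const int GROUP_A[] = {0, 1, 2}; /* step 1: reduce into GPU 0 */
const int GROUP_B[] = {0, 3};    /* step 2: broadcast from GPU 0 to GPU 3 */
enum { NRANKS_A = 3, NRANKS_B = 2, WORLD_SIZE = 4 };

/* Hypothetical helper: rank of `global_rank` inside a subgroup,
   or -1 if that process is not a member of the subgroup. */
int local_rank(const int *members, int nranks, int global_rank) {
    assert(members != NULL && nranks > 0);
    for (int i = 0; i < nranks; ++i) {
        if (members[i] == global_rank) {
            return i;
        }
    }
    return -1;
}

/* Hypothetical helper: prints, for every process, the arguments it would
   pass to ncclCommInitRank for each communicator it belongs to. */
void print_plan(void) {
    for (int g = 0; g < WORLD_SIZE; ++g) {
        int ra = local_rank(GROUP_A, NRANKS_A, g);
        int rb = local_rank(GROUP_B, NRANKS_B, g);
        if (ra >= 0) {
            /* Assumption: idA comes from its own ncclGetUniqueId(&idA) call
               and is sent only to GPUs 0, 1, 2.
               ncclCommInitRank(&commA, NRANKS_A, idA, ra);
               ncclReduce(send, recv, count, ncclFloat, ncclSum, 0, commA, stream); */
            printf("GPU %d: commA nranks=%d rank=%d\n", g, NRANKS_A, ra);
        }
        if (rb >= 0) {
            /* Assumption: idB comes from a second ncclGetUniqueId(&idB) call
               and is sent only to GPUs 0 and 3.
               ncclCommInitRank(&commB, NRANKS_B, idB, rb);
               ncclBroadcast(send, recv, count, ncclFloat, 0, commB, stream); */
            printf("GPU %d: commB nranks=%d rank=%d\n", g, NRANKS_B, rb);
        }
    }
}
```

In this sketch GPU 0 appears in both groups with a different nranks each time, and GPU 3 gets no entry for the first communicator, so it would never call the reduce. The root argument in both commented collectives is 0 because GPU 0 is local rank 0 in each group.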