ncclAllReduce hangs

I have a call to ncclAllReduce that hangs.

I have a program that is generated by a custom compiler for a custom language that I wrote. This system supports running multiple GPUs on multiple nodes. The minimal example where it hangs has 3 nodes, with 1 process per node, 1 GPU per process, and 1 thread per process. If I run with 2 nodes (any pair of the 3 nodes), it does not hang.

The program does exactly the same computation on all 3 nodes. It takes the same branches and issues the same CUDA calls. I have run it in the past on a facility that has 6 nodes with 8 processes, 8 GPUs per node and it ran fine. That facility had Titan Vs and ran Debian stretch. I no longer have that facility and can’t run on it now to diagnose. My current facility has 2 DGX H100 and 1 SMC 8x H100 SXM5 all running Debian Bookworm with CUDA 12.0 and NCCL 2.19.3. My code runs under mpirun.

Upon startup, my code does:
ncclGetUniqueId()
on just rank 0 and broadcasts the id to all 3 processes with OpenMPI. I have checked and all 3 nodes correctly get the id. Then each node does:
ncclCommInitRank()
I have done this 3 different ways. The first is just calling
ncclCommInitRank()
The second is with:
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 1;
ncclCommInitRankConfig(nccl_comm);
The third is with:
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 0;
ncclCommInitRankConfig(nccl_comm);
ncclResult_t state;
do {
  ncclCommGetAsyncError(nccl_comm, &state);
} while (state!=ncclSuccess);

I presume that the first two create a blocking communicator and the third creates a nonblocking communicator. I actually don’t need a nonblocking communicator and prefer a blocking communicator.
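
A note on the polling loop used with the nonblocking communicator: as written, it spins until the state equals ncclSuccess, so it would never exit if ncclCommGetAsyncError() ever reported an error state. A minimal sketch of the alternative, which stops as soon as the state is no longer "in progress", is below. It is plain C with stand-in names (status_t, poll_fn, wait_for_completion, fake_op, and fake_poll are all hypothetical and not part of NCCL); in real code the callback would wrap ncclCommGetAsyncError() and the in-progress value would be ncclInProgress. The max_polls limit is also an assumption, added so that a stuck operation shows up as a result instead of an endless loop.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in status codes for illustration only. In real code these would be
 * ncclSuccess, ncclInProgress, and the NCCL error values from nccl.h. */
typedef enum { ST_SUCCESS = 0, ST_ERROR = 1, ST_IN_PROGRESS = 2 } status_t;

/* Hypothetical callback: returns the current asynchronous state.
 * A real version would call ncclCommGetAsyncError(comm, &state). */
typedef status_t (*poll_fn)(void *ctx);

/* Poll until the state leaves "in progress" or max_polls is reached.
 * Returns the last state observed, so an error ends the loop too. */
static status_t wait_for_completion(poll_fn poll, void *ctx, long max_polls)
{
    assert(poll != NULL);
    status_t state = ST_IN_PROGRESS;
    for (long i = 0; i < max_polls && state == ST_IN_PROGRESS; ++i) {
        state = poll(ctx);
    }
    return state;
}

/* Fake operation for demonstration: reports "in progress" for
 * `remaining` polls, then reports `final_state`. */
typedef struct {
    int remaining;
    status_t final_state;
} fake_op;

static status_t fake_poll(void *ctx)
{
    fake_op *op = (fake_op *)ctx;
    if (op->remaining > 0) {
        op->remaining--;
        return ST_IN_PROGRESS;
    }
    return op->final_state;
}
```

With a fake_op of {2, ST_ERROR}, wait_for_completion() returns ST_ERROR after three polls, whereas a loop that waits only for success would keep spinning.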

I have a single stream which I create (on all 3 nodes) with:

cudaStreamCreate(&foreground_stream);

After this, the only other NCCL calls I make are to:

ncclAllReduce() with ncclSum and ncclFloat

For debugging purposes I do:

printf("A %d\n", my_rank);
cudaStreamSynchronize(foreground_stream);
printf("B %d\n", my_rank);
ncclAllReduce();
printf("C %d\n", my_rank);
ncclCommFinalize(nccl_comm);
printf("D %d\n", my_rank);
// begin this part only done with nonblocking communicators
ncclResult_t state;
do {
  ncclCommGetAsyncError(nccl_comm, &state);
} while (state!=ncclSuccess);
// end this part only done with nonblocking communicators
printf("E %d\n", my_rank);
cudaStreamSynchronize(foreground_stream);
printf("F %d\n", my_rank);
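
One caveat about reading a trace like this: when stdout is not a terminal (for example when it is piped, as is typically the case under mpirun), C stdio normally buffers it fully, so letters printed shortly before a hang may never appear. Whether that affects the output reported here has not been checked. A minimal sketch of a trace helper that flushes after every line, and also records the call number, is below; format_trace and trace are hypothetical names, and the "rank=… call=…" line format is just one possible choice.

```c
#include <assert.h>
#include <stdio.h>

/* Format one trace line such as "B rank=1 call=19\n".
 * Returns the number of characters written (excluding the NUL),
 * or -1 if the buffer is too small. */
static int format_trace(char *buf, size_t size, char label, int rank, int call)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, "%c rank=%d call=%d\n", label, rank, call);
    return (n >= 0 && (size_t)n < size) ? n : -1;
}

/* Write the trace line and flush immediately, so the line is not left
 * sitting in a stdio buffer if the process hangs afterwards. */
static int trace(FILE *out, char label, int rank, int call)
{
    char line[64];
    int n = format_trace(line, sizeof line, label, rank, call);
    if (n < 0) {
        return -1;
    }
    if (fputs(line, out) == EOF) {
        return -1;
    }
    return fflush(out) == 0 ? n : -1;
}
```

Each printf in the sequence above would then become a call such as trace(stdout, 'B', my_rank, call_index), where call_index is a counter incremented once per ncclAllReduce call (also an assumed name).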

In all cases, it prints A, B, C, D, and E on all 3 nodes for the first 18 calls and then on the 19th, it prints A, B, C, and D on all 3 nodes and then hangs.

The same thing happens with all 3 ways I create the communicator.

First, I observe that ncclCommFinalize() returns ncclInvalidArgument (4). I don’t know why. No other call gives that error with nccl_comm, and I only pass it to ncclCommInitRank(), ncclCommInitRankConfig(), ncclCommGetAsyncError(), and ncclAllReduce(). So I removed it. But I get the same results both with it and without it.
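
Since one call here returned a nonzero code, a sketch of how every return value could be checked in one place may be useful. This is plain C with stand-in functions: CHECK_RC, step_ok, step_invalid_argument, and run_steps are hypothetical names, and the calls are modeled as functions returning int with 0 meaning success. In real code the wrapped expressions would be the NCCL calls, and ncclGetErrorString() could be used to print the error name alongside the number.

```c
#include <assert.h>
#include <stdio.h>

/* Evaluate an int-returning call once. If the result is nonzero, report
 * the call text, source location, and numeric code on stderr, then
 * return that code from the enclosing function. */
#define CHECK_RC(call)                                        \
    do {                                                      \
        int rc_ = (call);                                     \
        if (rc_ != 0) {                                       \
            fprintf(stderr, "%s:%d: %s returned %d\n",        \
                    __FILE__, __LINE__, #call, rc_);          \
            return rc_;                                       \
        }                                                     \
    } while (0)

/* Hypothetical stand-ins for library calls: 0 means success. */
static int step_ok(void) { return 0; }
static int step_invalid_argument(void) { return 4; }

/* Run two checked steps; *steps_done counts how many succeeded. */
static int run_steps(int fail_second, int *steps_done)
{
    assert(steps_done != NULL);
    *steps_done = 0;
    CHECK_RC(step_ok());
    ++*steps_done;
    CHECK_RC(fail_second ? step_invalid_argument() : step_ok());
    ++*steps_done;
    return 0;
}
```

Calling run_steps(1, &done) returns 4 with done equal to 1, and stderr shows which expression failed and on which line.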

Second, if I remove the calls to cudaStreamSynchronize before and after the call to ncclAllReduce, and use blocking communicators, no synchronization happens. Two nodes print 60 copies of A, B, C, D, and E then hang. One node prints 32 copies of A, B, C, D, and E, then prints one copy of A, B, C, and D then hangs.

I have done this both with NCCL_IB_DISABLE unset and set to 1 with the same behavior.

I have done this with both 2.18.5 and 2.19.3 with the same behavior.

Any help would be appreciated.

Note that I have also run all of this with

foreground_stream = NULL;

to get the default stream, which is supposed to be blocking, and I get the same results.