tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce()) doesn't work with more than 2 GPUs

Hello, I am trying to train a model using tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce()) with 8 GPUs. However, when I call model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks), it hangs and training never starts. Training works fine with up to 2 GPUs.

Here are the logs from the NCCL tests:

make[1]: Entering directory '/ssd-data/Coefficient/nccl-tests/src'
Compiling  timer.cc                            > /ssd-data/Coefficient/nccl-tests/build/timer.o
Compiling /ssd-data/Coefficient/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /ssd-data/Coefficient/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /ssd-data/Coefficient/nccl-tests/build/common.o
Linking  /ssd-data/Coefficient/nccl-tests/build/all_reduce.o > /ssd-data/Coefficient/nccl-tests/build/all_reduce_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  all_gather.cu                       > /ssd-data/Coefficient/nccl-tests/build/all_gather.o
Linking  /ssd-data/Coefficient/nccl-tests/build/all_gather.o > /ssd-data/Coefficient/nccl-tests/build/all_gather_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  broadcast.cu                        > /ssd-data/Coefficient/nccl-tests/build/broadcast.o
Linking  /ssd-data/Coefficient/nccl-tests/build/broadcast.o > /ssd-data/Coefficient/nccl-tests/build/broadcast_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  reduce_scatter.cu                   > /ssd-data/Coefficient/nccl-tests/build/reduce_scatter.o
Linking  /ssd-data/Coefficient/nccl-tests/build/reduce_scatter.o > /ssd-data/Coefficient/nccl-tests/build/reduce_scatter_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  reduce.cu                           > /ssd-data/Coefficient/nccl-tests/build/reduce.o
Linking  /ssd-data/Coefficient/nccl-tests/build/reduce.o > /ssd-data/Coefficient/nccl-tests/build/reduce_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  alltoall.cu                         > /ssd-data/Coefficient/nccl-tests/build/alltoall.o
Linking  /ssd-data/Coefficient/nccl-tests/build/alltoall.o > /ssd-data/Coefficient/nccl-tests/build/alltoall_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  scatter.cu                          > /ssd-data/Coefficient/nccl-tests/build/scatter.o
Linking  /ssd-data/Coefficient/nccl-tests/build/scatter.o > /ssd-data/Coefficient/nccl-tests/build/scatter_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  gather.cu                           > /ssd-data/Coefficient/nccl-tests/build/gather.o
Linking  /ssd-data/Coefficient/nccl-tests/build/gather.o > /ssd-data/Coefficient/nccl-tests/build/gather_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  sendrecv.cu                         > /ssd-data/Coefficient/nccl-tests/build/sendrecv.o
Linking  /ssd-data/Coefficient/nccl-tests/build/sendrecv.o > /ssd-data/Coefficient/nccl-tests/build/sendrecv_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  hypercube.cu                        > /ssd-data/Coefficient/nccl-tests/build/hypercube.o
Linking  /ssd-data/Coefficient/nccl-tests/build/hypercube.o > /ssd-data/Coefficient/nccl-tests/build/hypercube_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
make[1]: Leaving directory '/ssd-data/Coefficient/nccl-tests/src'

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  32984 on hiticollab device  0 [0x01] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid  32984 on hiticollab device  1 [0x25] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid  32984 on hiticollab device  2 [0x41] NVIDIA RTX A6000
hiticollab:32984:32984 [32607] NCCL INFO Bootstrap : Using eno1:170.140.29.26<0>
hiticollab:32984:32984 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
hiticollab:32984:32984 [32765] NCCL INFO NET/Plugin : No plugin found, using internal implementation
hiticollab:32984:32984 [710] NCCL INFO cudaDriverVersion 11080

hiticollab:32984:32984 [32607] init.cc:1662 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
hiticollab: Test NCCL failure common.cu:954 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. hiticollab pid 32984: Test failure common.cu:844

System information

  • Linux Ubuntu 20.04
  • TensorFlow 2.13.1
  • Python 3.8.10

For my configuration, I followed the recommended TensorFlow compatibility table.

Based on the NCCL test logs, there seems to be a problem with my installation. I would appreciate your help.
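
Two lines in the logs above can be checked against each other: the linker warning says the system libnccl.so needs libcudart.so.12, while NCCL reports cudaDriverVersion 11080. CUDA encodes versions as 1000*major + 10*minor, so 11080 means the driver supports up to CUDA 11.8, which is consistent with the 'CUDA driver version is insufficient for CUDA runtime version' failure. Below is a minimal sketch of that comparison. The function names are made up for illustration, the runtime value 12000 is an assumption (the log only shows "some CUDA 12.x"), and the rule is deliberately conservative: it ignores CUDA minor-version compatibility and forward-compatibility packages.

```python
def decode_cuda_version(code):
    """Split CUDA's integer version (1000*major + 10*minor) into (major, minor)."""
    return code // 1000, (code % 1000) // 10


def driver_supports_runtime(driver_code, runtime_code):
    """Conservative check: the driver's CUDA version must be >= the runtime's.

    Simplification: ignores minor-version compatibility and
    forward-compatibility packages.
    """
    return decode_cuda_version(driver_code) >= decode_cuda_version(runtime_code)


driver = 11080   # from "NCCL INFO cudaDriverVersion 11080" in the log
runtime = 12000  # assumption: a CUDA 12.x runtime, inferred from libcudart.so.12

print("driver supports CUDA", decode_cuda_version(driver))
print("runtime needs CUDA  ", decode_cuda_version(runtime))
print("compatible:", driver_supports_runtime(driver, runtime))
```

If the check comes out as incompatible, the usual options are to build nccl-tests against an NCCL that matches the CUDA 11.x runtime, or to install a driver new enough for CUDA 12. Which of the two fits this machine is not something the logs alone can settle.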

Did you happen to find a solution? I’m facing a very similar issue with:

  • Linux Ubuntu 20.04
  • PyTorch 2.3.1
  • CUDA 12.4
  • Python 3.8.10

I double-checked, and the NCCL all-reduce test runs with 2 GPUs but not with 3 or more.
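
To pin down the GPU count at which the test stops working, the all_reduce_perf command from the first post can be run in a loop. The sketch below reuses the same flags (-b 8 -e 128M -f 2 -g N) with NCCL_DEBUG=INFO. The helper names, the default binary path, and the timeout are assumptions for illustration, not part of nccl-tests.

```python
import os
import subprocess


def all_reduce_cmd(n_gpus, binary="./build/all_reduce_perf"):
    """Build the same all_reduce_perf command line used in the thread."""
    return [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(n_gpus)]


def sweep(max_gpus, binary="./build/all_reduce_perf", extra_env=None, timeout_s=300):
    """Run the test for 2..max_gpus GPUs and record a status per GPU count."""
    env = os.environ.copy()
    env["NCCL_DEBUG"] = "INFO"
    env.update(extra_env or {})
    results = {}
    for n in range(2, max_gpus + 1):
        try:
            proc = subprocess.run(
                all_reduce_cmd(n, binary),
                env=env,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            results[n] = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        except subprocess.TimeoutExpired:
            # A hang (as opposed to an error) shows up here.
            results[n] = "timeout"
        except OSError as exc:
            # For example, the binary does not exist at the given path.
            results[n] = f"could not run: {exc}"
    return results
```

Calling sweep(8) from the nccl-tests directory returns a dict such as {2: ..., 3: ..., ...}, which separates a hard failure ("exit N") from a hang ("timeout"). As a further experiment, sweep(8, extra_env={"NCCL_P2P_DISABLE": "1"}) repeats the run with NCCL's peer-to-peer transport disabled. Whether peer-to-peer is actually involved here is only a guess.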