tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce()) doesn't work with more than 2 GPUs

emad0525 · November 29, 2023, 12:11am

Hello, I am trying to train a model using tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce()) with 8 GPUs. However, when I execute model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks), the process hangs and training never starts. Training works fine with up to 2 GPUs.

Here are the logs from the NCCL tests:

make[1]: Entering directory '/ssd-data/Coefficient/nccl-tests/src'
Compiling  timer.cc                            > /ssd-data/Coefficient/nccl-tests/build/timer.o
Compiling /ssd-data/Coefficient/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /ssd-data/Coefficient/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /ssd-data/Coefficient/nccl-tests/build/common.o
Linking  /ssd-data/Coefficient/nccl-tests/build/all_reduce.o > /ssd-data/Coefficient/nccl-tests/build/all_reduce_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  all_gather.cu                       > /ssd-data/Coefficient/nccl-tests/build/all_gather.o
Linking  /ssd-data/Coefficient/nccl-tests/build/all_gather.o > /ssd-data/Coefficient/nccl-tests/build/all_gather_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  broadcast.cu                        > /ssd-data/Coefficient/nccl-tests/build/broadcast.o
Linking  /ssd-data/Coefficient/nccl-tests/build/broadcast.o > /ssd-data/Coefficient/nccl-tests/build/broadcast_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  reduce_scatter.cu                   > /ssd-data/Coefficient/nccl-tests/build/reduce_scatter.o
Linking  /ssd-data/Coefficient/nccl-tests/build/reduce_scatter.o > /ssd-data/Coefficient/nccl-tests/build/reduce_scatter_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  reduce.cu                           > /ssd-data/Coefficient/nccl-tests/build/reduce.o
Linking  /ssd-data/Coefficient/nccl-tests/build/reduce.o > /ssd-data/Coefficient/nccl-tests/build/reduce_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  alltoall.cu                         > /ssd-data/Coefficient/nccl-tests/build/alltoall.o
Linking  /ssd-data/Coefficient/nccl-tests/build/alltoall.o > /ssd-data/Coefficient/nccl-tests/build/alltoall_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  scatter.cu                          > /ssd-data/Coefficient/nccl-tests/build/scatter.o
Linking  /ssd-data/Coefficient/nccl-tests/build/scatter.o > /ssd-data/Coefficient/nccl-tests/build/scatter_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  gather.cu                           > /ssd-data/Coefficient/nccl-tests/build/gather.o
Linking  /ssd-data/Coefficient/nccl-tests/build/gather.o > /ssd-data/Coefficient/nccl-tests/build/gather_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  sendrecv.cu                         > /ssd-data/Coefficient/nccl-tests/build/sendrecv.o
Linking  /ssd-data/Coefficient/nccl-tests/build/sendrecv.o > /ssd-data/Coefficient/nccl-tests/build/sendrecv_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling  hypercube.cu                        > /ssd-data/Coefficient/nccl-tests/build/hypercube.o
Linking  /ssd-data/Coefficient/nccl-tests/build/hypercube.o > /ssd-data/Coefficient/nccl-tests/build/hypercube_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
make[1]: Leaving directory '/ssd-data/Coefficient/nccl-tests/src'

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  32984 on hiticollab device  0 [0x01] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid  32984 on hiticollab device  1 [0x25] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid  32984 on hiticollab device  2 [0x41] NVIDIA RTX A6000
hiticollab:32984:32984 [32607] NCCL INFO Bootstrap : Using eno1:170.140.29.26<0>
hiticollab:32984:32984 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
hiticollab:32984:32984 [32765] NCCL INFO NET/Plugin : No plugin found, using internal implementation
hiticollab:32984:32984 [710] NCCL INFO cudaDriverVersion 11080

hiticollab:32984:32984 [32607] init.cc:1662 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
hiticollab: Test NCCL failure common.cu:954 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. hiticollab pid 32984: Test failure common.cu:844
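A note on reading the log above: CUDA encodes versions as 1000*major + 10*minor, so cudaDriverVersion 11080 means the driver supports up to CUDA 11.8, while the linker warnings show that libnccl.so needs libcudart.so.12. Here is a minimal sketch of that check. The function names are hypothetical, and the runtime value 12000 (CUDA 12.0) is an assumption, since the log does not print the exact 12.x runtime version. It compares major versions only, which is a simplification of the real compatibility rules.

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Split a CUDA version integer (1000*major + 10*minor) into (major, minor)."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


def driver_major_too_old(driver_encoded: int, runtime_encoded: int) -> bool:
    """Return True when the driver's CUDA major version is below the runtime's.

    Simplified rule: a driver from an older major release cannot serve a
    runtime from a newer major release. Minor-version details are ignored.
    """
    driver_major, _ = decode_cuda_version(driver_encoded)
    runtime_major, _ = decode_cuda_version(runtime_encoded)
    return driver_major < runtime_major


if __name__ == "__main__":
    driver = 11080   # cudaDriverVersion from the NCCL log
    runtime = 12000  # ASSUMPTION: stands in for the libcudart.so.12 runtime
    print("driver:", decode_cuda_version(driver))
    print("runtime:", decode_cuda_version(runtime))
    print("driver too old:", driver_major_too_old(driver, runtime))
```

If this reports a mismatch, it would be consistent with the 'CUDA driver version is insufficient for CUDA runtime version' warning, which points at the driver/NCCL pairing rather than at the GPU count itself.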

System information

Linux Ubuntu 20.04
TensorFlow 2.13.1
Python 3.8.10

I set up my configuration following the recommended TensorFlow compatibility table.

Based on NCCL test logs, there seems to be a problem with my installation. Would appreciate your help.

gauthamnarayn · October 5, 2024, 8:20pm

Did you happen to find a solution? I’m facing a very similar issue with:

Linux Ubuntu 20.04
PyTorch 2.3.1
CUDA 12.4
Python 3.8.10

I double-checked, and the NCCL all-reduce test runs with 2 GPUs, but not with 3 or more.
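To pin down the GPU count at which the test starts failing, the run can be repeated with an increasing -g value. Below is a rough sketch that reuses the all_reduce_perf flags from the first post. The helper names are hypothetical, the binary path is taken from that post, and it assumes the test binary exits with a non-zero code on failure. A run that hangs instead of failing would block here, so a timeout may be needed in practice.

```python
import os
import subprocess
from typing import Callable, List, Optional


def all_reduce_cmd(num_gpus: int, binary: str = "./build/all_reduce_perf") -> List[str]:
    """Build the nccl-tests command line with the flags used earlier in the thread."""
    return [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(num_gpus)]


def run_test(num_gpus: int) -> int:
    """Run the test with NCCL_DEBUG=INFO and return the process exit code."""
    env = dict(os.environ, NCCL_DEBUG="INFO")
    return subprocess.run(all_reduce_cmd(num_gpus), env=env).returncode


def first_failing_gpu_count(
    max_gpus: int, runner: Callable[[int], int] = run_test
) -> Optional[int]:
    """Return the smallest GPU count whose run exits non-zero, or None if all pass."""
    for n in range(1, max_gpus + 1):
        if runner(n) != 0:
            return n
    return None


if __name__ == "__main__":
    # ASSUMPTION: 8 GPUs, as in the first post; adjust to the machine at hand.
    print("first failing GPU count:", first_failing_gpu_count(8))
```

The runner argument is there so the loop can be tried with a stand-in function before pointing it at the real binary.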
