Hello, I am trying to train a model using tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce())
with 8 GPUs. However, when I call model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks),
it hangs and training never starts. Training works fine with up to 2 GPUs.
Here are the build and run logs from the NCCL tests (all_reduce_perf on 3 GPUs):
make[1]: Entering directory '/ssd-data/Coefficient/nccl-tests/src'
Compiling timer.cc > /ssd-data/Coefficient/nccl-tests/build/timer.o
Compiling /ssd-data/Coefficient/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu > /ssd-data/Coefficient/nccl-tests/build/all_reduce.o
Compiling common.cu > /ssd-data/Coefficient/nccl-tests/build/common.o
Linking /ssd-data/Coefficient/nccl-tests/build/all_reduce.o > /ssd-data/Coefficient/nccl-tests/build/all_reduce_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling all_gather.cu > /ssd-data/Coefficient/nccl-tests/build/all_gather.o
Linking /ssd-data/Coefficient/nccl-tests/build/all_gather.o > /ssd-data/Coefficient/nccl-tests/build/all_gather_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling broadcast.cu > /ssd-data/Coefficient/nccl-tests/build/broadcast.o
Linking /ssd-data/Coefficient/nccl-tests/build/broadcast.o > /ssd-data/Coefficient/nccl-tests/build/broadcast_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling reduce_scatter.cu > /ssd-data/Coefficient/nccl-tests/build/reduce_scatter.o
Linking /ssd-data/Coefficient/nccl-tests/build/reduce_scatter.o > /ssd-data/Coefficient/nccl-tests/build/reduce_scatter_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling reduce.cu > /ssd-data/Coefficient/nccl-tests/build/reduce.o
Linking /ssd-data/Coefficient/nccl-tests/build/reduce.o > /ssd-data/Coefficient/nccl-tests/build/reduce_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling alltoall.cu > /ssd-data/Coefficient/nccl-tests/build/alltoall.o
Linking /ssd-data/Coefficient/nccl-tests/build/alltoall.o > /ssd-data/Coefficient/nccl-tests/build/alltoall_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling scatter.cu > /ssd-data/Coefficient/nccl-tests/build/scatter.o
Linking /ssd-data/Coefficient/nccl-tests/build/scatter.o > /ssd-data/Coefficient/nccl-tests/build/scatter_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling gather.cu > /ssd-data/Coefficient/nccl-tests/build/gather.o
Linking /ssd-data/Coefficient/nccl-tests/build/gather.o > /ssd-data/Coefficient/nccl-tests/build/gather_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling sendrecv.cu > /ssd-data/Coefficient/nccl-tests/build/sendrecv.o
Linking /ssd-data/Coefficient/nccl-tests/build/sendrecv.o > /ssd-data/Coefficient/nccl-tests/build/sendrecv_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
Compiling hypercube.cu > /ssd-data/Coefficient/nccl-tests/build/hypercube.o
Linking /ssd-data/Coefficient/nccl-tests/build/hypercube.o > /ssd-data/Coefficient/nccl-tests/build/hypercube_perf
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
make[1]: Leaving directory '/ssd-data/Coefficient/nccl-tests/src'
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 32984 on hiticollab device 0 [0x01] NVIDIA RTX A6000
# Rank 1 Group 0 Pid 32984 on hiticollab device 1 [0x25] NVIDIA RTX A6000
# Rank 2 Group 0 Pid 32984 on hiticollab device 2 [0x41] NVIDIA RTX A6000
hiticollab:32984:32984 [32607] NCCL INFO Bootstrap : Using eno1:170.140.29.26<0>
hiticollab:32984:32984 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
hiticollab:32984:32984 [32765] NCCL INFO NET/Plugin : No plugin found, using internal implementation
hiticollab:32984:32984 [710] NCCL INFO cudaDriverVersion 11080
hiticollab:32984:32984 [32607] init.cc:1662 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
hiticollab: Test NCCL failure common.cu:954 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. hiticollab pid 32984: Test failure common.cu:844
System information
- Linux Ubuntu 20.04
- TensorFlow 2.13.1
- Python 3.8.10
For my configuration, I followed the recommended TensorFlow compatibility table.
Based on the NCCL test logs (the 'CUDA driver version is insufficient for CUDA runtime version' warning), there seems to be a problem with my installation. I would appreciate your help.
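
As a reading aid for the log above, here is a minimal sketch of how the two version hints in it can be compared. It assumes the usual CUDA integer encoding (1000 * major + 10 * minor, so 11080 reads as 11.8) and that the libcudart major version named in the linker warning is the runtime that libnccl.so expects. The function names are hypothetical helpers, not part of NCCL or TensorFlow, and the check is simplified: it only compares major versions and ignores any forward-compatibility packages.

```python
import re

# Two lines copied verbatim from the log in this post.
LOG = """\
/usr/bin/ld: warning: libcudart.so.12, needed by /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libnccl.so, may conflict with libcudart.so.11.0
hiticollab:32984:32984 [710] NCCL INFO cudaDriverVersion 11080
"""


def decode_cuda_version(encoded):
    """Split CUDA's integer version (1000 * major + 10 * minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_version_from_log(log_text):
    """Return (major, minor) from an NCCL 'cudaDriverVersion' line, or None if absent."""
    match = re.search(r"cudaDriverVersion\s+(\d+)", log_text)
    return decode_cuda_version(int(match.group(1))) if match else None


def nccl_runtime_major_from_log(log_text):
    """Return the libcudart major version that the ld warning says libnccl.so needs."""
    match = re.search(
        r"libcudart\.so\.(\d+)[.\d]*, needed by \S*libnccl\.so", log_text
    )
    return int(match.group(1)) if match else None


def driver_too_old(log_text):
    """True if the driver's major version is below the runtime major libnccl.so needs.

    Simplified assumption: only major versions are compared.
    Returns None when either version cannot be found in the log.
    """
    driver = driver_version_from_log(log_text)
    runtime_major = nccl_runtime_major_from_log(log_text)
    if driver is None or runtime_major is None:
        return None
    return driver[0] < runtime_major


if __name__ == "__main__":
    print("driver version:", driver_version_from_log(LOG))
    print("runtime major needed by libnccl.so:", nccl_runtime_major_from_log(LOG))
    print("driver older than NCCL's runtime:", driver_too_old(LOG))
```

For the two log lines above, this reads the driver as CUDA 11.8 and the runtime expected by libnccl.so as CUDA 12, which is consistent with the 'CUDA driver version is insufficient for CUDA runtime version' warning.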