[2024-08-15 21:16:55.177738: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
1: [DistributedMomentumOptimizer_Allreduce/cond_81/HorovodAllreduce_gradients_AddN_42_0, DistributedMomentumOptimizer_Allreduce/cond_82/HorovodAllreduce_gradients_AddN_41_0, DistributedMomentumOptimizer_Allreduce/cond_83/HorovodAllreduce_gradients_AddN_49_0, DistributedMomentumOptimizer_Allreduce/cond_84/HorovodAllreduce_gradients_AddN_48_0, DistributedMomentumOptimizer_Allreduce/cond_89/HorovodAllreduce_gradients_AddN_13_0, DistributedMomentumOptimizer_Allreduce/cond_92/HorovodAllreduce_gradients_box_predict_BiasAdd_grad_tuple_control_dependency_1_0 ...]
[502ac2b7bffe:411 :0:1429] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 1429) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000006bb17 ncclGroupEnd() ???:0
2 0x0000000000008609 start_thread() ???:0
3 0x000000000011f133 clone() ???:0
=================================
[502ac2b7bffe:00411] *** Process received signal ***
[502ac2b7bffe:00411] Signal: Segmentation fault (11)
[502ac2b7bffe:00411] Signal code: (-6)
[502ac2b7bffe:00411] Failing at address: 0x19b
[502ac2b7bffe:00411] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x73dea6695090]
[502ac2b7bffe:00411] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6bb17)[0x73ddc3cbbb17]
[502ac2b7bffe:00411] [ 2] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x73dea6637609]
[502ac2b7bffe:00411] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x73dea6771133]
[502ac2b7bffe:00411] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 502ac2b7bffe exited on signal 11 (Segmentation fault).
If I run with 1 GPU, everything works fine; I just get an OOM after some steps, but that is expected. Running with 2, 3, or 4 GPUs gives the same error.
I read some similar posts, but none of them is about toolkit 5.0.0. It looks like a problem with the versions of CUDA, TAO, and other components. Do I need to run an older version? Which one?
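For reference, something like the following should show which versions are actually in play inside the container (the package names in the grep are my guess and may differ):
$ nvidia-smi           # driver version and the CUDA version the driver supports
$ nvcc --version       # CUDA toolkit version inside the container, if nvcc is installed
$ pip list | grep -i -E "tao|horovod|tensorflow"   # TAO / Horovod / TensorFlow package versions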
Check: sudo lspci -vvv | grep ACSCtl
– This returns empty, so I guess it is OK.
Check: dmesg | grep IOMMU
– This also returns empty, even after editing the line to GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt", then running $ sudo update-grub and $ sudo shutdown -r now.
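To double-check whether the GRUB change actually took effect, I assume I can look at the kernel command line and the boot log (on Intel hosts the IOMMU messages usually appear under DMAR, so grepping only for IOMMU can come back empty):
$ cat /proc/cmdline                        # should contain intel_iommu=on iommu=pt after the reboot
$ sudo dmesg | grep -i -e DMAR -e IOMMU    # IOMMU/DMAR initialization messages, if any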
Could you run the commands below?
root@3c7bb4b1e648:/workspace/nccl-tests# export NCCL_P2P_LEVEL=NVL
root@3c7bb4b1e648:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
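If nccl-tests is not already built in the container, it can be compiled first from the NVIDIA/nccl-tests repository (a rough sketch; the CUDA_HOME path is an assumption, and an MPI build would also need MPI=1 MPI_HOME=...):
$ git clone https://github.com/NVIDIA/nccl-tests.git /workspace/nccl-tests
$ cd /workspace/nccl-tests && make CUDA_HOME=/usr/local/cuda
Running the test with NCCL_DEBUG=INFO set gives more detail if it fails.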
It works now.
Also, I moved to an instance with bigger GPUs and more vCPUs; I think some operations were taking longer than expected and the ranks were getting out of sync.
Thanks.
root@c19a8e0f6ead:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1241 on c19a8e0f6ead device 0 [0x00] Tesla V100-SXM2-16GB
# Rank 1 Group 0 Pid 1241 on c19a8e0f6ead device 1 [0x00] Tesla V100-SXM2-16GB
# Rank 2 Group 0 Pid 1241 on c19a8e0f6ead device 2 [0x00] Tesla V100-SXM2-16GB
# Rank 3 Group 0 Pid 1241 on c19a8e0f6ead device 3 [0x00] Tesla V100-SXM2-16GB
# Rank 4 Group 0 Pid 1241 on c19a8e0f6ead device 4 [0x00] Tesla V100-SXM2-16GB
# Rank 5 Group 0 Pid 1241 on c19a8e0f6ead device 5 [0x00] Tesla V100-SXM2-16GB
# Rank 6 Group 0 Pid 1241 on c19a8e0f6ead device 6 [0x00] Tesla V100-SXM2-16GB
# Rank 7 Group 0 Pid 1241 on c19a8e0f6ead device 7 [0x00] Tesla V100-SXM2-16GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 43.78 0.00 0.00 0 45.52 0.00 0.00 0
16 4 float sum -1 45.51 0.00 0.00 0 45.18 0.00 0.00 0
32 8 float sum -1 45.64 0.00 0.00 0 46.93 0.00 0.00 0
64 16 float sum -1 47.06 0.00 0.00 0 44.74 0.00 0.00 0
128 32 float sum -1 47.71 0.00 0.00 0 45.69 0.00 0.00 0
256 64 float sum -1 45.83 0.01 0.01 0 51.23 0.00 0.01 0
512 128 float sum -1 44.60 0.01 0.02 0 48.61 0.01 0.02 0
1024 256 float sum -1 47.76 0.02 0.04 0 47.58 0.02 0.04 0
2048 512 float sum -1 47.20 0.04 0.08 0 47.64 0.04 0.08 0
4096 1024 float sum -1 47.34 0.09 0.15 0 48.51 0.08 0.15 0
8192 2048 float sum -1 46.21 0.18 0.31 0 49.14 0.17 0.29 0
16384 4096 float sum -1 47.92 0.34 0.60 0 50.02 0.33 0.57 0
32768 8192 float sum -1 46.64 0.70 1.23 0 47.66 0.69 1.20 0
65536 16384 float sum -1 52.74 1.24 2.17 0 49.55 1.32 2.31 0
131072 32768 float sum -1 55.55 2.36 4.13 0 54.29 2.41 4.23 0
262144 65536 float sum -1 64.34 4.07 7.13 0 64.35 4.07 7.13 0
524288 131072 float sum -1 74.47 7.04 12.32 0 75.53 6.94 12.15 0
1048576 262144 float sum -1 77.08 13.60 23.81 0 77.13 13.59 23.79 0
2097152 524288 float sum -1 99.15 21.15 37.02 0 97.35 21.54 37.70 0
4194304 1048576 float sum -1 145.9 28.74 50.29 0 147.1 28.52 49.92 0
8388608 2097152 float sum -1 233.5 35.93 62.88 0 234.3 35.81 62.66 0
16777216 4194304 float sum -1 288.2 58.21 101.87 0 289.4 57.97 101.45 0
33554432 8388608 float sum -1 490.8 68.36 119.63 0 488.7 68.66 120.15 0
67108864 16777216 float sum -1 921.5 72.82 127.44 0 922.2 72.77 127.34 0
134217728 33554432 float sum -1 1783.8 75.24 131.68 0 1787.0 75.11 131.44 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 27.3088
#
I still get the warning, but the training keeps going:
[2024-08-17 11:53:19.924885: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
6: [DistributedMomentumOptimizer_Allreduce/cond_81/HorovodAllreduce_gradients_AddN_42_0, DistributedMomentumOptimizer_Allreduce/cond_82/HorovodAllreduce_gradients_AddN_41_0, DistributedMomentumOptimizer_Allreduce/cond_83/HorovodAllreduce_gradients_AddN_49_0, DistributedMomentumOptimizer_Allreduce/cond_84/HorovodAllreduce_gradients_AddN_48_0]
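If the warning is only noise while slower ranks catch up, my understanding from the Horovod docs is that the stall inspector can be tuned through environment variables set before launching the training (defaults may differ between Horovod versions):
$ export HOROVOD_LOG_LEVEL=info                    # more detail on what each rank is doing
$ export HOROVOD_STALL_CHECK_TIME_SECONDS=120      # wait longer before printing the stall warning
$ export HOROVOD_STALL_SHUTDOWN_TIME_SECONDS=0     # I believe 0 (the default) keeps the inspector from aborting the job on a stall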