TAO not running when using multiple GPUs

I’m using an AWS g6.12xlarge instance (4x NVIDIA L4) to train a Mask R-CNN model.

I’m running the toolkit with the command:
docker run -it --rm --gpus all -v /home/ubuntu:/workspace nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
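
As a sanity check (plain nvidia-smi, nothing TAO-specific), the same image can be used to confirm that Docker exposes all four GPUs to the container:

$ docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 nvidia-smi -L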

NVIDIA-SMI:
smi.txt (2.8 KB)

specs:
specs.txt (2.2 KB)

The command run from inside the Docker container:
mask_rcnn train -e /workspace/tao/specs/maskrcnn_train_resnet18.txt -d /workspace/tao/mask_rcnn/experiment_dir_unpruned --gpus 4

returns the error:
error.txt (18.0 KB)

[2024-08-15 21:16:55.177738: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
1: [DistributedMomentumOptimizer_Allreduce/cond_81/HorovodAllreduce_gradients_AddN_42_0, DistributedMomentumOptimizer_Allreduce/cond_82/HorovodAllreduce_gradients_AddN_41_0, DistributedMomentumOptimizer_Allreduce/cond_83/HorovodAllreduce_gradients_AddN_49_0, DistributedMomentumOptimizer_Allreduce/cond_84/HorovodAllreduce_gradients_AddN_48_0, DistributedMomentumOptimizer_Allreduce/cond_89/HorovodAllreduce_gradients_AddN_13_0, DistributedMomentumOptimizer_Allreduce/cond_92/HorovodAllreduce_gradients_box_predict_BiasAdd_grad_tuple_control_dependency_1_0 ...]
[502ac2b7bffe:411  :0:1429] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:   1429) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000006bb17 ncclGroupEnd()  ???:0
 2 0x0000000000008609 start_thread()  ???:0
 3 0x000000000011f133 clone()  ???:0
=================================
[502ac2b7bffe:00411] *** Process received signal ***
[502ac2b7bffe:00411] Signal: Segmentation fault (11)
[502ac2b7bffe:00411] Signal code:  (-6)
[502ac2b7bffe:00411] Failing at address: 0x19b
[502ac2b7bffe:00411] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x73dea6695090]
[502ac2b7bffe:00411] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6bb17)[0x73ddc3cbbb17]
[502ac2b7bffe:00411] [ 2] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x73dea6637609]
[502ac2b7bffe:00411] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x73dea6771133]
[502ac2b7bffe:00411] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 502ac2b7bffe exited on signal 11 (Segmentation fault).

If I run with 1 GPU, everything works fine; I just get OOM after some steps, which is expected. Running with 2, 3, or 4 GPUs gives the same error.

I read some other similar posts, but none of them uses toolkit 5.0.0. It looks like a problem with the versions of CUDA, TAO, and other components. Do I need to run an older version? Which one?

Could you please run the nccl test inside the docker container?

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
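
If the test fails, it may also help to enable NCCL’s standard debug logging before rerunning, to get more detail on where it crashes:

$ export NCCL_DEBUG=INFO
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4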

root@2286f4ee3cf4:/workspace/nccl-tests#  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1219 on 2286f4ee3cf4 device  0 [0x38] NVIDIA L4
#  Rank  1 Group  0 Pid   1219 on 2286f4ee3cf4 device  1 [0x3a] NVIDIA L4
#  Rank  2 Group  0 Pid   1219 on 2286f4ee3cf4 device  2 [0x3c] NVIDIA L4
#  Rank  3 Group  0 Pid   1219 on 2286f4ee3cf4 device  3 [0x3e] NVIDIA L4
[2286f4ee3cf4:1219 :0:1228] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@2286f4ee3cf4:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1236 on 2286f4ee3cf4 device  0 [0x38] NVIDIA L4
#  Rank  1 Group  0 Pid   1236 on 2286f4ee3cf4 device  1 [0x3a] NVIDIA L4
#  Rank  2 Group  0 Pid   1236 on 2286f4ee3cf4 device  2 [0x3c] NVIDIA L4
[2286f4ee3cf4:1236 :0:1246] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)

Please check ACS and IOMMU via TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck - #27 by Morganh

Check: sudo lspci -vvv | grep ACSCtl
– This returns empty, so I guess it is OK.

Check: dmesg | grep IOMMU
– This also returns empty, even after editing the line GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt" in /etc/default/grub, then running $ sudo update-grub and $ sudo shutdown -r now.
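
For reference, two generic checks (standard Linux commands, not from the TAO docs) to verify whether the kernel actually picked up the new parameters after the reboot:

$ cat /proc/cmdline
$ sudo dmesg | grep -i -e DMAR -e IOMMU

On a virtualized AWS instance both may legitimately come back without IOMMU entries, since the guest may not expose an IOMMU at all.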

Could you run as below?
root@3c7bb4b1e648:/workspace/nccl-tests# export NCCL_P2P_LEVEL=NVL
root@3c7bb4b1e648:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
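
If that works, the same variable can be passed into the TAO container at launch time (using Docker’s standard -e flag) so the training run also picks it up; a minimal sketch reusing the docker command from the first post:

$ docker run -it --rm --gpus all -e NCCL_P2P_LEVEL=NVL -v /home/ubuntu:/workspace nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5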

It works now.
Also, I switched to an instance with bigger GPUs and more vCPUs; I think some operations were taking longer than expected and the ranks were getting out of sync.
Thanks.

root@c19a8e0f6ead:/workspace/nccl-tests#  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1241 on c19a8e0f6ead device  0 [0x00] Tesla V100-SXM2-16GB
#  Rank  1 Group  0 Pid   1241 on c19a8e0f6ead device  1 [0x00] Tesla V100-SXM2-16GB
#  Rank  2 Group  0 Pid   1241 on c19a8e0f6ead device  2 [0x00] Tesla V100-SXM2-16GB
#  Rank  3 Group  0 Pid   1241 on c19a8e0f6ead device  3 [0x00] Tesla V100-SXM2-16GB
#  Rank  4 Group  0 Pid   1241 on c19a8e0f6ead device  4 [0x00] Tesla V100-SXM2-16GB
#  Rank  5 Group  0 Pid   1241 on c19a8e0f6ead device  5 [0x00] Tesla V100-SXM2-16GB
#  Rank  6 Group  0 Pid   1241 on c19a8e0f6ead device  6 [0x00] Tesla V100-SXM2-16GB
#  Rank  7 Group  0 Pid   1241 on c19a8e0f6ead device  7 [0x00] Tesla V100-SXM2-16GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    43.78    0.00    0.00      0    45.52    0.00    0.00      0
          16             4     float     sum      -1    45.51    0.00    0.00      0    45.18    0.00    0.00      0
          32             8     float     sum      -1    45.64    0.00    0.00      0    46.93    0.00    0.00      0
          64            16     float     sum      -1    47.06    0.00    0.00      0    44.74    0.00    0.00      0
         128            32     float     sum      -1    47.71    0.00    0.00      0    45.69    0.00    0.00      0
         256            64     float     sum      -1    45.83    0.01    0.01      0    51.23    0.00    0.01      0
         512           128     float     sum      -1    44.60    0.01    0.02      0    48.61    0.01    0.02      0
        1024           256     float     sum      -1    47.76    0.02    0.04      0    47.58    0.02    0.04      0
        2048           512     float     sum      -1    47.20    0.04    0.08      0    47.64    0.04    0.08      0
        4096          1024     float     sum      -1    47.34    0.09    0.15      0    48.51    0.08    0.15      0
        8192          2048     float     sum      -1    46.21    0.18    0.31      0    49.14    0.17    0.29      0
       16384          4096     float     sum      -1    47.92    0.34    0.60      0    50.02    0.33    0.57      0
       32768          8192     float     sum      -1    46.64    0.70    1.23      0    47.66    0.69    1.20      0
       65536         16384     float     sum      -1    52.74    1.24    2.17      0    49.55    1.32    2.31      0
      131072         32768     float     sum      -1    55.55    2.36    4.13      0    54.29    2.41    4.23      0
      262144         65536     float     sum      -1    64.34    4.07    7.13      0    64.35    4.07    7.13      0
      524288        131072     float     sum      -1    74.47    7.04   12.32      0    75.53    6.94   12.15      0
     1048576        262144     float     sum      -1    77.08   13.60   23.81      0    77.13   13.59   23.79      0
     2097152        524288     float     sum      -1    99.15   21.15   37.02      0    97.35   21.54   37.70      0
     4194304       1048576     float     sum      -1    145.9   28.74   50.29      0    147.1   28.52   49.92      0
     8388608       2097152     float     sum      -1    233.5   35.93   62.88      0    234.3   35.81   62.66      0
    16777216       4194304     float     sum      -1    288.2   58.21  101.87      0    289.4   57.97  101.45      0
    33554432       8388608     float     sum      -1    490.8   68.36  119.63      0    488.7   68.66  120.15      0
    67108864      16777216     float     sum      -1    921.5   72.82  127.44      0    922.2   72.77  127.34      0
   134217728      33554432     float     sum      -1   1783.8   75.24  131.68      0   1787.0   75.11  131.44      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 27.3088 
#

I still get the warning below, but the training keeps going:

[2024-08-17 11:53:19.924885: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
6: [DistributedMomentumOptimizer_Allreduce/cond_81/HorovodAllreduce_gradients_AddN_42_0, DistributedMomentumOptimizer_Allreduce/cond_82/HorovodAllreduce_gradients_AddN_41_0, DistributedMomentumOptimizer_Allreduce/cond_83/HorovodAllreduce_gradients_AddN_49_0, DistributedMomentumOptimizer_Allreduce/cond_84/HorovodAllreduce_gradients_AddN_48_0]
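
For reference: if the warning turns out to be harmless noise while training progresses, Horovod’s stall inspector can be tuned via environment variables (assuming the Horovod build inside this TAO image honors them), set before launching the training command:

$ export HOROVOD_STALL_CHECK_TIME_SECONDS=120   # raise the warning threshold from the default 60 s
$ export HOROVOD_STALL_CHECK_DISABLE=1          # or silence the stall check entirely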

Do you mean the training can run now? Can you check the full log to see if it still contains lots of "Missing ranks" warnings?

Yes, training is running, and I do still get "Missing ranks" warnings:

Missing ranks:
6: [DistributedMomentumOptimizer_Allreduce/cond_81/HorovodAllreduce_gradients_AddN_42_0, DistributedMomentumOptimizer_Allreduce/cond_82/HorovodAllreduce_gradients_AddN_41_0, DistributedMomentumOptimizer_Allreduce/cond_83/HorovodAllreduce_gradients_AddN_49_0, DistributedMomentumOptimizer_Allreduce/cond_84/HorovodAllreduce_gradients_AddN_48_0]

Can you upload the latest full log?

training_log.txt (48.1 KB)

OK, so the training is running now.
