More than 1 GPU not working using TAO train

Please provide the following information when requesting support.

• Hardware: 4x RTX A6000 GPUs
• Network Type: DetectNet_v2

Hello, I’m trying to train my model using 3 GPUs instead of 1. However, when I run the tao train command, I get the following error:

`Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node fc3a502973d9 exited on signal 7 (Bus error).

Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-03-01 08:57:18,531 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.`

I don’t understand what the error is telling me. I’ve attached the training log and error log for you to view.

training_log (5.9 KB)

error_log (93.9 KB)

Thanks.

From the error log:

2023-03-01 08:56:34,799 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-01 08:56:42,278 [INFO] tensorflow: Saving checkpoints for step-0.
fc3a502973d9:255:690 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
fc3a502973d9:255:690 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
fc3a502973d9:255:690 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
fc3a502973d9:255:690 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
fc3a502973d9:255:690 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
fc3a502973d9:255:690 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.1+cuda11.8
fc3a502973d9:255:690 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
fc3a502973d9:255:690 [0] NCCL INFO P2P plugin IBext
fc3a502973d9:255:690 [0] NCCL INFO NET/IB : No device found.
fc3a502973d9:255:690 [0] NCCL INFO NET/IB : No device found.
fc3a502973d9:255:690 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
fc3a502973d9:255:690 [0] NCCL INFO Using network Socket
fc3a502973d9:255:690 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
fc3a502973d9:255:690 [0] NCCL INFO Channel 00/02 :    0   1   2
fc3a502973d9:255:690 [0] NCCL INFO Channel 01/02 :    0   1   2
fc3a502973d9:255:690 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
fc3a502973d9:255:690 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
fc3a502973d9:255:690 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via SHM/direct/direct
fc3a502973d9:255:690 [0] NCCL INFO Connected all rings
[fc3a502973d9:255  :0:690] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:    690) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x00000000000755bd ncclGroupEnd()  ???:0
 3 0x000000000007a74f ncclGroupEnd()  ???:0
 4 0x0000000000059e67 ncclGetUniqueId()  ???:0
 5 0x0000000000048b3b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a5c2 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x0000000000064f66 ncclRedOpDestroy()  ???:0
 8 0x000000000004ae0b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 9 0x000000000004b068 ncclCommInitRank()  ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm()  /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute()  /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce()  /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[fc3a502973d9:00255] *** Process received signal ***
[fc3a502973d9:00255] Signal: Bus error (7)
[fc3a502973d9:00255] Signal code:  (-6)
[fc3a502973d9:00255] Failing at address: 0x3e8000000ff
[fc3a502973d9:00255] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fe894709090]
[fc3a502973d9:00255] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7fe894851b41]
[fc3a502973d9:00255] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x755bd)[0x7fe78606d5bd]
[fc3a502973d9:00255] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7a74f)[0x7fe78607274f]
[fc3a502973d9:00255] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e67)[0x7fe786051e67]
[fc3a502973d9:00255] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x48b3b)[0x7fe786040b3b]
[fc3a502973d9:00255] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a5c2)[0x7fe7860425c2]
[fc3a502973d9:00255] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x64f66)[0x7fe78605cf66]
[fc3a502973d9:00255] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae0b)[0x7fe786042e0b]
[fc3a502973d9:00255] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7fe786043068]
[fc3a502973d9:00255] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7fe769282354]
[fc3a502973d9:00255] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7fe769282581]
[fc3a502973d9:00255] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fe7692443cd]
[fc3a502973d9:00255] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7fe7692447fc]
[fc3a502973d9:00255] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7fe76921302d]
[fc3a502973d9:00255] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fe893a71de4]
[fc3a502973d9:00255] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fe8946ab609]
[fc3a502973d9:00255] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fe8947e5133]
[fc3a502973d9:00255] *** End of error message ***

You are running with WSL, right?

Can you share the result of $ nvidia-smi?

Thu Mar  2 15:21:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:3B:00.0 Off |                  Off |
|  0%   57C    P2   286W / 300W |   9916MiB / 49140MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:5E:00.0 Off |                  Off |
|  0%   45C    P8    20W / 300W |     15MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    Off  | 00000000:86:00.0 Off |                  Off |
|  0%   45C    P8    23W / 300W |     14MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    Off  | 00000000:AF:00.0  On |                  Off |
|  0%   47C    P8    30W / 300W |    641MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2288      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2855      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    798840      C   /usr/bin/python3.6                260MiB |
|    0   N/A  N/A    799061      C   python3.6                        9638MiB |
|    1   N/A  N/A      2288      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2855      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2288      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2855      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2288      G   /usr/lib/xorg/Xorg                110MiB |
|    3   N/A  N/A      2855      G   /usr/lib/xorg/Xorg                264MiB |
|    3   N/A  N/A      2985      G   /usr/bin/gnome-shell               92MiB |
|    3   N/A  N/A    642139      G   /usr/lib/firefox/firefox          157MiB |
+-----------------------------------------------------------------------------+

I’m not using WSL; I’m using an Ubuntu computer.

Thanks.

To narrow down, could you try using the 520 driver instead?

sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean

sudo apt install nvidia-driver-520
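
If apt still pulls in the 525 driver after this, you can check which driver packages apt can actually see and which driver ends up loaded; a quick sketch (package names assumed, adjust to your setup):

# Show which nvidia-driver versions apt can resolve and from which repository
$ apt-cache policy nvidia-driver-520 nvidia-driver-525
# Confirm the driver version that is actually loaded (after a reboot)
$ nvidia-smi --query-gpu=driver_version --format=csv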

Also, to narrow down, can you run inside the tao docker and update NCCL?
Refer to the steps mentioned in https://developer.nvidia.com/nccl/nccl-download, then run the following command to install NCCL:

For Ubuntu: sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0
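
For reference, the network-repository setup described on that page is roughly as follows for Ubuntu 20.04 (a sketch; the keyring file name and package versions may have changed):

# Add the NVIDIA CUDA network repository, which provides the libnccl2 packages
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo apt-get update
# The pinned NCCL install should then resolve
$ sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0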

Also, which Ubuntu version are you running: 20.04 or 22.04?

Hello,

So I tried installing nvidia-driver-520; however, it just downloads driver 525 instead. I’m not sure why.

So I followed the instructions you mentioned.

When I run sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0

I get

sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package libnccl2
E: Unable to locate package libnccl-dev

I am using Ubuntu 20.04.

Can you run
$ apt-get update
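
After the update, you can verify that the NCCL packages are visible before installing; for example:

# If this still reports the packages as unknown, the CUDA/NCCL repository is not configured yet
$ apt-cache policy libnccl2 libnccl-dev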

I tried the command and I still get a similar error (shown in the attached log):
error_log (93.9 KB)

Sorry for the late reply. Could you please pull the latest docker image and run with it?

docker pull nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5

Then log in to the docker container, similar to below.
$ docker run --runtime=nvidia -it --rm -v yourlocalfolder:/workspace nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then, run the command without the “tao” prefix.
# detectnet_v2 train blabla

Okay, so I switched from using the tao CLI tool to running from the docker container directly.

I successfully created the tfrecords from the docker container directly.

For training, this is the command I used:

detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt  -r /workspace/detectnet_v2/experiment_dir_unpruned  -k tlt_encode  -n resnet18_detector  --gpus 4 

The result I got is shown below; it still didn’t run with 4 GPUs.

 File "<frozen iva.detectnet_v2.scripts.train>", line 1011, in <module>
  File "<decorator-gen-117>", line 2, in main
  File "<frozen iva.detectnet_v2.utilities.timer>", line 46, in wrapped_fn
  File "<frozen iva.detectnet_v2.scripts.train>", line 994, in main
  File "<frozen iva.detectnet_v2.scripts.train>", line 853, in run_experiment
  File "<frozen iva.detectnet_v2.scripts.train>", line 680, in train_gridbox
  File "<frozen iva.detectnet_v2.training.training_proto_utilities>", line 109, in build_learning_rate_schedule
  File "<frozen moduluspy.modulus.hooks.utils>", line 40, in get_softstart_annealing_learning_rate
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/tf_should_use.py", line 198, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 173, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1235, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 171, in true_assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 74, in _assert
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()



INFO:tensorflow:Saving checkpoints for step-64500.
2023-03-22 09:53:44,010 [INFO] tensorflow: Saving checkpoints for step-64500.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21229,1],2]
  Exit code:    1
--------------------------------------------------------------------------
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

To narrow down, did you ever run successfully with 1 GPU?

Yes, I’ve even tried with 2 GPUs and that was okay. With any more GPUs, an error occurs.

So, may I conclude that:
1 GPU → no error
2 GPUs → no error
3 GPUs → has error
4 GPUs → has error

Yes, correct.

Are all the experiments using the same spec file? Could you share it with us?

Yes, this is the training spec file:
detectnet_v2_train_resnet18_kitti.txt (5.9 KB)

For 3 GPUs, can you use a new result folder and retry? For example:
-r /workspace/detectnet_v2/experiment_dir_unpruned_3gpu

For 4 GPUs, can you also use a new result folder and retry? For example:
-r /workspace/detectnet_v2/experiment_dir_unpruned_4gpu
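
Reusing your earlier training command, the retries would look roughly like below (paths taken from your previous posts):

# 3-GPU retry writing to a fresh result folder
$ detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt -r /workspace/detectnet_v2/experiment_dir_unpruned_3gpu -k tlt_encode -n resnet18_detector --gpus 3
# 4-GPU retry writing to a fresh result folder
$ detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt -r /workspace/detectnet_v2/experiment_dir_unpruned_4gpu -k tlt_encode -n resnet18_detector --gpus 4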

So doing that with 4 GPUs, I get the following error:

2023-03-22 14:21:11,533 [INFO] __main__: Found 1400 samples in validation set
2023-03-22 14:21:11,533 [INFO] root: Rasterizing tensors.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:11,658 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:21:11,762 [INFO] root: Tensors rasterized.
2023-03-22 14:21:12,110 [INFO] root: Validation graph built.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:12,944 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:13,403 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:13,444 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:21:13,455 [INFO] root: Running training loop.
2023-03-22 14:21:13,456 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:21:13,456 [INFO] __main__: Scalars logged at every 10 steps
2023-03-22 14:21:13,456 [INFO] __main__: Images logged at every 2690 steps
INFO:tensorflow:Create CheckpointSaverHook.
2023-03-22 14:21:13,461 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:13,919 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-03-22 14:21:16,203 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:18,931 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:19,679 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-22 14:21:30,357 [INFO] tensorflow: Saving checkpoints for step-0.
5d5684b73250:6360:6921 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:6360:6921 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:6360:6921 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:6360:6921 [0] NCCL INFO P2P plugin IBext
5d5684b73250:6360:6921 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:6360:6921 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:6360:6921 [0] NCCL INFO Using network Socket
5d5684b73250:6360:6921 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:6360:6921 [0] NCCL INFO Channel 00/04 :    0   1   2   3
5d5684b73250:6360:6921 [0] NCCL INFO Channel 01/04 :    0   3   2   1
5d5684b73250:6360:6921 [0] NCCL INFO Channel 02/04 :    0   1   2   3
5d5684b73250:6360:6921 [0] NCCL INFO Channel 03/04 :    0   3   2   1
5d5684b73250:6360:6921 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
5d5684b73250:6360:6921 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:6360:6921 [0] NCCL INFO Channel 02 : 0[3b000] -> 1[5e000] via SHM/direct/direct
[5d5684b73250:6360 :0:6921] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:6363 :0:6931] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   6921) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000007b0ef ncclGroupEnd()  ???:0
 4 0x0000000000059e97 ncclGetUniqueId()  ???:0
 5 0x00000000000489b1 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x00000000000652a6 ncclRedOpDestroy()  ???:0
 8 0x000000000004ae3b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 9 0x000000000004b098 ncclCommInitRank()  ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm()  /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute()  /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce()  /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:06360] *** Process received signal ***
[5d5684b73250:06360] Signal: Bus error (7)
[5d5684b73250:06360] Signal code:  (-6)
[5d5684b73250:06360] Failing at address: 0x18d8
[5d5684b73250:06360] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f2d243b0090]
[5d5684b73250:06360] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7f2d244f8b41]
[5d5684b73250:06360] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7f2c15ce287d]
[5d5684b73250:06360] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7b0ef)[0x7f2c15ce80ef]
[5d5684b73250:06360] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e97)[0x7f2c15cc6e97]
[5d5684b73250:06360] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x489b1)[0x7f2c15cb59b1]
[5d5684b73250:06360] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a655)[0x7f2c15cb7655]
[5d5684b73250:06360] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x652a6)[0x7f2c15cd22a6]
[5d5684b73250:06360] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae3b)[0x7f2c15cb7e3b]
[5d5684b73250:06360] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7f2c15cb8098]
[5d5684b73250:06360] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7f2b6d7c1354]
[5d5684b73250:06360] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7f2b6d7c1581]
[5d5684b73250:06360] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7f2b6d7833cd]
[5d5684b73250:06360] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7f2b6d7837fc]
[5d5684b73250:06360] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7f2b6d75202d]
[5d5684b73250:06360] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f2d23718de4]
[5d5684b73250:06360] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2d24352609]
[5d5684b73250:06360] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2d2448c133]
[5d5684b73250:06360] *** End of error message ***
==== backtrace (tid:   6931) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000007b0ef ncclGroupEnd()  ???:0
 4 0x0000000000059e97 ncclGetUniqueId()  ???:0
 5 0x00000000000489b1 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x00000000000652a6 ncclRedOpDestroy()  ???:0
 8 0x000000000004ae3b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 9 0x000000000004b098 ncclCommInitRank()  ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm()  /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute()  /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce()  /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:06363] *** Process received signal ***
[5d5684b73250:06363] Signal: Bus error (7)
[5d5684b73250:06363] Signal code:  (-6)
[5d5684b73250:06363] Failing at address: 0x18db
[5d5684b73250:06363] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff118fff090]
[5d5684b73250:06363] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7ff119147b41]
[5d5684b73250:06363] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7ff00a93187d]
[5d5684b73250:06363] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7b0ef)[0x7ff00a9370ef]
[5d5684b73250:06363] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e97)[0x7ff00a915e97]
[5d5684b73250:06363] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x489b1)[0x7ff00a9049b1]
[5d5684b73250:06363] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a655)[0x7ff00a906655]
[5d5684b73250:06363] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x652a6)[0x7ff00a9212a6]
[5d5684b73250:06363] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae3b)[0x7ff00a906e3b]
[5d5684b73250:06363] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7ff00a907098]
[5d5684b73250:06363] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7ff005d87354]
[5d5684b73250:06363] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7ff005d87581]
[5d5684b73250:06363] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7ff005d493cd]
[5d5684b73250:06363] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7ff005d497fc]
[5d5684b73250:06363] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7ff005d1802d]
[5d5684b73250:06363] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7ff118367de4]
[5d5684b73250:06363] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7ff118fa1609]
[5d5684b73250:06363] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff1190db133]
[5d5684b73250:06363] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 5d5684b73250 exited on signal 7 (Bus error).

For 3 GPUs, I get this error:

2023-03-22 14:24:52,808 [INFO] root: Tensors rasterized.
2023-03-22 14:24:53,000 [INFO] __main__: Found 8600 samples in training set
2023-03-22 14:24:53,006 [INFO] root: Rasterizing tensors.
2023-03-22 14:24:53,224 [INFO] root: Tensors rasterized.
2023-03-22 14:24:53,493 [INFO] root: Training graph built.
2023-03-22 14:24:53,493 [INFO] root: Running training loop.
2023-03-22 14:24:53,493 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:53,493 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:53,493 [INFO] __main__: Images logged at every 0 steps
2023-03-22 14:24:54,763 [INFO] root: Training graph built.
2023-03-22 14:24:54,763 [INFO] root: Running training loop.
2023-03-22 14:24:54,763 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:54,763 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:54,763 [INFO] __main__: Images logged at every 0 steps
INFO:tensorflow:Graph was finalized.
2023-03-22 14:24:54,791 [INFO] tensorflow: Graph was finalized.
2023-03-22 14:24:56,127 [INFO] root: Training graph built.
2023-03-22 14:24:56,127 [INFO] root: Building validation graph.
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 80, io threads: 160, compute threads: 80, buffered batches: 4
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 1400, number of sources: 1, batch size per gpu: 4, steps: 350
WARNING:tensorflow:Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,140 [WARNING] tensorflow: Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Graph was finalized.
2023-03-22 14:24:56,145 [INFO] tensorflow: Graph was finalized.
2023-03-22 14:24:56,159 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2023-03-22 14:24:56,393 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2023-03-22 14:24:56,397 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2023-03-22 14:24:56,397 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,412 [WARNING] tensorflow: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,640 [INFO] __main__: Found 1400 samples in validation set
2023-03-22 14:24:56,640 [INFO] root: Rasterizing tensors.
2023-03-22 14:24:56,857 [INFO] root: Tensors rasterized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:24:57,104 [INFO] tensorflow: Running local_init_op.
2023-03-22 14:24:57,184 [INFO] root: Validation graph built.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:24:57,570 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:24:58,493 [INFO] root: Running training loop.
2023-03-22 14:24:58,494 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:58,494 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:58,494 [INFO] __main__: Images logged at every 3585 steps
INFO:tensorflow:Create CheckpointSaverHook.
2023-03-22 14:24:58,497 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:24:58,607 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:24:59,102 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-03-22 14:25:01,129 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:25:03,774 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:25:04,510 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-22 14:25:14,773 [INFO] tensorflow: Saving checkpoints for step-0.
5d5684b73250:8162:8585 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:8162:8585 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:8162:8585 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:8162:8585 [0] NCCL INFO P2P plugin IBext
5d5684b73250:8162:8585 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:8162:8585 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:8162:8585 [0] NCCL INFO Using network Socket
5d5684b73250:8162:8585 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:8162:8585 [0] NCCL INFO Channel 00/02 :    0   1   2
5d5684b73250:8162:8585 [0] NCCL INFO Channel 01/02 :    0   1   2
5d5684b73250:8162:8585 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
5d5684b73250:8162:8585 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:8162:8585 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:8162:8585 [0] NCCL INFO Connected all rings
[5d5684b73250:8165 :0:9402] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:8162 :0:8585] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   9402) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000006b246 ncclGroupEnd()  ???:0
 4 0x0000000000008609 start_thread()  ???:0
 5 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:08165] *** Process received signal ***
[5d5684b73250:08165] Signal: Bus error (7)
[5d5684b73250:08165] Signal code:  (-6)
[5d5684b73250:08165] Failing at address: 0x1fe5
[5d5684b73250:08165] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff4efdb0090]
[5d5684b73250:08165] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7ff4efef8b41]
[5d5684b73250:08165] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7ff3e16e287d]
[5d5684b73250:08165] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6b246)[0x7ff3e16d8246]
[5d5684b73250:08165] [ 4] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7ff4efd52609]
[5d5684b73250:08165] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff4efe8c133]
[5d5684b73250:08165] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node 5d5684b73250 exited on signal 7 (Bus error).

Can you run nccl-tests? Please run it inside the tao docker (I think you are already logged in to the tao docker).
Then, please run nccl-tests as below for 3 GPUs or 4 GPUs.
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
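
If the build cannot locate the CUDA toolkit, or if you want more verbose NCCL output during the test, the following may help (a sketch; the CUDA path inside the container may differ):

# Point the nccl-tests build at the CUDA toolkit explicitly
$ make CUDA_HOME=/usr/local/cuda
# Re-run the test with verbose NCCL logging for debugging
$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3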

Please share the log with us.