More than 1 GPU not working when using TAO train

Also, to narrow down the issue, can you run inside the TAO docker and update NCCL?
Refer to the steps at https://developer.nvidia.com/nccl/nccl-download,

then run the following command to install NCCL:
For Ubuntu: sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0
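Optionally, before upgrading you can first check which NCCL version the container currently has:
$ dpkg -l | grep libnccl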

Also, which Ubuntu version are you running? 20.04 or 22.04?

Hello,

So I tried installing nvidia-driver-520, but it just downloads driver 525 instead. I'm not sure why.

I followed the instructions you mentioned.

When I run sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0

I get

sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package libnccl2
E: Unable to locate package libnccl-dev

I am using Ubuntu 20.04

Can you run
$ apt-get update

I tried the command and I still get a similar error (log attached):
error_log (93.9 KB)

Sorry for the late reply. Could you please pull the latest docker image to run?

docker pull nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5

Then log in to the docker, similar to below.
$ docker run --runtime=nvidia -it --rm -v yourlocalfolder:/workspace nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then, run the command without the “tao” prefix.
# detectnet_v2 train blabla
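Once inside the container, you can optionally confirm that all GPUs are visible before starting training, e.g.:
# nvidia-smi -L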

Okay, so I switched from using the tao CLI tool to running from the docker directly.

I successfully created the tfrecords from the docker directly.

For training, this is the command I used:

detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt  -r /workspace/detectnet_v2/experiment_dir_unpruned  -k tlt_encode  -n resnet18_detector  --gpus 4 

The result I got is shown below; it still didn’t run with 4 GPUs.

 File "<frozen iva.detectnet_v2.scripts.train>", line 1011, in <module>
  File "<decorator-gen-117>", line 2, in main
  File "<frozen iva.detectnet_v2.utilities.timer>", line 46, in wrapped_fn
  File "<frozen iva.detectnet_v2.scripts.train>", line 994, in main
  File "<frozen iva.detectnet_v2.scripts.train>", line 853, in run_experiment
  File "<frozen iva.detectnet_v2.scripts.train>", line 680, in train_gridbox
  File "<frozen iva.detectnet_v2.training.training_proto_utilities>", line 109, in build_learning_rate_schedule
  File "<frozen moduluspy.modulus.hooks.utils>", line 40, in get_softstart_annealing_learning_rate
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/tf_should_use.py", line 198, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 173, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1235, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 171, in true_assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 74, in _assert
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()



INFO:tensorflow:Saving checkpoints for step-64500.
2023-03-22 09:53:44,010 [INFO] tensorflow: Saving checkpoints for step-64500.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21229,1],2]
  Exit code:    1
--------------------------------------------------------------------------
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

To narrow down the issue, did you ever run successfully with 1 GPU?

Yes, I’ve even tried with 2 GPUs and that was okay. With any more, an error occurs.

So, may I conclude that:
1 GPU → no error
2 GPUs → no error
3 GPUs → has error
4 GPUs → has error

Yes, correct.

Are all the experiments using the same spec file? Could you share it with us?

Yes, this is the training spec file
detectnet_v2_train_resnet18_kitti.txt (5.9 KB)

For 3 GPUs, can you use a new result folder and retry? For example:
-r /workspace/detectnet_v2/experiment_dir_unpruned_3gpu

For 4 GPUs, can you also use a new result folder and retry? For example:
-r /workspace/detectnet_v2/experiment_dir_unpruned_4gpu
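In other words, keep the rest of your earlier command the same and only change -r (and --gpus), for example (assuming the same spec file and key as before):
# detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt -r /workspace/detectnet_v2/experiment_dir_unpruned_3gpu -k tlt_encode -n resnet18_detector --gpus 3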

Doing that with 4 GPUs, I get the following error:

2023-03-22 14:21:11,533 [INFO] __main__: Found 1400 samples in validation set
2023-03-22 14:21:11,533 [INFO] root: Rasterizing tensors.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:11,658 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:21:11,762 [INFO] root: Tensors rasterized.
2023-03-22 14:21:12,110 [INFO] root: Validation graph built.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:12,944 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:13,403 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:13,444 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:21:13,455 [INFO] root: Running training loop.
2023-03-22 14:21:13,456 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:21:13,456 [INFO] __main__: Scalars logged at every 10 steps
2023-03-22 14:21:13,456 [INFO] __main__: Images logged at every 2690 steps
INFO:tensorflow:Create CheckpointSaverHook.
2023-03-22 14:21:13,461 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:13,919 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-03-22 14:21:16,203 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:18,931 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:19,679 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-22 14:21:30,357 [INFO] tensorflow: Saving checkpoints for step-0.
5d5684b73250:6360:6921 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:6360:6921 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:6360:6921 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:6360:6921 [0] NCCL INFO P2P plugin IBext
5d5684b73250:6360:6921 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:6360:6921 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:6360:6921 [0] NCCL INFO Using network Socket
5d5684b73250:6360:6921 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:6360:6921 [0] NCCL INFO Channel 00/04 :    0   1   2   3
5d5684b73250:6360:6921 [0] NCCL INFO Channel 01/04 :    0   3   2   1
5d5684b73250:6360:6921 [0] NCCL INFO Channel 02/04 :    0   1   2   3
5d5684b73250:6360:6921 [0] NCCL INFO Channel 03/04 :    0   3   2   1
5d5684b73250:6360:6921 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
5d5684b73250:6360:6921 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:6360:6921 [0] NCCL INFO Channel 02 : 0[3b000] -> 1[5e000] via SHM/direct/direct
[5d5684b73250:6360 :0:6921] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:6363 :0:6931] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   6921) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000007b0ef ncclGroupEnd()  ???:0
 4 0x0000000000059e97 ncclGetUniqueId()  ???:0
 5 0x00000000000489b1 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x00000000000652a6 ncclRedOpDestroy()  ???:0
 8 0x000000000004ae3b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 9 0x000000000004b098 ncclCommInitRank()  ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm()  /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute()  /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce()  /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:06360] *** Process received signal ***
[5d5684b73250:06360] Signal: Bus error (7)
[5d5684b73250:06360] Signal code:  (-6)
[5d5684b73250:06360] Failing at address: 0x18d8
[5d5684b73250:06360] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f2d243b0090]
[5d5684b73250:06360] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7f2d244f8b41]
[5d5684b73250:06360] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7f2c15ce287d]
[5d5684b73250:06360] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7b0ef)[0x7f2c15ce80ef]
[5d5684b73250:06360] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e97)[0x7f2c15cc6e97]
[5d5684b73250:06360] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x489b1)[0x7f2c15cb59b1]
[5d5684b73250:06360] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a655)[0x7f2c15cb7655]
[5d5684b73250:06360] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x652a6)[0x7f2c15cd22a6]
[5d5684b73250:06360] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae3b)[0x7f2c15cb7e3b]
[5d5684b73250:06360] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7f2c15cb8098]
[5d5684b73250:06360] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7f2b6d7c1354]
[5d5684b73250:06360] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7f2b6d7c1581]
[5d5684b73250:06360] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7f2b6d7833cd]
[5d5684b73250:06360] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7f2b6d7837fc]
[5d5684b73250:06360] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7f2b6d75202d]
[5d5684b73250:06360] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f2d23718de4]
[5d5684b73250:06360] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2d24352609]
[5d5684b73250:06360] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2d2448c133]
[5d5684b73250:06360] *** End of error message ***
==== backtrace (tid:   6931) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000007b0ef ncclGroupEnd()  ???:0
 4 0x0000000000059e97 ncclGetUniqueId()  ???:0
 5 0x00000000000489b1 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x00000000000652a6 ncclRedOpDestroy()  ???:0
 8 0x000000000004ae3b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 9 0x000000000004b098 ncclCommInitRank()  ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm()  /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute()  /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce()  /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:06363] *** Process received signal ***
[5d5684b73250:06363] Signal: Bus error (7)
[5d5684b73250:06363] Signal code:  (-6)
[5d5684b73250:06363] Failing at address: 0x18db
[5d5684b73250:06363] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff118fff090]
[5d5684b73250:06363] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7ff119147b41]
[5d5684b73250:06363] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7ff00a93187d]
[5d5684b73250:06363] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7b0ef)[0x7ff00a9370ef]
[5d5684b73250:06363] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e97)[0x7ff00a915e97]
[5d5684b73250:06363] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x489b1)[0x7ff00a9049b1]
[5d5684b73250:06363] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a655)[0x7ff00a906655]
[5d5684b73250:06363] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x652a6)[0x7ff00a9212a6]
[5d5684b73250:06363] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae3b)[0x7ff00a906e3b]
[5d5684b73250:06363] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7ff00a907098]
[5d5684b73250:06363] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7ff005d87354]
[5d5684b73250:06363] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7ff005d87581]
[5d5684b73250:06363] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7ff005d493cd]
[5d5684b73250:06363] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7ff005d497fc]
[5d5684b73250:06363] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7ff005d1802d]
[5d5684b73250:06363] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7ff118367de4]
[5d5684b73250:06363] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7ff118fa1609]
[5d5684b73250:06363] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff1190db133]
[5d5684b73250:06363] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 5d5684b73250 exited on signal 7 (Bus error).

For 3 GPUs, I get this error:

2023-03-22 14:24:52,808 [INFO] root: Tensors rasterized.
2023-03-22 14:24:53,000 [INFO] __main__: Found 8600 samples in training set
2023-03-22 14:24:53,006 [INFO] root: Rasterizing tensors.
2023-03-22 14:24:53,224 [INFO] root: Tensors rasterized.
2023-03-22 14:24:53,493 [INFO] root: Training graph built.
2023-03-22 14:24:53,493 [INFO] root: Running training loop.
2023-03-22 14:24:53,493 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:53,493 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:53,493 [INFO] __main__: Images logged at every 0 steps
2023-03-22 14:24:54,763 [INFO] root: Training graph built.
2023-03-22 14:24:54,763 [INFO] root: Running training loop.
2023-03-22 14:24:54,763 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:54,763 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:54,763 [INFO] __main__: Images logged at every 0 steps
INFO:tensorflow:Graph was finalized.
2023-03-22 14:24:54,791 [INFO] tensorflow: Graph was finalized.
2023-03-22 14:24:56,127 [INFO] root: Training graph built.
2023-03-22 14:24:56,127 [INFO] root: Building validation graph.
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 80, io threads: 160, compute threads: 80, buffered batches: 4
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 1400, number of sources: 1, batch size per gpu: 4, steps: 350
WARNING:tensorflow:Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,140 [WARNING] tensorflow: Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Graph was finalized.
2023-03-22 14:24:56,145 [INFO] tensorflow: Graph was finalized.
2023-03-22 14:24:56,159 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2023-03-22 14:24:56,393 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2023-03-22 14:24:56,397 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2023-03-22 14:24:56,397 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,412 [WARNING] tensorflow: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,640 [INFO] __main__: Found 1400 samples in validation set
2023-03-22 14:24:56,640 [INFO] root: Rasterizing tensors.
2023-03-22 14:24:56,857 [INFO] root: Tensors rasterized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:24:57,104 [INFO] tensorflow: Running local_init_op.
2023-03-22 14:24:57,184 [INFO] root: Validation graph built.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:24:57,570 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:24:58,493 [INFO] root: Running training loop.
2023-03-22 14:24:58,494 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:58,494 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:58,494 [INFO] __main__: Images logged at every 3585 steps
INFO:tensorflow:Create CheckpointSaverHook.
2023-03-22 14:24:58,497 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:24:58,607 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:24:59,102 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-03-22 14:25:01,129 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:25:03,774 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:25:04,510 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-22 14:25:14,773 [INFO] tensorflow: Saving checkpoints for step-0.
5d5684b73250:8162:8585 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:8162:8585 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:8162:8585 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:8162:8585 [0] NCCL INFO P2P plugin IBext
5d5684b73250:8162:8585 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:8162:8585 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:8162:8585 [0] NCCL INFO Using network Socket
5d5684b73250:8162:8585 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:8162:8585 [0] NCCL INFO Channel 00/02 :    0   1   2
5d5684b73250:8162:8585 [0] NCCL INFO Channel 01/02 :    0   1   2
5d5684b73250:8162:8585 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
5d5684b73250:8162:8585 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:8162:8585 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:8162:8585 [0] NCCL INFO Connected all rings
[5d5684b73250:8165 :0:9402] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:8162 :0:8585] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   9402) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000006b246 ncclGroupEnd()  ???:0
 4 0x0000000000008609 start_thread()  ???:0
 5 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:08165] *** Process received signal ***
[5d5684b73250:08165] Signal: Bus error (7)
[5d5684b73250:08165] Signal code:  (-6)
[5d5684b73250:08165] Failing at address: 0x1fe5
[5d5684b73250:08165] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff4efdb0090]
[5d5684b73250:08165] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7ff4efef8b41]
[5d5684b73250:08165] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7ff3e16e287d]
[5d5684b73250:08165] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6b246)[0x7ff3e16d8246]
[5d5684b73250:08165] [ 4] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7ff4efd52609]
[5d5684b73250:08165] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff4efe8c133]
[5d5684b73250:08165] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node 5d5684b73250 exited on signal 7 (Bus error).

Can you run nccl-tests? Please run it inside the TAO docker (I think you are already logged in to the TAO docker).
Then, please run nccl-tests as below for 3 GPUs and 4 GPUs.
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
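If it is not already enabled in the container, NCCL's own logging can optionally be turned up for these runs to get more detail:
$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4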

Please share the log with us.

Please see the log below:

root@5d5684b73250:/workspace# cd nccl-tests/
root@5d5684b73250:/workspace/nccl-tests# make
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling  timer.cc                            > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/nccl-tests/build/common.o
Linking  /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Compiling  all_gather.cu                       > /workspace/nccl-tests/build/all_gather.o
Linking  /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Compiling  broadcast.cu                        > /workspace/nccl-tests/build/broadcast.o
Linking  /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Compiling  reduce_scatter.cu                   > /workspace/nccl-tests/build/reduce_scatter.o
Linking  /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Compiling  reduce.cu                           > /workspace/nccl-tests/build/reduce.o
Linking  /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Compiling  alltoall.cu                         > /workspace/nccl-tests/build/alltoall.o
Linking  /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Compiling  scatter.cu                          > /workspace/nccl-tests/build/scatter.o
Linking  /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Compiling  gather.cu                           > /workspace/nccl-tests/build/gather.o
Linking  /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Compiling  sendrecv.cu                         > /workspace/nccl-tests/build/sendrecv.o
Linking  /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Compiling  hypercube.cu                        > /workspace/nccl-tests/build/hypercube.o
Linking  /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'
root@5d5684b73250:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  15332 on 5d5684b73250 device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid  15332 on 5d5684b73250 device  1 [0x5e] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid  15332 on 5d5684b73250 device  2 [0x86] NVIDIA RTX A6000
#  Rank  3 Group  0 Pid  15332 on 5d5684b73250 device  3 [0xaf] NVIDIA RTX A6000
5d5684b73250:15332:15332 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:15332:15332 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:15332:15332 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:15332:15332 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:15332:15332 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:15332:15332 [3] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:15332:15346 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:15332:15346 [1] NCCL INFO P2P plugin IBext
5d5684b73250:15332:15346 [1] NCCL INFO NET/IB : No device found.
5d5684b73250:15332:15346 [1] NCCL INFO NET/IB : No device found.
5d5684b73250:15332:15346 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:15332:15346 [1] NCCL INFO Using network Socket
5d5684b73250:15332:15348 [3] NCCL INFO Using network Socket
5d5684b73250:15332:15347 [2] NCCL INFO Using network Socket
5d5684b73250:15332:15345 [0] NCCL INFO Using network Socket
5d5684b73250:15332:15346 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
5d5684b73250:15332:15347 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
5d5684b73250:15332:15348 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
5d5684b73250:15332:15345 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:15332:15348 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 2/-1/-1->3->0 [2] -1/-1/-1->3->2 [3] 2/-1/-1->3->0
5d5684b73250:15332:15347 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 1/-1/-1->2->3 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3
5d5684b73250:15332:15345 [0] NCCL INFO Channel 00/04 :    0   1   2   3
5d5684b73250:15332:15345 [0] NCCL INFO Channel 01/04 :    0   3   2   1
5d5684b73250:15332:15346 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->2
5d5684b73250:15332:15345 [0] NCCL INFO Channel 02/04 :    0   1   2   3
5d5684b73250:15332:15345 [0] NCCL INFO Channel 03/04 :    0   3   2   1
5d5684b73250:15332:15345 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
5d5684b73250:15332:15347 [2] NCCL INFO Channel 00 : 2[86000] -> 3[af000] via SHM/direct/direct
5d5684b73250:15332:15345 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:15332:15347 [2] NCCL INFO Channel 02 : 2[86000] -> 3[af000] via SHM/direct/direct
5d5684b73250:15332:15345 [0] NCCL INFO Channel 02 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:15332:15348 [3] NCCL INFO Channel 00/0 : 3[af000] -> 0[3b000] via P2P/direct pointer
5d5684b73250:15332:15348 [3] NCCL INFO Channel 02/0 : 3[af000] -> 0[3b000] via P2P/direct pointer
5d5684b73250:15332:15346 [1] NCCL INFO Channel 00/0 : 1[5e000] -> 2[86000] via P2P/direct pointer
5d5684b73250:15332:15348 [3] NCCL INFO Channel 01 : 3[af000] -> 2[86000] via SHM/direct/direct
5d5684b73250:15332:15348 [3] NCCL INFO Channel 03 : 3[af000] -> 2[86000] via SHM/direct/direct
5d5684b73250:15332:15346 [1] NCCL INFO Channel 02/0 : 1[5e000] -> 2[86000] via P2P/direct pointer
5d5684b73250:15332:15346 [1] NCCL INFO Channel 01 : 1[5e000] -> 0[3b000] via SHM/direct/direct
5d5684b73250:15332:15346 [1] NCCL INFO Channel 03 : 1[5e000] -> 0[3b000] via SHM/direct/direct
[1679561506.255767] [5d5684b73250:15332:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1679561506.255769] [5d5684b73250:15332:1]           debug.c:1289 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[5d5684b73250:15332:0:15347] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:15332:1:15345] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:  15345) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000007b0ef ncclGroupEnd()  ???:0
 4 0x0000000000059e97 ncclGetUniqueId()  ???:0
 5 0x00000000000489b1 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x0000000000063dcc ncclRedOpDestroy()  ???:0
 8 0x0000000000008609 start_thread()  ???:0
 9 0x000000000011f133 clone()  ???:0
=================================
Bus error (core dumped)
root@5d5684b73250:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  15353 on 5d5684b73250 device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid  15353 on 5d5684b73250 device  1 [0x5e] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid  15353 on 5d5684b73250 device  2 [0x86] NVIDIA RTX A6000
5d5684b73250:15353:15353 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:15353:15353 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:15353:15353 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:15353:15353 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:15353:15353 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:15353:15353 [2] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:15353:15364 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:15353:15364 [0] NCCL INFO P2P plugin IBext
5d5684b73250:15353:15364 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:15353:15364 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:15353:15364 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:15353:15364 [0] NCCL INFO Using network Socket
5d5684b73250:15353:15365 [1] NCCL INFO Using network Socket
5d5684b73250:15353:15366 [2] NCCL INFO Using network Socket
5d5684b73250:15353:15365 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
5d5684b73250:15353:15364 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:15353:15366 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
5d5684b73250:15353:15365 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
5d5684b73250:15353:15364 [0] NCCL INFO Channel 00/02 :    0   1   2
5d5684b73250:15353:15364 [0] NCCL INFO Channel 01/02 :    0   1   2
5d5684b73250:15353:15366 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
5d5684b73250:15353:15364 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
5d5684b73250:15353:15366 [2] NCCL INFO Channel 00 : 2[86000] -> 0[3b000] via SHM/direct/direct
5d5684b73250:15353:15366 [2] NCCL INFO Channel 01 : 2[86000] -> 0[3b000] via SHM/direct/direct
5d5684b73250:15353:15365 [1] NCCL INFO Channel 00/0 : 1[5e000] -> 2[86000] via P2P/direct pointer
5d5684b73250:15353:15364 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:15353:15364 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:15353:15365 [1] NCCL INFO Channel 01/0 : 1[5e000] -> 2[86000] via P2P/direct pointer
5d5684b73250:15353:15365 [1] NCCL INFO Connected all rings
5d5684b73250:15353:15364 [0] NCCL INFO Connected all rings
5d5684b73250:15353:15366 [2] NCCL INFO Connected all rings
5d5684b73250:15353:15366 [2] NCCL INFO Channel 00/0 : 2[86000] -> 1[5e000] via P2P/direct pointer
[5d5684b73250:15353:0:15364] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)

So, there is an issue even when running nccl-tests.
Could you update NCCL inside the TAO container?

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo apt-get update
$ sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0

Then run the above-mentioned nccl-tests again? Thanks.
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
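Optionally, you can also confirm afterwards which libnccl the dynamic loader resolves inside the container:
$ ldconfig -p | grep libnccl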

I ran the commands inside the same docker container.

I still get the error:

root@5d5684b73250:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  17487 on 5d5684b73250 device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid  17487 on 5d5684b73250 device  1 [0x5e] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid  17487 on 5d5684b73250 device  2 [0x86] NVIDIA RTX A6000
#  Rank  3 Group  0 Pid  17487 on 5d5684b73250 device  3 [0xaf] NVIDIA RTX A6000
5d5684b73250:17487:17487 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker
5d5684b73250:17487:17487 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:17487:17487 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:17487:17487 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:17487:17487 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:17487:17487 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:17487:17487 [3] NCCL INFO cudaDriverVersion 12000
NCCL version 2.17.1+cuda12.0
5d5684b73250:17487:17503 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:17487:17503 [3] NCCL INFO P2P plugin IBext
5d5684b73250:17487:17503 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker
5d5684b73250:17487:17503 [3] NCCL INFO NET/IB : No device found.
5d5684b73250:17487:17503 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker
5d5684b73250:17487:17503 [3] NCCL INFO NET/IB : No device found.
5d5684b73250:17487:17503 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker
5d5684b73250:17487:17503 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:17487:17503 [3] NCCL INFO Using network Socket
5d5684b73250:17487:17500 [0] NCCL INFO Using network Socket
5d5684b73250:17487:17502 [2] NCCL INFO Using network Socket
5d5684b73250:17487:17501 [1] NCCL INFO Using network Socket
5d5684b73250:17487:17501 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
5d5684b73250:17487:17503 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
5d5684b73250:17487:17502 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
5d5684b73250:17487:17500 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:17487:17503 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 2/-1/-1->3->0 [2] -1/-1/-1->3->2 [3] 2/-1/-1->3->0
5d5684b73250:17487:17503 [3] NCCL INFO P2P Chunksize set to 524288
5d5684b73250:17487:17502 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 1/-1/-1->2->3 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3
5d5684b73250:17487:17501 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->2
5d5684b73250:17487:17501 [1] NCCL INFO P2P Chunksize set to 524288
5d5684b73250:17487:17502 [2] NCCL INFO P2P Chunksize set to 524288
5d5684b73250:17487:17500 [0] NCCL INFO Channel 00/04 :    0   1   2   3
5d5684b73250:17487:17500 [0] NCCL INFO Channel 01/04 :    0   3   2   1
5d5684b73250:17487:17500 [0] NCCL INFO Channel 02/04 :    0   1   2   3
5d5684b73250:17487:17500 [0] NCCL INFO Channel 03/04 :    0   3   2   1
5d5684b73250:17487:17500 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
5d5684b73250:17487:17500 [0] NCCL INFO P2P Chunksize set to 524288
5d5684b73250:17487:17503 [3] NCCL INFO Channel 00/0 : 3[af000] -> 0[3b000] via P2P/direct pointer
5d5684b73250:17487:17501 [1] NCCL INFO Channel 00/0 : 1[5e000] -> 2[86000] via P2P/direct pointer
5d5684b73250:17487:17503 [3] NCCL INFO Channel 02/0 : 3[af000] -> 0[3b000] via P2P/direct pointer
5d5684b73250:17487:17501 [1] NCCL INFO Channel 02/0 : 1[5e000] -> 2[86000] via P2P/direct pointer
[5d5684b73250:17487:0:17502] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@5d5684b73250:/workspace/nccl-tests# /build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
bash: /build/all_reduce_perf: No such file or directory
root@5d5684b73250:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  17509 on 5d5684b73250 device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid  17509 on 5d5684b73250 device  1 [0x5e] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid  17509 on 5d5684b73250 device  2 [0x86] NVIDIA RTX A6000
5d5684b73250:17509:17509 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker
5d5684b73250:17509:17509 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:17509:17509 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:17509:17509 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:17509:17509 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:17509:17509 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:17509:17509 [2] NCCL INFO cudaDriverVersion 12000
NCCL version 2.17.1+cuda12.0
5d5684b73250:17509:17522 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:17509:17522 [2] NCCL INFO P2P plugin IBext
5d5684b73250:17509:17522 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker
5d5684b73250:17509:17522 [2] NCCL INFO NET/IB : No device found.
5d5684b73250:17509:17522 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker
5d5684b73250:17509:17522 [2] NCCL INFO NET/IB : No device found.
5d5684b73250:17509:17522 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker
5d5684b73250:17509:17522 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:17509:17522 [2] NCCL INFO Using network Socket
5d5684b73250:17509:17520 [0] NCCL INFO Using network Socket
5d5684b73250:17509:17521 [1] NCCL INFO Using network Socket
5d5684b73250:17509:17520 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:17509:17521 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
5d5684b73250:17509:17522 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
5d5684b73250:17509:17521 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
5d5684b73250:17509:17521 [1] NCCL INFO P2P Chunksize set to 524288
5d5684b73250:17509:17522 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
5d5684b73250:17509:17522 [2] NCCL INFO P2P Chunksize set to 524288
5d5684b73250:17509:17520 [0] NCCL INFO Channel 00/02 :    0   1   2
5d5684b73250:17509:17520 [0] NCCL INFO Channel 01/02 :    0   1   2
5d5684b73250:17509:17520 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
5d5684b73250:17509:17520 [0] NCCL INFO P2P Chunksize set to 524288
[5d5684b73250:17509:0:17521] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)

Could you share the results of the following commands?
$ nvidia-smi topo -m
$ ifconfig -s
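The topology matrix shows how each pair of GPUs is connected (for example via PCIe switch, host bridge, or across CPU sockets), which should help check whether the P2P/SHM paths in the NCCL log are expected for this machine.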