More than 1 GPU not working using TAO train

Please provide the following information when requesting support.

• Hardware: 4x RTX A6000 GPUs
• Network Type: DetectNet_v2

Hello, I’m trying to train my model using 3 GPUs instead of 1. However, when I run the tao train command, I get the following error:

`Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node fc3a502973d9 exited on signal 7 (Bus error).

Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-03-01 08:57:18,531 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.`

I don’t understand what the error is telling me. I’ve attached the training log and error log for you to view.

training_log (5.9 KB)

error_log (93.9 KB)

Thanks.

From the error log:

2023-03-01 08:56:34,799 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-01 08:56:42,278 [INFO] tensorflow: Saving checkpoints for step-0.
fc3a502973d9:255:690 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
fc3a502973d9:255:690 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
fc3a502973d9:255:690 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
fc3a502973d9:255:690 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
fc3a502973d9:255:690 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
fc3a502973d9:255:690 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.1+cuda11.8
fc3a502973d9:255:690 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
fc3a502973d9:255:690 [0] NCCL INFO P2P plugin IBext
fc3a502973d9:255:690 [0] NCCL INFO NET/IB : No device found.
fc3a502973d9:255:690 [0] NCCL INFO NET/IB : No device found.
fc3a502973d9:255:690 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
fc3a502973d9:255:690 [0] NCCL INFO Using network Socket
fc3a502973d9:255:690 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
fc3a502973d9:255:690 [0] NCCL INFO Channel 00/02 :    0   1   2
fc3a502973d9:255:690 [0] NCCL INFO Channel 01/02 :    0   1   2
fc3a502973d9:255:690 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
fc3a502973d9:255:690 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
fc3a502973d9:255:690 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via SHM/direct/direct
fc3a502973d9:255:690 [0] NCCL INFO Connected all rings
[fc3a502973d9:255  :0:690] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:    690) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x00000000000755bd ncclGroupEnd()  ???:0
 3 0x000000000007a74f ncclGroupEnd()  ???:0
 4 0x0000000000059e67 ncclGetUniqueId()  ???:0
 5 0x0000000000048b3b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a5c2 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x0000000000064f66 ncclRedOpDestroy()  ???:0
 8 0x000000000004ae0b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 9 0x000000000004b068 ncclCommInitRank()  ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm()  /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute()  /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce()  /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[fc3a502973d9:00255] *** Process received signal ***
[fc3a502973d9:00255] Signal: Bus error (7)
[fc3a502973d9:00255] Signal code:  (-6)
[fc3a502973d9:00255] Failing at address: 0x3e8000000ff
[fc3a502973d9:00255] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fe894709090]
[fc3a502973d9:00255] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7fe894851b41]
[fc3a502973d9:00255] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x755bd)[0x7fe78606d5bd]
[fc3a502973d9:00255] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7a74f)[0x7fe78607274f]
[fc3a502973d9:00255] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e67)[0x7fe786051e67]
[fc3a502973d9:00255] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x48b3b)[0x7fe786040b3b]
[fc3a502973d9:00255] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a5c2)[0x7fe7860425c2]
[fc3a502973d9:00255] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x64f66)[0x7fe78605cf66]
[fc3a502973d9:00255] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae0b)[0x7fe786042e0b]
[fc3a502973d9:00255] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7fe786043068]
[fc3a502973d9:00255] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7fe769282354]
[fc3a502973d9:00255] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7fe769282581]
[fc3a502973d9:00255] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fe7692443cd]
[fc3a502973d9:00255] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7fe7692447fc]
[fc3a502973d9:00255] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7fe76921302d]
[fc3a502973d9:00255] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fe893a71de4]
[fc3a502973d9:00255] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fe8946ab609]
[fc3a502973d9:00255] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fe8947e5133]
[fc3a502973d9:00255] *** End of error message ***

You are running with WSL, right?

Can you share the result of $ nvidia-smi?

Thu Mar  2 15:21:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:3B:00.0 Off |                  Off |
|  0%   57C    P2   286W / 300W |   9916MiB / 49140MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:5E:00.0 Off |                  Off |
|  0%   45C    P8    20W / 300W |     15MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    Off  | 00000000:86:00.0 Off |                  Off |
|  0%   45C    P8    23W / 300W |     14MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    Off  | 00000000:AF:00.0  On |                  Off |
|  0%   47C    P8    30W / 300W |    641MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2288      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2855      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    798840      C   /usr/bin/python3.6                260MiB |
|    0   N/A  N/A    799061      C   python3.6                        9638MiB |
|    1   N/A  N/A      2288      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2855      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2288      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2855      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2288      G   /usr/lib/xorg/Xorg                110MiB |
|    3   N/A  N/A      2855      G   /usr/lib/xorg/Xorg                264MiB |
|    3   N/A  N/A      2985      G   /usr/bin/gnome-shell               92MiB |
|    3   N/A  N/A    642139      G   /usr/lib/firefox/firefox          157MiB |
+-----------------------------------------------------------------------------+

I’m not using WSL; I’m using an Ubuntu computer.

Thanks.

To narrow down, could you try using the 520 driver instead?

sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean

sudo apt install nvidia-driver-520
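
If apt still pulls in the 525 driver after this, you can check which driver packages apt can actually see and which driver ends up loaded; a quick sketch (package names assumed, adjust to your setup):

# Show which nvidia-driver versions apt can resolve and from which repository
$ apt-cache policy nvidia-driver-520 nvidia-driver-525
# Confirm the driver version that is actually loaded (after a reboot)
$ nvidia-smi --query-gpu=driver_version --format=csv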

Also, to narrow down, can you run inside the tao docker and update NCCL?
Refer to the steps mentioned in https://developer.nvidia.com/nccl/nccl-download, then run the following command to install NCCL:

For Ubuntu: sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0
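
For reference, the network-repository setup described on that page is roughly as follows for Ubuntu 20.04 (a sketch; the keyring file name and package versions may have changed):

# Add the NVIDIA CUDA network repository, which provides the libnccl2 packages
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo apt-get update
# The pinned NCCL install should then resolve
$ sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0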

Also, which Ubuntu version are you running: 20.04 or 22.04?

Hello,

So I tried installing nvidia-driver-520; however, it just downloads driver 525 instead. I’m not sure why.

So I followed the instructions you mentioned.

When I run sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0

I get

sudo apt install libnccl2=2.17.1-1+cuda12.0 libnccl-dev=2.17.1-1+cuda12.0
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package libnccl2
E: Unable to locate package libnccl-dev

I am using Ubuntu 20.04.

Can you run
$ apt-get update
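
After the update, you can verify that the NCCL packages are visible before installing; for example:

# If this still reports the packages as unknown, the CUDA/NCCL repository is not configured yet
$ apt-cache policy libnccl2 libnccl-dev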

I tried the command and I still get a similar error (shown in the attached log):
error_log (93.9 KB)

Sorry for the late reply. Could you please pull the latest docker image and run with it?

docker pull nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5

Then log in to the docker container, similar to below.
$ docker run --runtime=nvidia -it --rm -v yourlocalfolder:/workspace nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then, run the command without the “tao” prefix.
# detectnet_v2 train blabla

Okay, so I switched from using the tao CLI tool to running from the docker container directly.

I successfully created the tfrecords from the docker container directly.

For training, this is the command I used:

detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt  -r /workspace/detectnet_v2/experiment_dir_unpruned  -k tlt_encode  -n resnet18_detector  --gpus 4 

The result I got is shown below; it still didn’t run with 4 GPUs.

 File "<frozen iva.detectnet_v2.scripts.train>", line 1011, in <module>
  File "<decorator-gen-117>", line 2, in main
  File "<frozen iva.detectnet_v2.utilities.timer>", line 46, in wrapped_fn
  File "<frozen iva.detectnet_v2.scripts.train>", line 994, in main
  File "<frozen iva.detectnet_v2.scripts.train>", line 853, in run_experiment
  File "<frozen iva.detectnet_v2.scripts.train>", line 680, in train_gridbox
  File "<frozen iva.detectnet_v2.training.training_proto_utilities>", line 109, in build_learning_rate_schedule
  File "<frozen moduluspy.modulus.hooks.utils>", line 40, in get_softstart_annealing_learning_rate
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/tf_should_use.py", line 198, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 173, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1235, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 171, in true_assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 74, in _assert
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()



INFO:tensorflow:Saving checkpoints for step-64500.
2023-03-22 09:53:44,010 [INFO] tensorflow: Saving checkpoints for step-64500.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21229,1],2]
  Exit code:    1
--------------------------------------------------------------------------
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

To narrow down, did you ever run successfully with 1 GPU?

Yes, I’ve even tried with 2 GPUs and that was okay. With any more GPUs, an error occurs.

So, may I conclude that:
1 GPU → no error
2 GPUs → no error
3 GPUs → has error
4 GPUs → has error

Yes, correct.

Are all the experiments using the same spec file? Could you share it with us?

Yes, this is the training spec file:
detectnet_v2_train_resnet18_kitti.txt (5.9 KB)

For 3 GPUs, can you use a new result folder and retry? For example:
-r /workspace/detectnet_v2/experiment_dir_unpruned_3gpu

For 4 GPUs, can you also use a new result folder and retry? For example:
-r /workspace/detectnet_v2/experiment_dir_unpruned_4gpu
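
Reusing your earlier training command, the retries would look roughly like below (paths taken from your previous posts):

# 3-GPU retry writing to a fresh result folder
$ detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt -r /workspace/detectnet_v2/experiment_dir_unpruned_3gpu -k tlt_encode -n resnet18_detector --gpus 3
# 4-GPU retry writing to a fresh result folder
$ detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt -r /workspace/detectnet_v2/experiment_dir_unpruned_4gpu -k tlt_encode -n resnet18_detector --gpus 4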

So doing that with 4 GPUs, I get the following error:

2023-03-22 14:21:11,533 [INFO] __main__: Found 1400 samples in validation set
2023-03-22 14:21:11,533 [INFO] root: Rasterizing tensors.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:11,658 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:21:11,762 [INFO] root: Tensors rasterized.
2023-03-22 14:21:12,110 [INFO] root: Validation graph built.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:12,944 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:13,403 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:13,444 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:21:13,455 [INFO] root: Running training loop.
2023-03-22 14:21:13,456 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:21:13,456 [INFO] __main__: Scalars logged at every 10 steps
2023-03-22 14:21:13,456 [INFO] __main__: Images logged at every 2690 steps
INFO:tensorflow:Create CheckpointSaverHook.
2023-03-22 14:21:13,461 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:13,919 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-03-22 14:21:16,203 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:18,931 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:19,679 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-22 14:21:30,357 [INFO] tensorflow: Saving checkpoints for step-0.
5d5684b73250:6360:6921 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:6360:6921 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:6360:6921 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:6360:6921 [0] NCCL INFO P2P plugin IBext
5d5684b73250:6360:6921 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:6360:6921 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:6360:6921 [0] NCCL INFO Using network Socket
5d5684b73250:6360:6921 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:6360:6921 [0] NCCL INFO Channel 00/04 :    0   1   2   3
5d5684b73250:6360:6921 [0] NCCL INFO Channel 01/04 :    0   3   2   1
5d5684b73250:6360:6921 [0] NCCL INFO Channel 02/04 :    0   1   2   3
5d5684b73250:6360:6921 [0] NCCL INFO Channel 03/04 :    0   3   2   1
5d5684b73250:6360:6921 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
5d5684b73250:6360:6921 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:6360:6921 [0] NCCL INFO Channel 02 : 0[3b000] -> 1[5e000] via SHM/direct/direct
[5d5684b73250:6360 :0:6921] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:6363 :0:6931] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   6921) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000007b0ef ncclGroupEnd()  ???:0
 4 0x0000000000059e97 ncclGetUniqueId()  ???:0
 5 0x00000000000489b1 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x00000000000652a6 ncclRedOpDestroy()  ???:0
 8 0x000000000004ae3b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 9 0x000000000004b098 ncclCommInitRank()  ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm()  /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute()  /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce()  /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:06360] *** Process received signal ***
[5d5684b73250:06360] Signal: Bus error (7)
[5d5684b73250:06360] Signal code:  (-6)
[5d5684b73250:06360] Failing at address: 0x18d8
[5d5684b73250:06360] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f2d243b0090]
[5d5684b73250:06360] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7f2d244f8b41]
[5d5684b73250:06360] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7f2c15ce287d]
[5d5684b73250:06360] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7b0ef)[0x7f2c15ce80ef]
[5d5684b73250:06360] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e97)[0x7f2c15cc6e97]
[5d5684b73250:06360] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x489b1)[0x7f2c15cb59b1]
[5d5684b73250:06360] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a655)[0x7f2c15cb7655]
[5d5684b73250:06360] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x652a6)[0x7f2c15cd22a6]
[5d5684b73250:06360] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae3b)[0x7f2c15cb7e3b]
[5d5684b73250:06360] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7f2c15cb8098]
[5d5684b73250:06360] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7f2b6d7c1354]
[5d5684b73250:06360] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7f2b6d7c1581]
[5d5684b73250:06360] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7f2b6d7833cd]
[5d5684b73250:06360] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7f2b6d7837fc]
[5d5684b73250:06360] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7f2b6d75202d]
[5d5684b73250:06360] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f2d23718de4]
[5d5684b73250:06360] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2d24352609]
[5d5684b73250:06360] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2d2448c133]
[5d5684b73250:06360] *** End of error message ***
==== backtrace (tid:   6931) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000007b0ef ncclGroupEnd()  ???:0
 4 0x0000000000059e97 ncclGetUniqueId()  ???:0
 5 0x00000000000489b1 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x00000000000652a6 ncclRedOpDestroy()  ???:0
 8 0x000000000004ae3b ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 9 0x000000000004b098 ncclCommInitRank()  ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm()  /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute()  /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce()  /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:06363] *** Process received signal ***
[5d5684b73250:06363] Signal: Bus error (7)
[5d5684b73250:06363] Signal code:  (-6)
[5d5684b73250:06363] Failing at address: 0x18db
[5d5684b73250:06363] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff118fff090]
[5d5684b73250:06363] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7ff119147b41]
[5d5684b73250:06363] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7ff00a93187d]
[5d5684b73250:06363] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7b0ef)[0x7ff00a9370ef]
[5d5684b73250:06363] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e97)[0x7ff00a915e97]
[5d5684b73250:06363] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x489b1)[0x7ff00a9049b1]
[5d5684b73250:06363] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a655)[0x7ff00a906655]
[5d5684b73250:06363] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x652a6)[0x7ff00a9212a6]
[5d5684b73250:06363] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae3b)[0x7ff00a906e3b]
[5d5684b73250:06363] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7ff00a907098]
[5d5684b73250:06363] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7ff005d87354]
[5d5684b73250:06363] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7ff005d87581]
[5d5684b73250:06363] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7ff005d493cd]
[5d5684b73250:06363] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7ff005d497fc]
[5d5684b73250:06363] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7ff005d1802d]
[5d5684b73250:06363] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7ff118367de4]
[5d5684b73250:06363] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7ff118fa1609]
[5d5684b73250:06363] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff1190db133]
[5d5684b73250:06363] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 5d5684b73250 exited on signal 7 (Bus error).

For 3 GPUs, I get this error:

2023-03-22 14:24:52,808 [INFO] root: Tensors rasterized.
2023-03-22 14:24:53,000 [INFO] __main__: Found 8600 samples in training set
2023-03-22 14:24:53,006 [INFO] root: Rasterizing tensors.
2023-03-22 14:24:53,224 [INFO] root: Tensors rasterized.
2023-03-22 14:24:53,493 [INFO] root: Training graph built.
2023-03-22 14:24:53,493 [INFO] root: Running training loop.
2023-03-22 14:24:53,493 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:53,493 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:53,493 [INFO] __main__: Images logged at every 0 steps
2023-03-22 14:24:54,763 [INFO] root: Training graph built.
2023-03-22 14:24:54,763 [INFO] root: Running training loop.
2023-03-22 14:24:54,763 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:54,763 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:54,763 [INFO] __main__: Images logged at every 0 steps
INFO:tensorflow:Graph was finalized.
2023-03-22 14:24:54,791 [INFO] tensorflow: Graph was finalized.
2023-03-22 14:24:56,127 [INFO] root: Training graph built.
2023-03-22 14:24:56,127 [INFO] root: Building validation graph.
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 80, io threads: 160, compute threads: 80, buffered batches: 4
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 1400, number of sources: 1, batch size per gpu: 4, steps: 350
WARNING:tensorflow:Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,140 [WARNING] tensorflow: Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Graph was finalized.
2023-03-22 14:24:56,145 [INFO] tensorflow: Graph was finalized.
2023-03-22 14:24:56,159 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2023-03-22 14:24:56,393 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2023-03-22 14:24:56,397 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2023-03-22 14:24:56,397 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,412 [WARNING] tensorflow: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,640 [INFO] __main__: Found 1400 samples in validation set
2023-03-22 14:24:56,640 [INFO] root: Rasterizing tensors.
2023-03-22 14:24:56,857 [INFO] root: Tensors rasterized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:24:57,104 [INFO] tensorflow: Running local_init_op.
2023-03-22 14:24:57,184 [INFO] root: Validation graph built.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:24:57,570 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:24:58,493 [INFO] root: Running training loop.
2023-03-22 14:24:58,494 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:58,494 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:58,494 [INFO] __main__: Images logged at every 3585 steps
INFO:tensorflow:Create CheckpointSaverHook.
2023-03-22 14:24:58,497 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:24:58,607 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:24:59,102 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-03-22 14:25:01,129 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:25:03,774 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:25:04,510 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-22 14:25:14,773 [INFO] tensorflow: Saving checkpoints for step-0.
5d5684b73250:8162:8585 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:8162:8585 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:8162:8585 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:8162:8585 [0] NCCL INFO P2P plugin IBext
5d5684b73250:8162:8585 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:8162:8585 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:8162:8585 [0] NCCL INFO Using network Socket
5d5684b73250:8162:8585 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:8162:8585 [0] NCCL INFO Channel 00/02 :    0   1   2
5d5684b73250:8162:8585 [0] NCCL INFO Channel 01/02 :    0   1   2
5d5684b73250:8162:8585 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
5d5684b73250:8162:8585 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:8162:8585 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:8162:8585 [0] NCCL INFO Connected all rings
[5d5684b73250:8165 :0:9402] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:8162 :0:8585] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   9402) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000006b246 ncclGroupEnd()  ???:0
 4 0x0000000000008609 start_thread()  ???:0
 5 0x000000000011f133 clone()  ???:0
=================================
[5d5684b73250:08165] *** Process received signal ***
[5d5684b73250:08165] Signal: Bus error (7)
[5d5684b73250:08165] Signal code:  (-6)
[5d5684b73250:08165] Failing at address: 0x1fe5
[5d5684b73250:08165] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff4efdb0090]
[5d5684b73250:08165] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7ff4efef8b41]
[5d5684b73250:08165] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7ff3e16e287d]
[5d5684b73250:08165] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6b246)[0x7ff3e16d8246]
[5d5684b73250:08165] [ 4] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7ff4efd52609]
[5d5684b73250:08165] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff4efe8c133]
[5d5684b73250:08165] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node 5d5684b73250 exited on signal 7 (Bus error).

Can you run nccl-tests? Please run it inside the tao docker (I think you are already logged in to the tao docker).
Then, please run nccl-tests as below for 3 GPUs or 4 GPUs.
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
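
If the build cannot locate the CUDA toolkit, or if you want more verbose NCCL output during the test, the following may help (a sketch; the CUDA path inside the container may differ):

# Point the nccl-tests build at the CUDA toolkit explicitly
$ make CUDA_HOME=/usr/local/cuda
# Re-run the test with verbose NCCL logging for debugging
$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3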

Please share the log with us.