YOLOv4 multi-GPU training with Darknet architecture encounters a problem

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): NVIDIA RTX A4000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): YOLOv4
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here): nvidia/tao/tao-toolkit:4.0.0-tf2.9.1

3985fa320879:182:280 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
3985fa320879:182:280 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.15.1+cuda11.8
3985fa320879:182:280 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
3985fa320879:182:280 [0] NCCL INFO P2P plugin IBext
3985fa320879:182:280 [0] NCCL INFO NET/IB : No device found.
3985fa320879:182:280 [0] NCCL INFO NET/IB : No device found.
3985fa320879:182:280 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
3985fa320879:182:280 [0] NCCL INFO Using network Socket
3985fa320879:182:280 [0] NCCL INFO Setting affinity for GPU 0 to 0f0f
3985fa320879:182:280 [0] NCCL INFO Channel 00/02 : 0 1
3985fa320879:182:280 [0] NCCL INFO Channel 01/02 : 0 1
3985fa320879:182:280 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
3985fa320879:182:280 [0] NCCL INFO Channel 00/0 : 0[8000] -> 1[9000] via P2P/IPC
3985fa320879:182:280 [0] NCCL INFO Channel 01/0 : 0[8000] -> 1[9000] via P2P/IPC
3985fa320879:182:280 [0] NCCL INFO Connected all rings
3985fa320879:182:280 [0] NCCL INFO Connected all trees
3985fa320879:182:280 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
3985fa320879:182:280 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
3985fa320879:182:280 [0] NCCL INFO comm 0x7fa80bbc2c60 rank 0 nranks 2 cudaDev 0 busId 8000 - Init COMPLETE
1/5934 […] - ETA: 56:01:36 - loss: 14555.9795[3985fa320879:182 :0:280] cma_ep.c:81 process_vm_writev(pid=183 {0x7fa227338000,21504}-->{0x7f89fef35500,21504}) returned -1: Operation not permitted
==== backtrace (tid: 280) ====
0 0x00000000000039f2 uct_cma_ep_tx_error() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:81
1 0x0000000000003d66 uct_cma_ep_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:114
2 0x000000000001e209 uct_scopy_ep_progress_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/base/scopy_ep.c:151
3 0x00000000000516d6 ucs_arbiter_dispatch_nonempty() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.c:321
4 0x000000000001dcf1 ucs_arbiter_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.h:386
5 0x0000000000052467 ucs_callbackq_slow_proxy() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.c:404
6 0x000000000004be9a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.h:211
7 0x000000000004be9a uct_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/api/uct.h:2647
8 0x000000000004be9a ucp_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucp/core/ucp_worker.c:2804
9 0x0000000000037144 opal_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/runtime/opal_progress.c:231
10 0x000000000003dc05 ompi_sync_wait_mt() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/threads/wait_sync.c:85
11 0x0000000000055fba ompi_request_default_wait_all() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/request/req_wait.c:234
12 0x0000000000093949 ompi_coll_base_bcast_intra_basic_linear() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:679
13 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
14 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:114
15 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:41
16 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:395
17 0x00000000001055c2 horovod::common::TensorShape::~TensorShape() /opt/horovod/horovod/common/ops/…/common.h:234
18 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:396
19 0x00000000000da52d horovod::common::OperationManager::ExecuteBroadcast() /opt/horovod/horovod/common/ops/operation_manager.cc:66
20 0x00000000000da901 horovod::common::OperationManager::ExecuteOperation() /opt/horovod/horovod/common/ops/operation_manager.cc:116
21 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:297
22 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=() /usr/include/c++/9/bits/shared_ptr_base.h:1265
23 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=() /usr/include/c++/9/bits/shared_ptr.h:335
24 0x00000000000a902d horovod::common::Event::operator=() /opt/horovod/horovod/common/common.h:185
25 0x00000000000a902d horovod::common::Status::operator=() /opt/horovod/horovod/common/common.h:197
26 0x00000000000a902d PerformOperation() /opt/horovod/horovod/common/operations.cc:297
27 0x00000000000a902d RunLoopOnce() /opt/horovod/horovod/common/operations.cc:787
28 0x00000000000a902d BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:651
29 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
30 0x0000000000008609 start_thread() ???:0
31 0x000000000011f133 clone() ???:0

[3985fa320879:00182] *** Process received signal ***
[3985fa320879:00182] Signal: Aborted (6)
[3985fa320879:00182] Signal code: (-6)
[3985fa320879:00182] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fa8f2ba2090]
[3985fa320879:00182] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fa8f2ba200b]
[3985fa320879:00182] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fa8f2b81859]
[3985fa320879:00182] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(+0x5a7dd)[0x7fa80d65f7dd]
[3985fa320879:00182] [ 4] /opt/hpcx/ucx/lib/libucs.so.0(+0x5fdc2)[0x7fa80d664dc2]
[3985fa320879:00182] [ 5] /opt/hpcx/ucx/lib/libucs.so.0(ucs_log_dispatch+0xe4)[0x7fa80d665194]
[3985fa320879:00182] [ 6] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(+0x39f2)[0x7fa80c4a69f2]
[3985fa320879:00182] [ 7] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x186)[0x7fa80c4a6d66]
[3985fa320879:00182] [ 8] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x69)[0x7fa80d5e4209]
[3985fa320879:00182] [ 9] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xb6)[0x7fa80d6566d6]
[3985fa320879:00182] [10] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x81)[0x7fa80d5e3cf1]
[3985fa320879:00182] [11] /opt/hpcx/ucx/lib/libucs.so.0(+0x52467)[0x7fa80d657467]
[3985fa320879:00182] [12] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x7fa80d7d7e9a]
[3985fa320879:00182] [13] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x7fa8442ea144]
[3985fa320879:00182] [14] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7fa8442f0c05]
[3985fa320879:00182] [15] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x3ca)[0x7fa8464dafba]
[3985fa320879:00182] [16] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_basic_linear+0x119)[0x7fa846518949]
[3985fa320879:00182] [17] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fa80c25b840]
[3985fa320879:00182] [18] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7fa8464f1c11]
[3985fa320879:00182] [19] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common12MPIBroadcast7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x3e2)[0x7fa80f2e25c2]
[3985fa320879:00182] [20] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteBroadcastERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fa80f2b752d]
[3985fa320879:00182] [21] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x151)[0x7fa80f2b7901]
[3985fa320879:00182] [22] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7fa80f28602d]
[3985fa320879:00182] [23] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fa8ecd92de4]
[3985fa320879:00182] [24] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fa8f2b44609]
[3985fa320879:00182] [25] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fa8f2c7e133]
[3985fa320879:00182] *** End of error message ***


Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node 3985fa320879 exited on signal 6 (Aborted).

And the tao command:

!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 2

Please launch the docker directly:

docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 /bin/bash
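If you also need the spec files and dataset from the host to be visible inside the container, you can mount them with -v; the host and container paths below are placeholders for illustration, not paths from this thread:

docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm \
  -v /path/to/your/tao_workspace:/workspace/tao-experiments \
  nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 /bin/bash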

Inside the docker, update the OpenMPI version to 4.1.5.

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
mkdir src
mv openmpi-4.1.5.tar.bz2 src/
cd src/
tar -jxf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5
./configure --prefix=$HOME/opt/openmpi
make -j128 all
make install
mpirun --version
echo "export PATH=$PATH:$HOME/opt/openmpi/bin" >> $HOME/.bashrc
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/opt/openmpi/lib" >> $HOME/.bashrc
. ~/.bashrc
export OPAL_PREFIX=$HOME/opt/openmpi/
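
A quick sanity check of the new OpenMPI (a sketch based on the install prefix used above):

which mpirun        # should resolve under $HOME/opt/openmpi/bin; if an older
                    # HPC-X mpirun still wins, prepend the new bin directory to PATH
mpirun --version    # expect: mpirun (Open MPI) 4.1.5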

Then run training while adding OMPI_MCA_btl_vader_single_copy_mechanism=none


OMPI_MCA_btl_vader_single_copy_mechanism=none yolo_v4 train -e your_spec.txt -r results -k key --gpus 2
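
For reference, this MCA parameter disables the vader BTL's single-copy (CMA) path, which is what fails above with the process_vm_writev "Operation not permitted" error inside the container. You can equally export it once for the shell session instead of prefixing every command (a sketch):

export OMPI_MCA_btl_vader_single_copy_mechanism=none
yolo_v4 train -e your_spec.txt -r results -k key --gpus 2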

Any issues, please let me know. Thanks a lot.

Hi Morganh,
It is working, but when I assign 4 GPUs I get the error below.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 3 with PID 0 on node fae5471fc2a1 exited on signal 9 (Killed).

Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

The above log can be ignored. Can you share the full log via the upload button?

Sorry for the late response; the log file is below.
logs.txt (75.5 KB)
When I set the batch size to 2, the resolution to 640 x 384, and the number of GPUs to 2, the training fails after 3 epochs.
Below are the GPU details.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01    Driver Version: 515.105.01    CUDA Version: 11.8   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:08:00.0 Off |                  Off |
| 77%   90C    P2    67W / 140W |   1436MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    Off  | 00000000:09:00.0 Off |                  Off |
| 43%   62C    P8    16W / 140W |      8MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A4000    Off  | 00000000:42:00.0 Off |                  Off |
| 44%   62C    P8    16W / 140W |      8MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A4000    Off  | 00000000:43:00.0 Off |                  Off |
| 41%   47C    P8    13W / 140W |     58MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
The model is YOLOv4 with a Darknet backbone, and the number of epochs is 80.

Could you please try to use the latest 4.0.1 docker?
docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

And then use the same steps as mentioned above to update MPI version to 4.1.5 inside the docker.

I tried with three GPUs and a batch size of six, but I get the same error. Kindly check the log file.
logs.txt (84.1 KB)

Could you share the latest spec file? It looks like an OOM (out of memory) issue. Can you set a lower batch size?
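
For reference, the per-GPU batch size is set in training_config of the YOLOv4 spec; a minimal sketch, assuming the standard TAO yolo_v4 spec layout (values are illustrative):

training_config {
  batch_size_per_gpu: 2   # lower this value to reduce per-GPU memory usage
  num_epochs: 80
  # other training fields unchanged
}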

Yes, sure.
yolov4_spec.txt (2.2 KB)

Can you run
$ nvidia-smi

Yes, here it is:
Fri Jun 23 16:03:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01    Driver Version: 515.105.01    CUDA Version: 11.7   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:08:00.0 Off |                  Off |
| 78%   92C    P2    62W / 140W |    660MiB / 16376MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    Off  | 00000000:09:00.0 Off |                  Off |
| 96%   94C    P2    99W / 140W |   2207MiB / 16376MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A4000    Off  | 00000000:42:00.0 Off |                  Off |
|100%   96C    P2    82W / 140W |  15192MiB / 16376MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A4000    Off  | 00000000:43:00.0 Off |                  Off |
| 63%   81C    P2    76W / 140W |  15158MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1414      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1648445      C   /usr/bin/python3.6                156MiB |
|    0   N/A  N/A   2161057      C   python3                           347MiB |
|    1   N/A  N/A      1414      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A   2161057      C   python3                          2199MiB |
|    2   N/A  N/A      1414      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A   1648541      C   python3.6                       15118MiB |
|    3   N/A  N/A      1414      G   /usr/lib/xorg/Xorg                 46MiB |
|    3   N/A  N/A      2034      G   /usr/bin/gnome-shell                8MiB |
|    3   N/A  N/A   1648542      C   python3.6                       15066MiB |
+-----------------------------------------------------------------------------+

Can you please try the experiments below to check whether there is still an OOM? (A spec sketch for experiment 1 follows this list.)
experiment 1: set mosaic_prob=0
experiment 2: use fewer training images
experiment 3: use resnet18 instead
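
For experiment 1, mosaic augmentation is controlled in augmentation_config; a minimal sketch, assuming the standard TAO yolo_v4 spec layout:

augmentation_config {
  mosaic_prob: 0.0   # disable mosaic augmentation to cut per-batch memory
  # other augmentation fields unchanged
}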

Sorry for the delayed reply. My GPU system is currently busy training models; can we do these experiments later?

No problem. And just to add two more experiments (a command sketch for experiment 5 follows this list):
experiment 4: set the training batch size to 1.
experiment 5: run with AMP enabled. Refer to Optimizing the Training Pipeline - NVIDIA Docs.
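
For experiment 5, AMP is typically enabled with an extra flag on the train command; a sketch, assuming the --use_amp flag documented for TAO TF1 networks (confirm against the linked docs for your version):

OMPI_MCA_btl_vader_single_copy_mechanism=none yolo_v4 train -e your_spec.txt -r results -k key --gpus 2 --use_amp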

Experiment 4 is successful, while experiment 5 is unsuccessful.
Two GPUs are running experiment 4, and my training is continuing.


Thanks for the info.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.