YOLOv4 multi-GPU training with Darknet architecture encounters a problem

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): NVIDIA RTX A4000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): YOLOv4
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here): nvidia/tao/tao-toolkit:4.0.0-tf2.9.1

3985fa320879:182:280 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
3985fa320879:182:280 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.15.1+cuda11.8
3985fa320879:182:280 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
3985fa320879:182:280 [0] NCCL INFO P2P plugin IBext
3985fa320879:182:280 [0] NCCL INFO NET/IB : No device found.
3985fa320879:182:280 [0] NCCL INFO NET/IB : No device found.
3985fa320879:182:280 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
3985fa320879:182:280 [0] NCCL INFO Using network Socket
3985fa320879:182:280 [0] NCCL INFO Setting affinity for GPU 0 to 0f0f
3985fa320879:182:280 [0] NCCL INFO Channel 00/02 : 0 1
3985fa320879:182:280 [0] NCCL INFO Channel 01/02 : 0 1
3985fa320879:182:280 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
3985fa320879:182:280 [0] NCCL INFO Channel 00/0 : 0[8000] -> 1[9000] via P2P/IPC
3985fa320879:182:280 [0] NCCL INFO Channel 01/0 : 0[8000] -> 1[9000] via P2P/IPC
3985fa320879:182:280 [0] NCCL INFO Connected all rings
3985fa320879:182:280 [0] NCCL INFO Connected all trees
3985fa320879:182:280 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
3985fa320879:182:280 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
3985fa320879:182:280 [0] NCCL INFO comm 0x7fa80bbc2c60 rank 0 nranks 2 cudaDev 0 busId 8000 - Init COMPLETE
1/5934 […] - ETA: 56:01:36 - loss: 14555.9795[3985fa320879:182 :0:280] cma_ep.c:81 process_vm_writev(pid=183 {0x7fa227338000,21504}-->{0x7f89fef35500,21504}) returned -1: Operation not permitted
==== backtrace (tid: 280) ====
0 0x00000000000039f2 uct_cma_ep_tx_error() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:81
1 0x0000000000003d66 uct_cma_ep_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:114
2 0x000000000001e209 uct_scopy_ep_progress_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/base/scopy_ep.c:151
3 0x00000000000516d6 ucs_arbiter_dispatch_nonempty() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.c:321
4 0x000000000001dcf1 ucs_arbiter_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.h:386
5 0x0000000000052467 ucs_callbackq_slow_proxy() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.c:404
6 0x000000000004be9a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.h:211
7 0x000000000004be9a uct_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/api/uct.h:2647
8 0x000000000004be9a ucp_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucp/core/ucp_worker.c:2804
9 0x0000000000037144 opal_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/runtime/opal_progress.c:231
10 0x000000000003dc05 ompi_sync_wait_mt() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/threads/wait_sync.c:85
11 0x0000000000055fba ompi_request_default_wait_all() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/request/req_wait.c:234
12 0x0000000000093949 ompi_coll_base_bcast_intra_basic_linear() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:679
13 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
14 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:114
15 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:41
16 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:395
17 0x00000000001055c2 horovod::common::TensorShape::~TensorShape() /opt/horovod/horovod/common/ops/…/common.h:234
18 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:396
19 0x00000000000da52d horovod::common::OperationManager::ExecuteBroadcast() /opt/horovod/horovod/common/ops/operation_manager.cc:66
20 0x00000000000da901 horovod::common::OperationManager::ExecuteOperation() /opt/horovod/horovod/common/ops/operation_manager.cc:116
21 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:297
22 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=() /usr/include/c++/9/bits/shared_ptr_base.h:1265
23 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=() /usr/include/c++/9/bits/shared_ptr.h:335
24 0x00000000000a902d horovod::common::Event::operator=() /opt/horovod/horovod/common/common.h:185
25 0x00000000000a902d horovod::common::Status::operator=() /opt/horovod/horovod/common/common.h:197
26 0x00000000000a902d PerformOperation() /opt/horovod/horovod/common/operations.cc:297
27 0x00000000000a902d RunLoopOnce() /opt/horovod/horovod/common/operations.cc:787
28 0x00000000000a902d BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:651
29 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
30 0x0000000000008609 start_thread() ???:0
31 0x000000000011f133 clone() ???:0

[3985fa320879:00182] *** Process received signal ***
[3985fa320879:00182] Signal: Aborted (6)
[3985fa320879:00182] Signal code: (-6)
[3985fa320879:00182] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fa8f2ba2090]
[3985fa320879:00182] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fa8f2ba200b]
[3985fa320879:00182] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fa8f2b81859]
[3985fa320879:00182] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(+0x5a7dd)[0x7fa80d65f7dd]
[3985fa320879:00182] [ 4] /opt/hpcx/ucx/lib/libucs.so.0(+0x5fdc2)[0x7fa80d664dc2]
[3985fa320879:00182] [ 5] /opt/hpcx/ucx/lib/libucs.so.0(ucs_log_dispatch+0xe4)[0x7fa80d665194]
[3985fa320879:00182] [ 6] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(+0x39f2)[0x7fa80c4a69f2]
[3985fa320879:00182] [ 7] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x186)[0x7fa80c4a6d66]
[3985fa320879:00182] [ 8] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x69)[0x7fa80d5e4209]
[3985fa320879:00182] [ 9] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xb6)[0x7fa80d6566d6]
[3985fa320879:00182] [10] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x81)[0x7fa80d5e3cf1]
[3985fa320879:00182] [11] /opt/hpcx/ucx/lib/libucs.so.0(+0x52467)[0x7fa80d657467]
[3985fa320879:00182] [12] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x7fa80d7d7e9a]
[3985fa320879:00182] [13] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x7fa8442ea144]
[3985fa320879:00182] [14] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7fa8442f0c05]
[3985fa320879:00182] [15] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x3ca)[0x7fa8464dafba]
[3985fa320879:00182] [16] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_basic_linear+0x119)[0x7fa846518949]
[3985fa320879:00182] [17] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fa80c25b840]
[3985fa320879:00182] [18] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7fa8464f1c11]
[3985fa320879:00182] [19] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common12MPIBroadcast7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x3e2)[0x7fa80f2e25c2]
[3985fa320879:00182] [20] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteBroadcastERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fa80f2b752d]
[3985fa320879:00182] [21] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x151)[0x7fa80f2b7901]
[3985fa320879:00182] [22] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7fa80f28602d]
[3985fa320879:00182] [23] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fa8ecd92de4]
[3985fa320879:00182] [24] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fa8f2b44609]
[3985fa320879:00182] [25] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fa8f2c7e133]
[3985fa320879:00182] *** End of error message ***


Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node 3985fa320879 exited on signal 6 (Aborted).

And the tao command:

!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 2

Please launch the docker directly:

docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 /bin/bash
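If you also need the spec files and dataset from the host to be visible inside the container, you can mount them with -v; the host and container paths below are placeholders for illustration, not paths from this thread:

docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm \
  -v /path/to/your/tao_workspace:/workspace/tao-experiments \
  nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 /bin/bash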

Inside the docker, update the OpenMPI version to 4.1.5.

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
mkdir src
mv openmpi-4.1.5.tar.bz2 src/
cd src/
tar -jxf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5
./configure --prefix=$HOME/opt/openmpi
make -j128 all
make install
mpirun --version
echo "export PATH=$PATH:$HOME/opt/openmpi/bin" >> $HOME/.bashrc
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/opt/openmpi/lib" >> $HOME/.bashrc
. ~/.bashrc
export OPAL_PREFIX=$HOME/opt/openmpi/
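
A quick sanity check of the new OpenMPI (a sketch based on the install prefix used above):

which mpirun        # should resolve under $HOME/opt/openmpi/bin; if an older
                    # HPC-X mpirun still wins, prepend the new bin directory to PATH
mpirun --version    # expect: mpirun (Open MPI) 4.1.5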

Then run training while adding OMPI_MCA_btl_vader_single_copy_mechanism=none


OMPI_MCA_btl_vader_single_copy_mechanism=none yolo_v4 train -e your_spec.txt -r results -k key --gpus 2
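
For reference, this MCA parameter disables the vader BTL's single-copy (CMA) path, which is what fails above with the process_vm_writev "Operation not permitted" error inside the container. You can equally export it once for the shell session instead of prefixing every command (a sketch):

export OMPI_MCA_btl_vader_single_copy_mechanism=none
yolo_v4 train -e your_spec.txt -r results -k key --gpus 2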

Any issues, please let me know. Thanks a lot.

Hi Morganh,
It is working, but when I assign 4 GPUs I get the error below.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 3 with PID 0 on node fae5471fc2a1 exited on signal 9 (Killed).

Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

The above log can be ignored. Can you share the full log via the upload button?

Sorry for the late response; the log file is below.
logs.txt (75.5 KB)
When I set the batch size to 2, the resolution to 640 x 384, and the number of GPUs to 2, the training fails after 3 epochs.
Below are the GPU details.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01    Driver Version: 515.105.01    CUDA Version: 11.8   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:08:00.0 Off |                  Off |
| 77%   90C    P2    67W / 140W |   1436MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    Off  | 00000000:09:00.0 Off |                  Off |
| 43%   62C    P8    16W / 140W |      8MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A4000    Off  | 00000000:42:00.0 Off |                  Off |
| 44%   62C    P8    16W / 140W |      8MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A4000    Off  | 00000000:43:00.0 Off |                  Off |
| 41%   47C    P8    13W / 140W |     58MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
The model is YOLOv4 with a Darknet backbone, and the number of epochs is 80.

Could you please try to use the latest 4.0.1 docker?
docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

And then use the same steps as mentioned above to update MPI version to 4.1.5 inside the docker.

I tried with three GPUs and a batch size of six, but I get the same error. Kindly check the log file.
logs.txt (84.1 KB)

Could you share the latest spec file? It looks like an OOM (out of memory) issue. Can you set a lower batch size?
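
For reference, the per-GPU batch size is set in training_config of the YOLOv4 spec; a minimal sketch, assuming the standard TAO yolo_v4 spec layout (values are illustrative):

training_config {
  batch_size_per_gpu: 2   # lower this value to reduce per-GPU memory usage
  num_epochs: 80
  # other training fields unchanged
}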

Yes, sure.
yolov4_spec.txt (2.2 KB)

Can you run
$ nvidia-smi

Yes, here it is:
Fri Jun 23 16:03:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01    Driver Version: 515.105.01    CUDA Version: 11.7   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:08:00.0 Off |                  Off |
| 78%   92C    P2    62W / 140W |    660MiB / 16376MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    Off  | 00000000:09:00.0 Off |                  Off |
| 96%   94C    P2    99W / 140W |   2207MiB / 16376MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A4000    Off  | 00000000:42:00.0 Off |                  Off |
|100%   96C    P2    82W / 140W |  15192MiB / 16376MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A4000    Off  | 00000000:43:00.0 Off |                  Off |
| 63%   81C    P2    76W / 140W |  15158MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1414      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1648445      C   /usr/bin/python3.6                156MiB |
|    0   N/A  N/A   2161057      C   python3                           347MiB |
|    1   N/A  N/A      1414      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A   2161057      C   python3                          2199MiB |
|    2   N/A  N/A      1414      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A   1648541      C   python3.6                       15118MiB |
|    3   N/A  N/A      1414      G   /usr/lib/xorg/Xorg                 46MiB |
|    3   N/A  N/A      2034      G   /usr/bin/gnome-shell                8MiB |
|    3   N/A  N/A   1648542      C   python3.6                       15066MiB |
+-----------------------------------------------------------------------------+

Can you please try the experiments below to check whether there is still an OOM? (A spec sketch for experiment 1 follows this list.)
experiment 1: set mosaic_prob=0
experiment 2: use fewer training images
experiment 3: use resnet18 instead
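
For experiment 1, mosaic augmentation is controlled in augmentation_config; a minimal sketch, assuming the standard TAO yolo_v4 spec layout:

augmentation_config {
  mosaic_prob: 0.0   # disable mosaic augmentation to cut per-batch memory
  # other augmentation fields unchanged
}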

Sorry for the delayed reply. My GPU system is currently busy training models; can we do these experiments later?

No problem. And just to add two more experiments (a command sketch for experiment 5 follows this list):
experiment 4: set the training batch size to 1.
experiment 5: run with AMP enabled. Refer to Optimizing the Training Pipeline - NVIDIA Docs.
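
For experiment 5, AMP is typically enabled with an extra flag on the train command; a sketch, assuming the --use_amp flag documented for TAO TF1 networks (confirm against the linked docs for your version):

OMPI_MCA_btl_vader_single_copy_mechanism=none yolo_v4 train -e your_spec.txt -r results -k key --gpus 2 --use_amp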

Experiment 4 is successful, while experiment 5 is unsuccessful.
Two GPUs are running experiment 4, and my training is continuing.


Thanks for the info.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.