Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc): NVIDIA RTX A4000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): YOLOv4
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): nvidia/tao/tao-toolkit:4.0.0-tf2.9.1
3985fa320879:182:280 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
3985fa320879:182:280 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
3985fa320879:182:280 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.15.1+cuda11.8
3985fa320879:182:280 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
3985fa320879:182:280 [0] NCCL INFO P2P plugin IBext
3985fa320879:182:280 [0] NCCL INFO NET/IB : No device found.
3985fa320879:182:280 [0] NCCL INFO NET/IB : No device found.
3985fa320879:182:280 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
3985fa320879:182:280 [0] NCCL INFO Using network Socket
3985fa320879:182:280 [0] NCCL INFO Setting affinity for GPU 0 to 0f0f
3985fa320879:182:280 [0] NCCL INFO Channel 00/02 : 0 1
3985fa320879:182:280 [0] NCCL INFO Channel 01/02 : 0 1
3985fa320879:182:280 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
3985fa320879:182:280 [0] NCCL INFO Channel 00/0 : 0[8000] -> 1[9000] via P2P/IPC
3985fa320879:182:280 [0] NCCL INFO Channel 01/0 : 0[8000] -> 1[9000] via P2P/IPC
3985fa320879:182:280 [0] NCCL INFO Connected all rings
3985fa320879:182:280 [0] NCCL INFO Connected all trees
3985fa320879:182:280 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
3985fa320879:182:280 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
3985fa320879:182:280 [0] NCCL INFO comm 0x7fa80bbc2c60 rank 0 nranks 2 cudaDev 0 busId 8000 - Init COMPLETE
1/5934 […] - ETA: 56:01:36 - loss: 14555.9795
[3985fa320879:182 :0:280] cma_ep.c:81 process_vm_writev(pid=183 {0x7fa227338000,21504}-->{0x7f89fef35500,21504}) returned -1: Operation not permitted
==== backtrace (tid: 280) ====
0 0x00000000000039f2 uct_cma_ep_tx_error() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:81
1 0x0000000000003d66 uct_cma_ep_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:114
2 0x000000000001e209 uct_scopy_ep_progress_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/base/scopy_ep.c:151
3 0x00000000000516d6 ucs_arbiter_dispatch_nonempty() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.c:321
4 0x000000000001dcf1 ucs_arbiter_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.h:386
5 0x0000000000052467 ucs_callbackq_slow_proxy() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.c:404
6 0x000000000004be9a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.h:211
7 0x000000000004be9a uct_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/api/uct.h:2647
8 0x000000000004be9a ucp_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucp/core/ucp_worker.c:2804
9 0x0000000000037144 opal_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/runtime/opal_progress.c:231
10 0x000000000003dc05 ompi_sync_wait_mt() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/threads/wait_sync.c:85
11 0x0000000000055fba ompi_request_default_wait_all() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/request/req_wait.c:234
12 0x0000000000093949 ompi_coll_base_bcast_intra_basic_linear() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:679
13 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
14 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:114
15 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:41
16 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:395
17 0x00000000001055c2 horovod::common::TensorShape::~TensorShape() /opt/horovod/horovod/common/ops/…/common.h:234
18 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:396
19 0x00000000000da52d horovod::common::OperationManager::ExecuteBroadcast() /opt/horovod/horovod/common/ops/operation_manager.cc:66
20 0x00000000000da901 horovod::common::OperationManager::ExecuteOperation() /opt/horovod/horovod/common/ops/operation_manager.cc:116
21 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:297
22 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=() /usr/include/c++/9/bits/shared_ptr_base.h:1265
23 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=() /usr/include/c++/9/bits/shared_ptr.h:335
24 0x00000000000a902d horovod::common::Event::operator=() /opt/horovod/horovod/common/common.h:185
25 0x00000000000a902d horovod::common::Status::operator=() /opt/horovod/horovod/common/common.h:197
26 0x00000000000a902d PerformOperation() /opt/horovod/horovod/common/operations.cc:297
27 0x00000000000a902d RunLoopOnce() /opt/horovod/horovod/common/operations.cc:787
28 0x00000000000a902d BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:651
29 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
30 0x0000000000008609 start_thread() ???:0
31 0x000000000011f133 clone() ???:0
[3985fa320879:00182] *** Process received signal ***
[3985fa320879:00182] Signal: Aborted (6)
[3985fa320879:00182] Signal code: (-6)
[3985fa320879:00182] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fa8f2ba2090]
[3985fa320879:00182] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fa8f2ba200b]
[3985fa320879:00182] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fa8f2b81859]
[3985fa320879:00182] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(+0x5a7dd)[0x7fa80d65f7dd]
[3985fa320879:00182] [ 4] /opt/hpcx/ucx/lib/libucs.so.0(+0x5fdc2)[0x7fa80d664dc2]
[3985fa320879:00182] [ 5] /opt/hpcx/ucx/lib/libucs.so.0(ucs_log_dispatch+0xe4)[0x7fa80d665194]
[3985fa320879:00182] [ 6] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(+0x39f2)[0x7fa80c4a69f2]
[3985fa320879:00182] [ 7] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x186)[0x7fa80c4a6d66]
[3985fa320879:00182] [ 8] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x69)[0x7fa80d5e4209]
[3985fa320879:00182] [ 9] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xb6)[0x7fa80d6566d6]
[3985fa320879:00182] [10] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x81)[0x7fa80d5e3cf1]
[3985fa320879:00182] [11] /opt/hpcx/ucx/lib/libucs.so.0(+0x52467)[0x7fa80d657467]
[3985fa320879:00182] [12] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x7fa80d7d7e9a]
[3985fa320879:00182] [13] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x7fa8442ea144]
[3985fa320879:00182] [14] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7fa8442f0c05]
[3985fa320879:00182] [15] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x3ca)[0x7fa8464dafba]
[3985fa320879:00182] [16] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_basic_linear+0x119)[0x7fa846518949]
[3985fa320879:00182] [17] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fa80c25b840]
[3985fa320879:00182] [18] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7fa8464f1c11]
[3985fa320879:00182] [19] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common12MPIBroadcast7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x3e2)[0x7fa80f2e25c2]
[3985fa320879:00182] [20] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteBroadcastERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fa80f2b752d]
[3985fa320879:00182] [21] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x151)[0x7fa80f2b7901]
[3985fa320879:00182] [22] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7fa80f28602d]
[3985fa320879:00182] [23] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fa8ecd92de4]
[3985fa320879:00182] [24] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fa8f2b44609]
[3985fa320879:00182] [25] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fa8f2c7e133]
[3985fa320879:00182] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node 3985fa320879 exited on signal 6 (Aborted).