Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Classification)
• TLT Version (TAO 4)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
Training on multiple GPUs fails with the following error message:
2023-03-07 13:03:24,168 [INFO] root: Starting Training Loop.
Epoch 1/80
04516cf31dfa:209:285 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
04516cf31dfa:209:285 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
04516cf31dfa:209:285 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
04516cf31dfa:209:285 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
04516cf31dfa:209:285 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
04516cf31dfa:209:285 [0] NCCL INFO cudaDriverVersion 11060
NCCL version 2.15.1+cuda11.8
04516cf31dfa:209:285 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
04516cf31dfa:209:285 [0] NCCL INFO P2P plugin IBext
04516cf31dfa:209:285 [0] NCCL INFO NET/IB : No device found.
04516cf31dfa:209:285 [0] NCCL INFO NET/IB : No device found.
04516cf31dfa:209:285 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
04516cf31dfa:209:285 [0] NCCL INFO Using network Socket
04516cf31dfa:209:285 [0] graph/search.cc:885 NCCL WARN Could not find a path for pattern 4, falling back to simple order
04516cf31dfa:209:285 [0] graph/search.cc:885 NCCL WARN Could not find a path for pattern 1, falling back to simple order
04516cf31dfa:209:285 [0] NCCL INFO Channel 00/02 : 0 1
04516cf31dfa:209:285 [0] NCCL INFO Channel 01/02 : 0 1
04516cf31dfa:209:285 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
04516cf31dfa:209:285 [0] NCCL INFO Channel 00 : 0[1000] -> 1[5000] via SHM/direct/direct
04516cf31dfa:209:285 [0] NCCL INFO Channel 01 : 0[1000] -> 1[5000] via SHM/direct/direct
04516cf31dfa:209:285 [0] NCCL INFO Connected all rings
04516cf31dfa:209:285 [0] NCCL INFO Connected all trees
04516cf31dfa:209:285 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
04516cf31dfa:209:285 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
04516cf31dfa:209:285 [0] NCCL INFO comm 0x7fd2cbb87e90 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
1/91 […] - ETA: 18:40 - loss: 3.7309 - acc: 0.0781[04516cf31dfa:209 :0:285] cma_ep.c:81 process_vm_writev(pid=210 {0x7fd0e5456700,294912}-->{0x7fadb3c01000,294912}) returned -1: Operation not permitted
==== backtrace (tid: 285) ====
0 0x00000000000039f2 uct_cma_ep_tx_error() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:81
1 0x0000000000003d66 uct_cma_ep_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:114
2 0x000000000001e209 uct_scopy_ep_progress_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/base/scopy_ep.c:151
3 0x00000000000516d6 ucs_arbiter_dispatch_nonempty() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.c:321
4 0x000000000001dcf1 ucs_arbiter_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.h:386
5 0x0000000000052467 ucs_callbackq_slow_proxy() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.c:404
6 0x000000000004be9a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.h:211
7 0x000000000004be9a uct_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/api/uct.h:2647
8 0x000000000004be9a ucp_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucp/core/ucp_worker.c:2804
9 0x0000000000037144 opal_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/runtime/opal_progress.c:231
10 0x000000000003dc05 ompi_sync_wait_mt() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/threads/wait_sync.c:85
11 0x0000000000055fba ompi_request_default_wait_all() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/request/req_wait.c:234
12 0x0000000000093949 ompi_coll_base_bcast_intra_basic_linear() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:679
13 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
14 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:114
15 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:41
16 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:395
17 0x00000000001055c2 horovod::common::TensorShape::~TensorShape() /opt/horovod/horovod/common/ops/…/common.h:234
18 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:396
19 0x00000000000da52d horovod::common::OperationManager::ExecuteBroadcast() /opt/horovod/horovod/common/ops/operation_manager.cc:66
20 0x00000000000da901 horovod::common::OperationManager::ExecuteOperation() /opt/horovod/horovod/common/ops/operation_manager.cc:116
21 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:297
22 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=() /usr/include/c++/9/bits/shared_ptr_base.h:1265
23 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=() /usr/include/c++/9/bits/shared_ptr.h:335
24 0x00000000000a902d horovod::common::Event::operator=() /opt/horovod/horovod/common/common.h:185
25 0x00000000000a902d horovod::common::Status::operator=() /opt/horovod/horovod/common/common.h:197
26 0x00000000000a902d PerformOperation() /opt/horovod/horovod/common/operations.cc:297
27 0x00000000000a902d RunLoopOnce() /opt/horovod/horovod/common/operations.cc:787
28 0x00000000000a902d BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:651
29 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
30 0x0000000000008609 start_thread() ???:0
31 0x000000000011f133 clone() ???:0
[04516cf31dfa:00209] *** Process received signal ***
[04516cf31dfa:00209] Signal: Aborted (6)
[04516cf31dfa:00209] Signal code: (-6)
[04516cf31dfa:00209] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fd3a87d7090]
[04516cf31dfa:00209] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fd3a87d700b]
[04516cf31dfa:00209] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fd3a87b6859]
[04516cf31dfa:00209] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(+0x5a7dd)[0x7fd2cc2d27dd]
[04516cf31dfa:00209] [ 4] /opt/hpcx/ucx/lib/libucs.so.0(+0x5fdc2)[0x7fd2cc2d7dc2]
[04516cf31dfa:00209] [ 5] /opt/hpcx/ucx/lib/libucs.so.0(ucs_log_dispatch+0xe4)[0x7fd2cc2d8194]
[04516cf31dfa:00209] [ 6] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(+0x39f2)[0x7fd2cc2229f2]
[04516cf31dfa:00209] [ 7] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x186)[0x7fd2cc222d66]
[04516cf31dfa:00209] [ 8] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x69)[0x7fd2ce111209]
[04516cf31dfa:00209] [ 9] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xb6)[0x7fd2cc2c96d6]
[04516cf31dfa:00209] [10] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x81)[0x7fd2ce110cf1]
[04516cf31dfa:00209] [11] /opt/hpcx/ucx/lib/libucs.so.0(+0x52467)[0x7fd2cc2ca467]
[04516cf31dfa:00209] [12] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x7fd2ce185e9a]
[04516cf31dfa:00209] [13] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x7fd2d05bb144]
[04516cf31dfa:00209] [14] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7fd2d05c1c05]
[04516cf31dfa:00209] [15] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x3ca)[0x7fd2d07a8fba]
[04516cf31dfa:00209] [16] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_basic_linear+0x119)[0x7fd2d07e6949]
[04516cf31dfa:00209] [17] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fd2b5645840]
[04516cf31dfa:00209] [18] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7fd2d07bfc11]
[04516cf31dfa:00209] [19] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common12MPIBroadcast7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x3e2)[0x7fd2d098b5c2]
[04516cf31dfa:00209] [20] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteBroadcastERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fd2d096052d]
[04516cf31dfa:00209] [21] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x151)[0x7fd2d0960901]
[04516cf31dfa:00209] [22] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7fd2d092f02d]
[04516cf31dfa:00209] [23] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fd3a774bde4]
[04516cf31dfa:00209] [24] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fd3a8779609]
[04516cf31dfa:00209] [25] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fd3a88b3133]
[04516cf31dfa:00209] *** End of error message ***
I removed the 530 driver and installed the 510 driver, as suggested in similar threads, but the issue is still there.
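In case it helps narrow this down, below is a small diagnostic I can run inside the training container to check whether cross-memory attach (the process_vm_writev call that fails above) is being blocked. This is just my own sketch based on the error, not anything from the TAO documentation; the idea that a missing CAP_SYS_PTRACE capability or a restrictive Yama ptrace_scope is the likely cause is my assumption.

def yama_ptrace_scope():
    """Read kernel.yama.ptrace_scope if the sysctl exists; 0 is the most permissive setting."""
    try:
        with open("/proc/sys/kernel/yama/ptrace_scope") as f:
            return int(f.read().strip())
    except (FileNotFoundError, PermissionError):
        return None

def has_cap_sys_ptrace():
    """Check /proc/self/status for CAP_SYS_PTRACE (capability bit 19) in the bounding set."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("CapBnd:"):
                bounding = int(line.split()[1], 16)
                return bool(bounding & (1 << 19))
    return None

if __name__ == "__main__":
    # Assumption: if CAP_SYS_PTRACE is absent or ptrace_scope > 0, CMA-based
    # shared-memory copies (process_vm_writev) can be denied inside the container.
    print("yama ptrace_scope:", yama_ptrace_scope())
    print("CAP_SYS_PTRACE in bounding set:", has_cap_sys_ptrace())

If this confirms that cross-memory attach is being blocked, I assume the fix is on the container-runtime side rather than the driver, but I would appreciate guidance on the recommended way to configure this for the TAO launcher.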