TAO training on multiple GPUs failed

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Classification)
• TLT Version (TAO 4)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

Training on multiple GPUs has failed and I got the following error message:
2023-03-07 13:03:24,168 [INFO] root: Starting Training Loop.
Epoch 1/80
04516cf31dfa:209:285 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
04516cf31dfa:209:285 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
04516cf31dfa:209:285 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
04516cf31dfa:209:285 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
04516cf31dfa:209:285 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
04516cf31dfa:209:285 [0] NCCL INFO cudaDriverVersion 11060
NCCL version 2.15.1+cuda11.8
04516cf31dfa:209:285 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
04516cf31dfa:209:285 [0] NCCL INFO P2P plugin IBext
04516cf31dfa:209:285 [0] NCCL INFO NET/IB : No device found.
04516cf31dfa:209:285 [0] NCCL INFO NET/IB : No device found.
04516cf31dfa:209:285 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
04516cf31dfa:209:285 [0] NCCL INFO Using network Socket

04516cf31dfa:209:285 [0] graph/search.cc:885 NCCL WARN Could not find a path for pattern 4, falling back to simple order

04516cf31dfa:209:285 [0] graph/search.cc:885 NCCL WARN Could not find a path for pattern 1, falling back to simple order
04516cf31dfa:209:285 [0] NCCL INFO Channel 00/02 : 0 1
04516cf31dfa:209:285 [0] NCCL INFO Channel 01/02 : 0 1
04516cf31dfa:209:285 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
04516cf31dfa:209:285 [0] NCCL INFO Channel 00 : 0[1000] -> 1[5000] via SHM/direct/direct
04516cf31dfa:209:285 [0] NCCL INFO Channel 01 : 0[1000] -> 1[5000] via SHM/direct/direct
04516cf31dfa:209:285 [0] NCCL INFO Connected all rings
04516cf31dfa:209:285 [0] NCCL INFO Connected all trees
04516cf31dfa:209:285 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
04516cf31dfa:209:285 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
04516cf31dfa:209:285 [0] NCCL INFO comm 0x7fd2cbb87e90 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
1/91 […] - ETA: 18:40 - loss: 3.7309 - acc: 0.0781[04516cf31dfa:209 :0:285] cma_ep.c:81 process_vm_writev(pid=210 {0x7fd0e5456700,294912}-->{0x7fadb3c01000,294912}) returned -1: Operation not permitted
==== backtrace (tid: 285) ====
0 0x00000000000039f2 uct_cma_ep_tx_error() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:81
1 0x0000000000003d66 uct_cma_ep_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:114
2 0x000000000001e209 uct_scopy_ep_progress_tx() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/base/scopy_ep.c:151
3 0x00000000000516d6 ucs_arbiter_dispatch_nonempty() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.c:321
4 0x000000000001dcf1 ucs_arbiter_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.h:386
5 0x0000000000052467 ucs_callbackq_slow_proxy() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.c:404
6 0x000000000004be9a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.h:211
7 0x000000000004be9a uct_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/api/uct.h:2647
8 0x000000000004be9a ucp_worker_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucp/core/ucp_worker.c:2804
9 0x0000000000037144 opal_progress() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/runtime/opal_progress.c:231
10 0x000000000003dc05 ompi_sync_wait_mt() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/threads/wait_sync.c:85
11 0x0000000000055fba ompi_request_default_wait_all() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/request/req_wait.c:234
12 0x0000000000093949 ompi_coll_base_bcast_intra_basic_linear() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:679
13 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
14 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:114
15 0x000000000006cc11 PMPI_Bcast() /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:41
16 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:395
17 0x00000000001055c2 horovod::common::TensorShape::~TensorShape() /opt/horovod/horovod/common/ops/…/common.h:234
18 0x00000000001055c2 horovod::common::MPIBroadcast::Execute() /opt/horovod/horovod/common/ops/mpi_operations.cc:396
19 0x00000000000da52d horovod::common::OperationManager::ExecuteBroadcast() /opt/horovod/horovod/common/ops/operation_manager.cc:66
20 0x00000000000da901 horovod::common::OperationManager::ExecuteOperation() /opt/horovod/horovod/common/ops/operation_manager.cc:116
21 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:297
22 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=() /usr/include/c++/9/bits/shared_ptr_base.h:1265
23 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=() /usr/include/c++/9/bits/shared_ptr.h:335
24 0x00000000000a902d horovod::common::Event::operator=() /opt/horovod/horovod/common/common.h:185
25 0x00000000000a902d horovod::common::Status::operator=() /opt/horovod/horovod/common/common.h:197
26 0x00000000000a902d PerformOperation() /opt/horovod/horovod/common/operations.cc:297
27 0x00000000000a902d RunLoopOnce() /opt/horovod/horovod/common/operations.cc:787
28 0x00000000000a902d BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:651
29 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
30 0x0000000000008609 start_thread() ???:0
31 0x000000000011f133 clone() ???:0

[04516cf31dfa:00209] *** Process received signal ***
[04516cf31dfa:00209] Signal: Aborted (6)
[04516cf31dfa:00209] Signal code: (-6)
[04516cf31dfa:00209] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fd3a87d7090]
[04516cf31dfa:00209] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fd3a87d700b]
[04516cf31dfa:00209] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fd3a87b6859]
[04516cf31dfa:00209] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(+0x5a7dd)[0x7fd2cc2d27dd]
[04516cf31dfa:00209] [ 4] /opt/hpcx/ucx/lib/libucs.so.0(+0x5fdc2)[0x7fd2cc2d7dc2]
[04516cf31dfa:00209] [ 5] /opt/hpcx/ucx/lib/libucs.so.0(ucs_log_dispatch+0xe4)[0x7fd2cc2d8194]
[04516cf31dfa:00209] [ 6] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(+0x39f2)[0x7fd2cc2229f2]
[04516cf31dfa:00209] [ 7] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x186)[0x7fd2cc222d66]
[04516cf31dfa:00209] [ 8] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x69)[0x7fd2ce111209]
[04516cf31dfa:00209] [ 9] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xb6)[0x7fd2cc2c96d6]
[04516cf31dfa:00209] [10] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x81)[0x7fd2ce110cf1]
[04516cf31dfa:00209] [11] /opt/hpcx/ucx/lib/libucs.so.0(+0x52467)[0x7fd2cc2ca467]
[04516cf31dfa:00209] [12] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x7fd2ce185e9a]
[04516cf31dfa:00209] [13] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x7fd2d05bb144]
[04516cf31dfa:00209] [14] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7fd2d05c1c05]
[04516cf31dfa:00209] [15] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x3ca)[0x7fd2d07a8fba]
[04516cf31dfa:00209] [16] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_basic_linear+0x119)[0x7fd2d07e6949]
[04516cf31dfa:00209] [17] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fd2b5645840]
[04516cf31dfa:00209] [18] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7fd2d07bfc11]
[04516cf31dfa:00209] [19] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common12MPIBroadcast7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x3e2)[0x7fd2d098b5c2]
[04516cf31dfa:00209] [20] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteBroadcastERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fd2d096052d]
[04516cf31dfa:00209] [21] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x151)[0x7fd2d0960901]
[04516cf31dfa:00209] [22] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7fd2d092f02d]
[04516cf31dfa:00209] [23] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7fd3a774bde4]
[04516cf31dfa:00209] [24] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fd3a8779609]
[04516cf31dfa:00209] [25] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fd3a88b3133]
[04516cf31dfa:00209] *** End of error message ***

I removed the 530 driver and installed 510 as suggested in similar threads but the issue is still there.

Can you try the 520 driver?
$ sudo apt purge nvidia-driver-510
$ sudo apt autoremove
$ sudo apt autoclean

$ sudo apt install nvidia-driver-520

BTW, can you run the NCCL tests inside the TAO docker?

$ tao classification run /bin/bash

Then, do tests with NVIDIA/nccl-tests: https://github.com/NVIDIA/nccl-tests
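
A minimal sketch of how that could look once inside the container (the message sizes and the two-GPU count below are assumptions; adjust -g to match your setup):

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make                                             # builds against the CUDA/NCCL already in the container
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2   # all-reduce benchmark across 2 GPUs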

I already tried with the 520 driver but it didn't work out.
I ran $ tao classification run /bin/bash and logged into the docker (I just changed classification to classification_tf1).
I couldn't locate the home directory of NCCL.

Got it. Can you git clone it and run some tests? This is just to narrow down the issue.
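
If it helps, NCCL usually ships as a system library inside the container rather than under a dedicated NCCL_HOME, so a quick check (assuming an Ubuntu-based image) could be:

$ dpkg -l | grep -i nccl                # list installed NCCL packages, if any
$ find / -name "libnccl*" 2>/dev/null   # locate the NCCL shared libraries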

Also, can you share the output of $ nvidia-smi?

Here is the nvidia-smi output screenshot.

For the NCCL tests, do I need to run them inside the TAO docker? If so, I already attempted that but ran into two problems:
1. I couldn't find NCCL_HOME.
2. I can't download the test files from git, as I have no permission to write inside the docker.

Can you update to the 520 driver and share $ nvidia-smi?
$ sudo apt purge nvidia-driver-510
$ sudo apt autoremove
$ sudo apt autoclean

$ sudo apt install nvidia-driver-520

$ nvidia-smi
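
On the write-permission point from your previous post: the TAO launcher maps host directories into the container through ~/.tao_mounts.json, so cloning into a mounted path should give you a writable location. A minimal sketch (the paths below are placeholders; adjust them to your setup):

$ cat ~/.tao_mounts.json
{
    "Mounts": [
        {
            "source": "/home/<user>/tao_experiments",
            "destination": "/workspace/tao_experiments"
        }
    ]
}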

It seems that the 520 driver has serious issues with Ubuntu 22.04, as the screen went all black after rebooting (following the installation of that specific driver). I re-installed Linux and all the required packages and switched to driver version 515, but the problem is still there.

According to your comments, you hit the NCCL issue with the 510 driver + Ubuntu 22.04.
Do you have a chance to check with Ubuntu 20.04? According to the TAO Toolkit Quick Start Guide, Ubuntu 20.04 is verified.
Not sure about the status of Ubuntu 22.04.

The problem remains after I installed several driver versions: 510, 515…530.
It used to work correctly when I installed TAO 4 on Ubuntu 22.04 about 2 months back, but something changed with one of the updates that I cannot pin down exactly.
I tried to reproduce the original settings, but with no success.
Could it be related to the latest release of TAO?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

You can "docker pull " one of below dockers to narrow down.

Then use the command below to run inside the docker.
$ docker run --runtime=nvidia -it --rm <docker name> /bin/bash
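
For example, for the TF1 docker (the exact image tag below is an assumption; $ tao info --verbose on the host prints the images your launcher version actually uses):

$ docker pull nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 /bin/bash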

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.