TAO API - Detectnet_v2 - Multi GPU Stuck

Please provide the following information when requesting support.

• Hardware: 2x NVIDIA RTX 6000 Ada Generation
• Network Type: Detectnet_v2
• TLT Version: API 4.0.2 - TaoClient: 4.0.1
• How to reproduce the issue?

I’m migrating the entire deployment to a new workstation.

With 1 GPU, training works in both configurations: with --use_amp and without it.

Now I have enabled the two GPUs in the TAO Helm deployment ‘values.yaml’.
First test with --use_amp enabled:

The initial steps look good in the logs, but when the actual training should start, it freezes.
The GPUs have memory allocated, but the clock frequency stays at idle.

LOG:
7a82fad8-bdd8-4a23-87f6-5c2265c4ce95_use_amp.txt (81.4 KB)

Second test without --use_amp.

It also launches the process and loads the GPU memory, but it gets stuck with the GPU clock frequency staying high.

Now there are new messages with NCCL info:

LOG:
83ea0c25-6a06-4f80-8703-9759d21fa062_no_amp.txt (83.7 KB)

83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Bootstrap : Using eth0:172.163.22.145<0>
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO P2P plugin IBext
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO NET/IB : No device found.
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO NET/IB : No device found.
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO NET/Socket : Using [0]eth0:172.163.22.145<0>
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Using network Socket
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Channel 00/04 :    0   1
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Channel 01/04 :    0   1
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Channel 02/04 :    0   1
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Channel 03/04 :    0   1
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/IPC
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/IPC
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/IPC
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/IPC
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Connected all rings
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO Connected all trees
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
83ea0c25-6a06-4f80-8703-9759d21fa062-86pn5:175:454 [0] NCCL INFO comm 0x7f43e7c1a310 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE

The rocket appears to be launching… but it gets stuck here.

A picture of the GPU status:

Thanks in advance.

I tried to follow the steps from these other posts:

But it gets stuck once started. I don’t know how long it needs to finish; I have been waiting more than 15 minutes with no movement or results.
The GPUs are locked at 100% clock frequency.
I also attach the nvidia-smi output:

nvidia-smi
root@9ea20d6ac6f2:/workspace/nccl-tests# nvidia-smi
Fri May 26 11:20:48 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:21:00.0 Off |                  Off |
| 32%   54C    P8    32W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  Off  | 00000000:22:00.0 Off |                  Off |
| 37%   58C    P8    41W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Attach the log:

root@9ea20d6ac6f2:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1034 on 9ea20d6ac6f2 device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1034 on 9ea20d6ac6f2 device  1 [0x22] NVIDIA RTX 6000 Ada Generation
9ea20d6ac6f2:1034:1034 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
9ea20d6ac6f2:1034:1034 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
9ea20d6ac6f2:1034:1034 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
9ea20d6ac6f2:1034:1034 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
9ea20d6ac6f2:1034:1034 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
9ea20d6ac6f2:1034:1034 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
9ea20d6ac6f2:1034:1043 [0] NCCL INFO P2P plugin IBext
9ea20d6ac6f2:1034:1043 [0] NCCL INFO NET/IB : No device found.
9ea20d6ac6f2:1034:1043 [0] NCCL INFO NET/IB : No device found.
9ea20d6ac6f2:1034:1043 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Using network Socket
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Using network Socket
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 00/04 :    0   1
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 01/04 :    0   1
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 02/04 :    0   1
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 03/04 :    0   1
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Connected all rings
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Connected all rings
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Connected all trees
9ea20d6ac6f2:1034:1044 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
9ea20d6ac6f2:1034:1044 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Connected all trees
9ea20d6ac6f2:1034:1043 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
9ea20d6ac6f2:1034:1043 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO comm 0x556ce5af0be0 rank 1 nranks 2 cudaDev 1 busId 22000 - Init COMPLETE
9ea20d6ac6f2:1034:1043 [0] NCCL INFO comm 0x556ce5aee150 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

How about using an old version of the TAO docker?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tensorrt:22.11-py3 /bin/bash

Then inside the docker
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

Also, can you add --shm-size=16g and --ulimit memlock=-1 to the docker command as well?
And also, before you run the nccl test, please add export NCCL_DEBUG=INFO or export NCCL_DEBUG=WARN. Refer to https://github.com/NVIDIA/nccl/issues/411 and https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out

docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm nvcr.io/nvidia/tensorrt:22.11-py3 /bin/bash
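
Then, inside the container, a minimal sequence combining the suggestions above (paths assume the nccl-tests clone from before):

$ export NCCL_DEBUG=INFO
$ cd nccl-tests && ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2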

Both options added to the docker command, and the result is the same: the GPUs get stuck at 100% clock frequency and nothing is shown:

> export NCCL_DEBUG=TRACE

LOG
docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm nvcr.io/nvidia/tensorrt:22.11-py3 /bin/bash

./build/all_reduce_perf -b 8 -e 128M -f 2 -g2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1114 on 3c7bb4b1e648 device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1114 on 3c7bb4b1e648 device  1 [0x22] NVIDIA RTX 6000 Ada Generation
3c7bb4b1e648:1114:1114 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
3c7bb4b1e648:1114:1114 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
3c7bb4b1e648:1114:1114 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
3c7bb4b1e648:1114:1114 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
3c7bb4b1e648:1114:1114 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
3c7bb4b1e648:1114:1114 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
3c7bb4b1e648:1114:1123 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
3c7bb4b1e648:1114:1123 [0] NCCL INFO P2P plugin IBext
3c7bb4b1e648:1114:1123 [0] NCCL INFO NET/IB : No device found.
3c7bb4b1e648:1114:1123 [0] NCCL INFO NET/IB : No device found.
3c7bb4b1e648:1114:1123 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
3c7bb4b1e648:1114:1123 [0] NCCL INFO Using network Socket
3c7bb4b1e648:1114:1124 [1] NCCL INFO Using network Socket
3c7bb4b1e648:1114:1123 [0] NCCL INFO Channel 00/04 :    0   1
3c7bb4b1e648:1114:1123 [0] NCCL INFO Channel 01/04 :    0   1
3c7bb4b1e648:1114:1123 [0] NCCL INFO Channel 02/04 :    0   1
3c7bb4b1e648:1114:1123 [0] NCCL INFO Channel 03/04 :    0   1
3c7bb4b1e648:1114:1124 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
3c7bb4b1e648:1114:1123 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
3c7bb4b1e648:1114:1124 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
3c7bb4b1e648:1114:1123 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
3c7bb4b1e648:1114:1124 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
3c7bb4b1e648:1114:1123 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
3c7bb4b1e648:1114:1124 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
3c7bb4b1e648:1114:1123 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
3c7bb4b1e648:1114:1124 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
3c7bb4b1e648:1114:1123 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
3c7bb4b1e648:1114:1123 [0] NCCL INFO Connected all rings
3c7bb4b1e648:1114:1124 [1] NCCL INFO Connected all rings
3c7bb4b1e648:1114:1123 [0] NCCL INFO Connected all trees
3c7bb4b1e648:1114:1124 [1] NCCL INFO Connected all trees
3c7bb4b1e648:1114:1124 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
3c7bb4b1e648:1114:1124 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
3c7bb4b1e648:1114:1123 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
3c7bb4b1e648:1114:1123 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
3c7bb4b1e648:1114:1123 [0] NCCL INFO comm 0x556c2dad58c0 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE
3c7bb4b1e648:1114:1124 [1] NCCL INFO comm 0x556c2dad8350 rank 1 nranks 2 cudaDev 1 busId 22000 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

Wait, this last post has the solution!

export NCCL_P2P_LEVEL=NVL

LOG
root@3c7bb4b1e648:/workspace/nccl-tests# export NCCL_P2P_LEVEL=NVL
root@3c7bb4b1e648:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1129 on 3c7bb4b1e648 device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1129 on 3c7bb4b1e648 device  1 [0x22] NVIDIA RTX 6000 Ada Generation
3c7bb4b1e648:1129:1129 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
3c7bb4b1e648:1129:1129 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
3c7bb4b1e648:1129:1129 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
3c7bb4b1e648:1129:1129 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
3c7bb4b1e648:1129:1129 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
3c7bb4b1e648:1129:1129 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
3c7bb4b1e648:1129:1138 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
3c7bb4b1e648:1129:1138 [0] NCCL INFO P2P plugin IBext
3c7bb4b1e648:1129:1138 [0] NCCL INFO NET/IB : No device found.
3c7bb4b1e648:1129:1138 [0] NCCL INFO NET/IB : No device found.
3c7bb4b1e648:1129:1138 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
3c7bb4b1e648:1129:1138 [0] NCCL INFO Using network Socket
3c7bb4b1e648:1129:1139 [1] NCCL INFO Using network Socket
3c7bb4b1e648:1129:1138 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
3c7bb4b1e648:1129:1138 [0] NCCL INFO Channel 00/04 :    0   1
3c7bb4b1e648:1129:1138 [0] NCCL INFO Channel 01/04 :    0   1
3c7bb4b1e648:1129:1138 [0] NCCL INFO Channel 02/04 :    0   1
3c7bb4b1e648:1129:1138 [0] NCCL INFO Channel 03/04 :    0   1
3c7bb4b1e648:1129:1139 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
3c7bb4b1e648:1129:1138 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
3c7bb4b1e648:1129:1139 [1] NCCL INFO Channel 00 : 1[22000] -> 0[21000] via SHM/direct/direct
3c7bb4b1e648:1129:1138 [0] NCCL INFO Channel 00 : 0[21000] -> 1[22000] via SHM/direct/direct
3c7bb4b1e648:1129:1139 [1] NCCL INFO Channel 01 : 1[22000] -> 0[21000] via SHM/direct/direct
3c7bb4b1e648:1129:1138 [0] NCCL INFO Channel 01 : 0[21000] -> 1[22000] via SHM/direct/direct
3c7bb4b1e648:1129:1139 [1] NCCL INFO Channel 02 : 1[22000] -> 0[21000] via SHM/direct/direct
3c7bb4b1e648:1129:1138 [0] NCCL INFO Channel 02 : 0[21000] -> 1[22000] via SHM/direct/direct
3c7bb4b1e648:1129:1139 [1] NCCL INFO Channel 03 : 1[22000] -> 0[21000] via SHM/direct/direct
3c7bb4b1e648:1129:1138 [0] NCCL INFO Channel 03 : 0[21000] -> 1[22000] via SHM/direct/direct
3c7bb4b1e648:1129:1138 [0] NCCL INFO Connected all rings
3c7bb4b1e648:1129:1139 [1] NCCL INFO Connected all rings
3c7bb4b1e648:1129:1138 [0] NCCL INFO Connected all trees
3c7bb4b1e648:1129:1139 [1] NCCL INFO Connected all trees
3c7bb4b1e648:1129:1139 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
3c7bb4b1e648:1129:1139 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
3c7bb4b1e648:1129:1138 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
3c7bb4b1e648:1129:1138 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
3c7bb4b1e648:1129:1139 [1] NCCL INFO comm 0x56192d36d730 rank 1 nranks 2 cudaDev 1 busId 22000 - Init COMPLETE
3c7bb4b1e648:1129:1138 [0] NCCL INFO comm 0x56192d36aca0 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     6.78    0.00    0.00      0     6.79    0.00    0.00      0
          16             4     float     sum      -1     7.48    0.00    0.00      0     6.79    0.00    0.00      0
          32             8     float     sum      -1     7.03    0.00    0.00      0     6.88    0.00    0.00      0
          64            16     float     sum      -1     7.03    0.01    0.01      0     6.99    0.01    0.01      0
         128            32     float     sum      -1     6.97    0.02    0.02      0     6.94    0.02    0.02      0
         256            64     float     sum      -1     7.15    0.04    0.04      0     6.98    0.04    0.04      0
         512           128     float     sum      -1     7.53    0.07    0.07      0     6.87    0.07    0.07      0
        1024           256     float     sum      -1     7.42    0.14    0.14      0     7.12    0.14    0.14      0
        2048           512     float     sum      -1     7.55    0.27    0.27      0     7.43    0.28    0.28      0
        4096          1024     float     sum      -1     7.85    0.52    0.52      0     7.67    0.53    0.53      0
        8192          2048     float     sum      -1     9.05    0.91    0.91      0     8.58    0.96    0.96      0
       16384          4096     float     sum      -1    10.66    1.54    1.54      0    10.44    1.57    1.57      0
       32768          8192     float     sum      -1    14.54    2.25    2.25      0    14.57    2.25    2.25      0
       65536         16384     float     sum      -1    22.07    2.97    2.97      0    22.10    2.97    2.97      0
      131072         32768     float     sum      -1    34.11    3.84    3.84      0    34.84    3.76    3.76      0
      262144         65536     float     sum      -1    56.44    4.64    4.64      0    67.68    3.87    3.87      0
      524288        131072     float     sum      -1    83.45    6.28    6.28      0    83.11    6.31    6.31      0
     1048576        262144     float     sum      -1    149.5    7.01    7.01      0    149.8    7.00    7.00      0
     2097152        524288     float     sum      -1    304.1    6.90    6.90      0    305.9    6.86    6.86      0
     4194304       1048576     float     sum      -1    587.6    7.14    7.14      0    582.4    7.20    7.20      0
     8388608       2097152     float     sum      -1   1149.4    7.30    7.30      0   1145.4    7.32    7.32      0
    16777216       4194304     float     sum      -1   2257.6    7.43    7.43      0   2258.2    7.43    7.43      0
    33554432       8388608     float     sum      -1   4516.1    7.43    7.43      0   4522.2    7.42    7.42      0
    67108864      16777216     float     sum      -1   9121.6    7.36    7.36      0   9162.4    7.32    7.32      0
   134217728      33554432     float     sum      -1    18424    7.28    7.28      0    18461    7.27    7.27      0
3c7bb4b1e648:1129:1129 [1] NCCL INFO comm 0x56192d36aca0 rank 0 nranks 2 cudaDev 0 busId 21000 - Destroy COMPLETE
3c7bb4b1e648:1129:1129 [1] NCCL INFO comm 0x56192d36d730 rank 1 nranks 2 cudaDev 1 busId 22000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.23917 
#

How can I insert this in the TAO API pod?
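
One idea I have not tested yet would be to patch the environment of the API deployment with kubectl (the deployment name here is only a placeholder, and I am not sure the variable would propagate to the training pods that the API spawns):

$ kubectl set env deployment/<tao-api-deployment> NCCL_P2P_LEVEL=NVL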

Thanks for the info. Could you please first check if the below works?

docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then inside the docker,

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e spec.txt -r result -k key

Is any special spec.txt file needed?
Or should I mount a training spec file I already used?

With a little makeup:

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r result -k tlt_encode
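
(For reference, the spec path is available inside the container because the local experiment folder is mounted with -v; the local path is only a placeholder here:)

$ docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -v /path/to/tao-experiments:/workspace/tao-experiments -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash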

Attaching the log. It ends with a strange error code.

log_nvidia.txt (110.9 KB)

For the 4.0.1 docker, the error is the same as https://forums.developer.nvidia.com/t/error-during-multi-gpu-training-of-classification-tf1-cma-ep-c-process-vm-readv-operation-not-permitted/.
According to that topic, there are two options here.

  1. You can use the 22.05 docker, which has a working MPI version (openmpi-4.1.2).
  2. In the 4.0.1 docker, change the MPI version.

# from https://edu.itp.phys.ethz.ch/hs12/programming_techniques/openmpi.pdf and https://www.open-mpi.org/software/ompi/v4.1/
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
mkdir src
mv openmpi-4.1.5.tar.bz2 src/
cd src/
tar -jxf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5
./configure --prefix=$HOME/opt/openmpi
make -j128 all
make install
mpirun --version
echo "export PATH=$PATH:$HOME/opt/openmpi/bin" >> $HOME/.bashrc
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/opt/openmpi/lib" >> $HOME/.bashrc
. ~/.bashrc
export OPAL_PREFIX=$HOME/opt/openmpi/
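
If the container still picks up the default mpirun under /opt/hpcx, a sketch to prefer the freshly built one in the current shell (assuming the prefix used above):

export PATH=$HOME/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/openmpi/lib:$LD_LIBRARY_PATH
which mpirun   # should now point to $HOME/opt/openmpi/bin/mpirun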

Then,

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r result -k tlt_encode

Have a better look.

I tried to install it in the 4.0.1 docker, and it appears to work correctly. Now there is some mismatch between the images and the labels, which I will try to fix now.

But in the meantime, how can I apply that to the TAO API pods?

Thanks for the info. Could you please share the log?

For TAO API pod, I will check internally how to handle it.

With the new mpirun and only the exports that you gave in the last comment, I have the same problem shown in the TAO API pod: it gets stuck at the NCCL step.

log_nvidia3.txt (105.4 KB)

Adding export NCCL_P2P_LEVEL=NVL, the training starts!

log_nvidia3.txt (105.4 KB)

Seems that it is the same log as the one without adding export NCCL_P2P_LEVEL=NVL. Would you please share the exact one? Thanks a lot.

Sorry my mistake:
log_nvidia3.txt (116.3 KB)

Also, is it possible to share the result of $ nvidia-smi topo -m?

Also, for the 4.0.1 docker, according to your description, even without updating to openmpi-4.1.5.tar.bz2, just setting export NCCL_P2P_LEVEL=NVL makes the training work, right?

In theory it is updated to the latest OpenMPI 4.1.5 in both logs:
one with export NCCL_P2P_LEVEL=NVL and the other without.

Here is the topology:

nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	PHB	0-63		N/A
GPU1	PHB	 X 	0-63		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

And I attach the MPI version:

root@cc8b63e0b034:/workspace/src/openmpi-4.1.5# mpirun --version
mpirun (Open MPI) 4.1.5

Report bugs to http://www.open-mpi.org/community/help/

Could you please help run several experiments?

Exp1: Check if below works.

$ unset NCCL_P2P_LEVEL
$ mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 NCCL_P2P_LEVEL=NVL python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode

Exp2: Check if below works.

$ unset NCCL_P2P_LEVEL
$ mpirun --allow-run-as-root -np 2 NCCL_P2P_LEVEL=NVL python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode

In both cases the same error:

root@cc8b63e0b034:/workspace/src/openmpi-4.1.5# unset NCCL_P2P_LEVEL
root@cc8b63e0b034:/workspace/src/openmpi-4.1.5# env | grep NCCL_P2P_LEVEL
root@cc8b63e0b034:/workspace/src/openmpi-4.1.5# mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 NCCL_P2P_LEVEL=NVL python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       cc8b63e0b034
Executable: NCCL_P2P_LEVEL=NVL
--------------------------------------------------------------------------
2 total processes failed to start
root@cc8b63e0b034:/workspace/src/openmpi-4.1.5# mpirun --allow-run-as-root -np 2 NCCL_P2P_LEVEL=NVL python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       cc8b63e0b034
Executable: NCCL_P2P_LEVEL=NVL
--------------------------------------------------------------------------
2 total processes failed to start
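
From the message, mpirun takes NCCL_P2P_LEVEL=NVL as the executable. If the variable has to go through mpirun itself, the usual way with OpenMPI should be the -x flag (not tested here):

$ mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 -x NCCL_P2P_LEVEL=NVL python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode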

Maybe it is more straightforward to set the environment variable in the docker command?

docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -e "NCCL_P2P_LEVEL=NVL" -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Yes, you are correct. Could you please trigger

docker run --runtime=nvidia --shm-size=16g --ulimit memlock=-1 -e NCCL_P2P_LEVEL=NVL -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then,

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode

and

mpirun --allow-run-as-root -np 2 python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode

The same result. When the docker container is initialized, it returns to the default OpenMPI version:

root@41d3ea49ec8f:/workspace# mpirun --version
mpirun (Open MPI) 4.1.5a1

Report bugs to http://www.open-mpi.org/community/help/

root@41d3ea49ec8f:/workspace# env | grep NCCL
NCCL_VERSION=2.15.5
NCCL_P2P_LEVEL=NVL


Attaching the final part of the log. It is the same problem seen at the beginning of the post; it was only solved by OpenMPI 4.1.5.

2023-05-29 15:34:40,472 [INFO] tensorflow: Done running local_init_op.
[41d3ea49ec8f:115  :0:391]      cma_ep.c:81   process_vm_writev(pid=116 {0x7f724858ad28,14761}-->{0x7f5b2c4f65c0,14761}) returned -1: Operation not permitted
==== backtrace (tid:    391) ====
 0 0x00000000000039f2 uct_cma_ep_tx_error()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:81
 1 0x0000000000003d66 uct_cma_ep_tx()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:114
 2 0x000000000001e209 uct_scopy_ep_progress_tx()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/base/scopy_ep.c:151
 3 0x00000000000516d6 ucs_arbiter_dispatch_nonempty()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.c:321
 4 0x000000000001dcf1 ucs_arbiter_dispatch()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.h:386
 5 0x0000000000052467 ucs_callbackq_slow_proxy()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.c:404
 6 0x000000000004be9a ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.h:211
 7 0x000000000004be9a uct_worker_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/api/uct.h:2647
 8 0x000000000004be9a ucp_worker_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucp/core/ucp_worker.c:2804
 9 0x0000000000037144 opal_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/runtime/opal_progress.c:231
10 0x000000000003dc05 ompi_sync_wait_mt()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/threads/wait_sync.c:85
11 0x0000000000055fba ompi_request_default_wait_all()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/request/req_wait.c:234
12 0x0000000000093949 ompi_coll_base_bcast_intra_basic_linear()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:679
13 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
14 0x000000000006cc11 PMPI_Bcast()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:114
15 0x000000000006cc11 PMPI_Bcast()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:41
16 0x0000000000101ef1 horovod::common::MPIController::SendFinalTensors()  /opt/horovod/horovod/common/mpi/mpi_controller.cc:187
17 0x0000000000101ef1 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()  /usr/include/c++/9/bits/basic_string.h:3706
18 0x0000000000101ef1 horovod::common::MPIController::SendFinalTensors()  /opt/horovod/horovod/common/mpi/mpi_controller.cc:182
19 0x0000000000085f3c horovod::common::Controller::ComputeResponseList()  /opt/horovod/horovod/common/controller.cc:427
20 0x00000000000a8a23 horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:756
21 0x00000000000a8a23 BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
23 0x0000000000008609 start_thread()  ???:0
24 0x000000000011f133 clone()  ???:0
=================================
[41d3ea49ec8f:00115] *** Process received signal ***
[41d3ea49ec8f:00115] Signal: Aborted (6)
[41d3ea49ec8f:00115] Signal code:  (-6)
[41d3ea49ec8f:00115] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f73eee71090]
[41d3ea49ec8f:00115] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f73eee7100b]
[41d3ea49ec8f:00115] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f73eee50859]
[41d3ea49ec8f:00115] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(+0x5a7dd)[0x7f72412837dd]
[41d3ea49ec8f:00115] [ 4] /opt/hpcx/ucx/lib/libucs.so.0(+0x5fdc2)[0x7f7241288dc2]
[41d3ea49ec8f:00115] [ 5] /opt/hpcx/ucx/lib/libucs.so.0(ucs_log_dispatch+0xe4)[0x7f7241289194]
[41d3ea49ec8f:00115] [ 6] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(+0x39f2)[0x7f72400b09f2]
[41d3ea49ec8f:00115] [ 7] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x186)[0x7f72400b0d66]
[41d3ea49ec8f:00115] [ 8] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x69)[0x7f7241208209]
[41d3ea49ec8f:00115] [ 9] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xb6)[0x7f724127a6d6]
[41d3ea49ec8f:00115] [10] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x81)[0x7f7241207cf1]
[41d3ea49ec8f:00115] [11] /opt/hpcx/ucx/lib/libucs.so.0(+0x52467)[0x7f724127b467]
[41d3ea49ec8f:00115] [12] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x7f72413fbe9a]
[41d3ea49ec8f:00115] [13] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x7f7243fe6144]
[41d3ea49ec8f:00115] [14] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7f7243fecc05]
[41d3ea49ec8f:00115] [15] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x3ca)[0x7f72441d3fba]
[41d3ea49ec8f:00115] [16] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_basic_linear+0x119)[0x7f7244211949]
[41d3ea49ec8f:00115] [17] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7f7224a01840]
[41d3ea49ec8f:00115] [18] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7f72441eac11]
[41d3ea49ec8f:00115] [19] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13MPIController16SendFinalTensorsERNS0_12ResponseListE+0x91)[0x7f72443b2ef1]
[41d3ea49ec8f:00115] [20] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller19ComputeResponseListEbRNS0_18HorovodGlobalStateERNS0_10ProcessSetE+0x1d3c)[0x7f7244336f3c]
[41d3ea49ec8f:00115] [21] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa8a23)[0x7f7244359a23]
[41d3ea49ec8f:00115] [22] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f73ee1d9de4]
[41d3ea49ec8f:00115] [23] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f73eee13609]
[41d3ea49ec8f:00115] [24] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f73eef4d133]
[41d3ea49ec8f:00115] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 41d3ea49ec8f exited on signal 6 (Aborted).
--------------------------------------------------------------------------

OK. For temporary use, we can generate a new 4.0.1 docker image.
First, make the changes mentioned above to install OpenMPI 4.1.5 in the 4.0.1 docker.
Then, open a new terminal to generate a new docker image.
$ docker commit <container_id> image_name (container_id is from "docker ps")

We can override the initial 4.0.1 docker image by setting image_name to the same name as the 4.0.1 docker image.
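
For example, assuming docker ps reports the running 4.0.1 container as 41d3ea49ec8f (the id seen in the logs above):

$ docker commit 41d3ea49ec8f nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5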
