Error during multi-GPU training of classification_tf1: cma_ep.c process_vm_readv Operation not permitted

I can train with a single GPU without any issues, but when I try to train with more than one GPU, I get the error:

cma_ep.c:81   process_vm_readv(pid=650 {0x7f7cf45dd3d0,16569}-->{0x7fe11c5efd28,16569}) returned -1: Operation not permitted

The error occurs immediately after the first batch of the first epoch.

The following TAO command runs fine with --gpus set to 1, but I get the above error when I set it to 2 or more:

tao classification_tf1 train -e /data/e1.cfg -r /results/e1 --gpus 1 -k nvidia_tlt
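
For example, the same command fails as soon as the GPU count is raised:

tao classification_tf1 train -e /data/e1.cfg -r /results/e1 --gpus 2 -k nvidia_tlt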

• Hardware: DGX-2 (16x V100s)
• Network Type: classification_tf1
• TAO Version:

Configuration of the TAO Toolkit Instance

dockers: 		
	nvidia/tao/tao-toolkit: 			
		4.0.0-tf2.9.1: 				
			docker_registry: nvcr.io
			tasks: 
				1. classification_tf2
				2. efficientdet_tf2
		4.0.0-tf1.15.5: 				
			docker_registry: nvcr.io
			tasks: 
				1. augment
				2. bpnet
				3. classification_tf1
				4. detectnet_v2
				5. dssd
				6. emotionnet
				7. efficientdet_tf1
				8. faster_rcnn
				9. fpenet
				10. gazenet
				11. gesturenet
				12. heartratenet
				13. lprnet
				14. mask_rcnn
				15. multitask_classification
				16. retinanet
				17. ssd
				18. unet
				19. yolo_v3
				20. yolo_v4
				21. yolo_v4_tiny
				22. converter
		4.0.1-tf1.15.5: 				
			docker_registry: nvcr.io
			tasks: 
				1. mask_rcnn
				2. unet
		4.0.0-pyt: 				
			docker_registry: nvcr.io
			tasks: 
				1. action_recognition
				2. deformable_detr
				3. segformer
				4. re_identification
				5. pointpillars
				6. pose_classification
				7. n_gram
				8. speech_to_text
				9. speech_to_text_citrinet
				10. speech_to_text_conformer
				11. spectro_gen
				12. vocoder
				13. text_classification
				14. question_answering
				15. token_classification
				16. intent_slot_classification
				17. punctuation_and_capitalization
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   26C    P0    49W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:36:00.0 Off |                    0 |
| N/A   26C    P0    49W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   32C    P0    50W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM3...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    49W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   26C    P0    47W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM3...  On   | 00000000:59:00.0 Off |                    0 |
| N/A   32C    P0    49W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM3...  On   | 00000000:5C:00.0 Off |                    0 |
| N/A   27C    P0    47W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM3...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   31C    P0    50W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   8  Tesla V100-SXM3...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   31C    P0    49W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   9  Tesla V100-SXM3...  On   | 00000000:B9:00.0 Off |                    0 |
| N/A   28C    P0    48W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|  10  Tesla V100-SXM3...  On   | 00000000:BC:00.0 Off |                    0 |
| N/A   34C    P0    48W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|  11  Tesla V100-SXM3...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   32C    P0    49W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|  12  Tesla V100-SXM3...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   31C    P0    48W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|  13  Tesla V100-SXM3...  On   | 00000000:E2:00.0 Off |                    0 |
| N/A   30C    P0    48W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|  14  Tesla V100-SXM3...  On   | 00000000:E5:00.0 Off |                    0 |
| N/A   35C    P0    48W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|  15  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   36C    P0    50W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

To narrow this down, please log in to the 4.0.1 docker directly and install an older version of NCCL as shown below (the last command prints the installed NCCL version as a check). Then run the training again. Thanks.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt install libnccl2=2.11.4-1+cuda11.6 libnccl-dev=2.11.4-1+cuda11.6
ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'

The fourth command (add-apt-repository) returned a Python error: ModuleNotFoundError: No module named 'apt_pkg'. Please let me know whether I should retry some modification of the first four steps.
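
A possible workaround I considered (untested here, and assuming the failure comes from add-apt-repository's dependency on the python3-apt module) would be to write the repository entry directly instead of calling add-apt-repository:

# hypothetical workaround: add the CUDA repository line manually
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" > /etc/apt/sources.list.d/cuda.list
apt-get update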

In the end, after restarting my docker container and running only the last three commands (skipping the first four), it appears to have worked. Here's the output of the last step:

# ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
/sbin/ldconfig.real: Path `/usr/local/cuda-11/targets/x86_64-linux/lib' given more than once
/sbin/ldconfig.real: Path `/usr/local/cuda/lib64' given more than once
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib64: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib' given more than once
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/ld-2.31.so is the dynamic linker, ignoring

2.11.4

Unfortunately I am still getting the same error when I try to run my training.

Can you run the experiments below to check whether multiple GPUs work? (The -g flag sets the number of GPUs the test uses.)

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tensorrt:22.11-py3 /bin/bash

Then inside the docker
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

Both tests ran successfully with similar output.

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1018 on 110db7276d2a device  0 [0x34] Tesla V100-SXM3-32GB
#  Rank  1 Group  0 Pid   1018 on 110db7276d2a device  1 [0x36] Tesla V100-SXM3-32GB
#  Rank  2 Group  0 Pid   1018 on 110db7276d2a device  2 [0x39] Tesla V100-SXM3-32GB
#  Rank  3 Group  0 Pid   1018 on 110db7276d2a device  3 [0x3b] Tesla V100-SXM3-32GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    15.65    0.00    0.00      0    15.51    0.00    0.00      0
          16             4     float     sum      -1    17.06    0.00    0.00      0    15.37    0.00    0.00      0
          32             8     float     sum      -1    18.00    0.00    0.00      0    17.26    0.00    0.00      0
          64            16     float     sum      -1    16.32    0.00    0.01      0    16.30    0.00    0.01      0
         128            32     float     sum      -1    17.82    0.01    0.01      0    15.52    0.01    0.01      0
         256            64     float     sum      -1    17.12    0.01    0.02      0    15.50    0.02    0.02      0
         512           128     float     sum      -1    17.70    0.03    0.04      0    15.39    0.03    0.05      0
        1024           256     float     sum      -1    17.39    0.06    0.09      0    16.08    0.06    0.10      0
        2048           512     float     sum      -1    17.70    0.12    0.17      0    15.55    0.13    0.20      0
        4096          1024     float     sum      -1    17.15    0.24    0.36      0    15.43    0.27    0.40      0
        8192          2048     float     sum      -1    17.06    0.48    0.72      0    16.56    0.49    0.74      0
       16384          4096     float     sum      -1    17.45    0.94    1.41      0    16.01    1.02    1.54      0
       32768          8192     float     sum      -1    18.42    1.78    2.67      0    16.63    1.97    2.96      0
       65536         16384     float     sum      -1    20.16    3.25    4.88      0    18.30    3.58    5.37      0
      131072         32768     float     sum      -1    23.59    5.56    8.33      0    21.62    6.06    9.09      0
      262144         65536     float     sum      -1    27.80    9.43   14.14      0    25.18   10.41   15.61      0
      524288        131072     float     sum      -1    48.59   10.79   16.19      0    48.13   10.89   16.34      0
     1048576        262144     float     sum      -1    60.36   17.37   26.06      0    59.31   17.68   26.52      0
     2097152        524288     float     sum      -1    80.19   26.15   39.23      0    78.58   26.69   40.03      0
     4194304       1048576     float     sum      -1    119.5   35.10   52.65      0    118.5   35.40   53.09      0
     8388608       2097152     float     sum      -1    138.7   60.48   90.72      0    138.1   60.75   91.13      0
    16777216       4194304     float     sum      -1    237.7   70.59  105.89      0    237.6   70.62  105.93      0
    33554432       8388608     float     sum      -1    441.8   75.95  113.93      0    442.0   75.91  113.87      0
    67108864      16777216     float     sum      -1    849.0   79.04  118.57      0    849.5   79.00  118.50      0
   134217728      33554432     float     sum      -1   1667.4   80.49  120.74      0   1667.7   80.48  120.72      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 28.7815 

Similarly, please run the same tests inside the TAO docker as well.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then inside the docker
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

Both tests in the TAO container ran successfully; however, there are additional log messages compared to the output from the tensorrt container. I've included all of the log messages in case they are significant. The most concerning one to me is NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol., but NCCL seems to recover fine.

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1034 on 4120a457223a device  0 [0x34] Tesla V100-SXM3-32GB
#  Rank  1 Group  0 Pid   1034 on 4120a457223a device  1 [0x36] Tesla V100-SXM3-32GB
#  Rank  2 Group  0 Pid   1034 on 4120a457223a device  2 [0x39] Tesla V100-SXM3-32GB
#  Rank  3 Group  0 Pid   1034 on 4120a457223a device  3 [0x3b] Tesla V100-SXM3-32GB
4120a457223a:1034:1034 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
4120a457223a:1034:1034 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
4120a457223a:1034:1034 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
4120a457223a:1034:1034 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
4120a457223a:1034:1034 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
4120a457223a:1034:1034 [3] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
4120a457223a:1034:1043 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
4120a457223a:1034:1043 [0] NCCL INFO P2P plugin IBext
4120a457223a:1034:1043 [0] NCCL INFO NET/IB : No device found.
4120a457223a:1034:1043 [0] NCCL INFO NET/IB : No device found.
4120a457223a:1034:1043 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
4120a457223a:1034:1043 [0] NCCL INFO Using network Socket
4120a457223a:1034:1044 [1] NCCL INFO Using network Socket
4120a457223a:1034:1045 [2] NCCL INFO Using network Socket
4120a457223a:1034:1046 [3] NCCL INFO Using network Socket
4120a457223a:1034:1043 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
4120a457223a:1034:1045 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
4120a457223a:1034:1044 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
4120a457223a:1034:1046 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
4120a457223a:1034:1045 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1
4120a457223a:1034:1046 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->2 [5] -1/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] -1/-1/-1->3->2 [8] -1/-1/-1->3->2 [9] -1/-1/-1->3->2 [10] -1/-1/-1->3->2 [11] -1/-1/-1->3->2
4120a457223a:1034:1043 [0] NCCL INFO Channel 00/12 :    0   1   2   3
4120a457223a:1034:1044 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0
4120a457223a:1034:1043 [0] NCCL INFO Channel 01/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 02/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 03/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 04/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 05/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 06/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 07/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 08/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 09/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 10/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Channel 11/12 :    0   1   2   3
4120a457223a:1034:1043 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1
4120a457223a:1034:1045 [2] NCCL INFO Channel 00/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 00/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 00/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 00/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 01/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 01/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 01/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 01/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 02/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 02/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 02/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 02/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 03/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 03/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 03/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 03/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 04/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 04/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 04/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 04/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 05/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 05/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 05/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 05/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 06/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 06/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 06/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 06/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 07/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 07/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 07/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 07/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 08/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 08/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 08/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 08/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 09/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 09/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 09/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 09/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 10/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 10/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 10/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 10/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 11/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 11/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 11/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 11/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Connected all rings
4120a457223a:1034:1044 [1] NCCL INFO Connected all rings
4120a457223a:1034:1046 [3] NCCL INFO Connected all rings
4120a457223a:1034:1046 [3] NCCL INFO Channel 00/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Connected all rings
4120a457223a:1034:1046 [3] NCCL INFO Channel 01/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 02/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 03/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 04/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 05/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 06/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 07/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 08/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 09/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 10/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 11/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 00/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 00/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 01/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 01/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 02/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 02/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 03/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 03/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 04/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 04/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 05/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 05/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 06/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 06/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 07/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 07/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 08/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 08/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 09/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 09/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 10/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 10/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 11/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 11/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Connected all trees
4120a457223a:1034:1043 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
4120a457223a:1034:1043 [0] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
4120a457223a:1034:1044 [1] NCCL INFO Connected all trees
4120a457223a:1034:1044 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
4120a457223a:1034:1046 [3] NCCL INFO Connected all trees
4120a457223a:1034:1046 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
4120a457223a:1034:1045 [2] NCCL INFO Connected all trees
4120a457223a:1034:1045 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
4120a457223a:1034:1044 [1] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
4120a457223a:1034:1046 [3] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
4120a457223a:1034:1045 [2] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
4120a457223a:1034:1044 [1] NCCL INFO comm 0x55d0915427b0 rank 1 nranks 4 cudaDev 1 busId 36000 - Init COMPLETE
4120a457223a:1034:1046 [3] NCCL INFO comm 0x55d08c2527a0 rank 3 nranks 4 cudaDev 3 busId 3b000 - Init COMPLETE
4120a457223a:1034:1045 [2] NCCL INFO comm 0x55d08c24fd10 rank 2 nranks 4 cudaDev 2 busId 39000 - Init COMPLETE
4120a457223a:1034:1043 [0] NCCL INFO comm 0x55d09153fd20 rank 0 nranks 4 cudaDev 0 busId 34000 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    15.42    0.00    0.00      0    15.56    0.00    0.00      0
          16             4     float     sum      -1    16.38    0.00    0.00      0    15.30    0.00    0.00      0
          32             8     float     sum      -1    17.12    0.00    0.00      0    15.54    0.00    0.00      0
          64            16     float     sum      -1    16.29    0.00    0.01      0    16.34    0.00    0.01      0
         128            32     float     sum      -1    16.94    0.01    0.01      0    16.40    0.01    0.01      0
         256            64     float     sum      -1    17.78    0.01    0.02      0    15.43    0.02    0.02      0
         512           128     float     sum      -1    17.77    0.03    0.04      0    15.36    0.03    0.05      0
        1024           256     float     sum      -1    18.00    0.06    0.09      0    16.56    0.06    0.09      0
        2048           512     float     sum      -1    17.12    0.12    0.18      0    16.39    0.12    0.19      0
        4096          1024     float     sum      -1    17.28    0.24    0.36      0    15.51    0.26    0.40      0
        8192          2048     float     sum      -1    17.24    0.48    0.71      0    15.81    0.52    0.78      0
       16384          4096     float     sum      -1    17.43    0.94    1.41      0    16.79    0.98    1.46      0
       32768          8192     float     sum      -1    18.44    1.78    2.66      0    16.72    1.96    2.94      0
       65536         16384     float     sum      -1    19.95    3.29    4.93      0    18.21    3.60    5.40      0
      131072         32768     float     sum      -1    23.24    5.64    8.46      0    21.67    6.05    9.07      0
      262144         65536     float     sum      -1    26.36    9.94   14.92      0    24.93   10.51   15.77      0
      524288        131072     float     sum      -1    48.72   10.76   16.14      0    48.27   10.86   16.29      0
     1048576        262144     float     sum      -1    60.53   17.32   25.99      0    59.17   17.72   26.58      0
     2097152        524288     float     sum      -1    80.30   26.12   39.17      0    78.44   26.73   40.10      0
     4194304       1048576     float     sum      -1    119.7   35.05   52.58      0    118.1   35.50   53.25      0
     8388608       2097152     float     sum      -1    138.4   60.63   90.94      0    137.6   60.94   91.41      0
    16777216       4194304     float     sum      -1    237.3   70.70  106.06      0    237.4   70.68  106.03      0
    33554432       8388608     float     sum      -1    441.1   76.07  114.11      0    441.4   76.01  114.02      0
    67108864      16777216     float     sum      -1    848.3   79.11  118.67      0    848.6   79.08  118.63      0
   134217728      33554432     float     sum      -1   1669.7   80.38  120.57      0   1667.3   80.50  120.75      0
4120a457223a:1034:1034 [3] NCCL INFO comm 0x55d09153fd20 rank 0 nranks 4 cudaDev 0 busId 34000 - Destroy COMPLETE
4120a457223a:1034:1034 [3] NCCL INFO comm 0x55d0915427b0 rank 1 nranks 4 cudaDev 1 busId 36000 - Destroy COMPLETE
4120a457223a:1034:1034 [3] NCCL INFO comm 0x55d08c24fd10 rank 2 nranks 4 cudaDev 2 busId 39000 - Destroy COMPLETE
4120a457223a:1034:1034 [3] NCCL INFO comm 0x55d08c2527a0 rank 3 nranks 4 cudaDev 3 busId 3b000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 28.8257 

OK, could you please log in to the TAO docker and run the classification training again?

$ docker run --runtime=nvidia -it --rm -v your/local/path:docker/path nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then, run the training without the “tao” launcher:
# classification train -e /data/e1.cfg -r /results/e1 --gpus 2 -k nvidia_tlt

If the issue still happens, please share the full log with us. Thanks.

Hi Morganh, thanks for continuing to look into this. Here’s the log.
log.txt (59.0 KB)

To narrow this down further, could you please run the experiments below?

  1. Continuing with the environment above, run the MPI hello-world test following this guide:
    MPI Hello World · MPI Tutorial

  2. Set up a new environment using an older version of the TAO container:
    $ docker run --runtime=nvidia -it --rm -v your/local/path:docker/path nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash
    Then, run the training without the “tao” launcher:
    # classification train -e /data/e1.cfg -r /results/e1 --gpus 2 -k nvidia_tlt

  1. Here’s the output of the MPI Hello World test, which looks as expected. I needed --allow-run-as-root because I was running as root inside the docker container. Alternatively, I was able to run the tutorials without --allow-run-as-root by passing -e OMPI_ALLOW_RUN_AS_ROOT=1 -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 when starting the docker container (sketched after the output below); however, these flags had no effect on my multi-GPU training attempts.
# mpirun --allow-run-as-root -n 4 ./mpi-hello-world/code/mpi_hello_world
Hello world from processor c5288c2f1e10, rank 1 out of 4 processors
Hello world from processor c5288c2f1e10, rank 0 out of 4 processors
Hello world from processor c5288c2f1e10, rank 2 out of 4 processors
Hello world from processor c5288c2f1e10, rank 3 out of 4 processors
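
For reference, a sketch of how the container can be started with those variables set (same 4.0.1 image as above; volume mounts omitted):

docker run --runtime=nvidia -it --rm \
  -e OMPI_ALLOW_RUN_AS_ROOT=1 \
  -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
  nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash
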
  2. I need more guidance in order to run the second test.

When I tried the docker run ... bash for the container you provided, the docker entrypoint failed with chmod: cannot access '/opt/ngccli/ngc': No such file or directory.

It seems that this is a known issue, so I added --entrypoint "" to the docker run command. Unfortunately this means that the container didn’t have classification_tf1 installed.

I tried to pip install nvidia-tao-tf1; however, pip could only find nvidia_tao_tf1-4.0.0.657.dev0, which doesn't match the TAO version of the container and appears to be incompatible with it. Running classification_tf1 -h results in AttributeError: module 'third_party.keras' has no attribute 'mixed_precision'.

Please let me know how best to proceed with running this second test.

You can run classification instead of classification_tf1.

Thanks for the clarification.

I was able to run classification train ... with multiple GPUs using the old Tao Container. (Yay!)

Does this help us understand how to run it on the latest Tao container?

As above, could you please share the log from the successful multi-GPU training run with the old TAO docker? Thanks a lot.

Sure, here’s the log.
log_20230517.txt (46.4 KB)

If possible, please help run an experiment to narrow this down.

Uninstall 525:
    sudo apt purge nvidia-driver-525
    sudo apt autoremove
    sudo apt autoclean

Install the older version:
    sudo apt install nvidia-driver-510

Then run the 4.0.1 docker to check whether it works.

After running the commands you provided to switch to 510, I couldn't even run tao classification_tf1 -h; I was getting pycuda._driver.LogicError: cuInit failed: system not yet initialized. After trying lots of things, it turned out that fabricmanager-510 was installing the 515 fabric manager, and the version mismatch between the 510 driver and the 515 fabric manager was causing problems:

$ systemctl status nvidia-fabricmanager.service
...
sdgx-server nv-fabricmanager[5048]: fabric manager NVIDIA GPU driver interface version 515.105.01 don't match with driver version 510.108.03. Please update with matching NVIDIA driver package.
...
sdgx-server systemd[1]: Failed to start NVIDIA fabric manager service.

As a result, I followed your instructions but with 515 instead, and I still get the same error. For what it's worth, I was seeing this same error when the system was running 470, before I upgraded it to the latest driver and submitted this help request.

Is there a different way for me to get the system running with 510, or should I try an earlier version?

It is fine to verify with 515.
There is no need to check earlier versions now.

Since we cannot reproduce this on our internal V100 machines, we will need to investigate the gap further.

Hi,
Since the mpirun version differs between the 22.05 docker and the 4.0.1 docker, could you please run the experiments below if you have the bandwidth? Thanks a lot.

  1. In the 4.0.1 docker, build and run the broadcast example (a build sketch follows the reference links):

code: mpitutorial/tutorials/mpi-broadcast-and-collective-communication/code at gh-pages · mpitutorial/mpitutorial · GitHub
Run: mpirun -n 4 ./my_bcast
Reference: MPI Broadcast and Collective Communication · MPI Tutorial
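
A sketch of fetching and building the tutorial code, assuming the linked repository's layout (its default branch is gh-pages):

git clone https://github.com/mpitutorial/mpitutorial.git
cd mpitutorial/tutorials/mpi-broadcast-and-collective-communication/code
make                 # or, assuming the single-source example: mpicc -o my_bcast my_bcast.c
mpirun --allow-run-as-root -n 4 ./my_bcast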

  2. In the 4.0.1 docker, install the mpirun version from the 22.05 docker (Open MPI 4.1.2):
# from https://edu.itp.phys.ethz.ch/hs12/programming_techniques/openmpi.pdf and https://www.open-mpi.org/software/ompi/v4.1/
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.bz2
mkdir src
mv openmpi-4.1.2.tar.bz2 src/
cd src/
tar -jxf openmpi-4.1.2.tar.bz2
cd openmpi-4.1.2
./configure --prefix=$HOME/opt/openmpi
make -j128 all
make install
mpirun --version
echo "export PATH=\$PATH:\$HOME/opt/openmpi/bin" >> $HOME/.bashrc
echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:\$HOME/opt/openmpi/lib" >> $HOME/.bashrc
. ~/.bashrc

Then, use this 4.1.2 version of mpirun to run training again.

mpirun --allow-run-as-root -np 4 python /usr/local/lib/python3.6/dist-packages/iva/makenet/scripts/train.py -e spec.txt -r result -k key
  1. I was able to run my_bcast.

  2. Running the training with the mpirun 4.1.2 command worked. Any idea why training works for me with Open MPI 4.1.2 but not with 4.1.5a1?

Note: in addition to the above steps, I also needed to run export OPAL_PREFIX=$HOME/opt/openmpi/; otherwise mpirun tried to pull in the wrong version of some libraries (mpirun: symbol lookup error...).
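
Putting those pieces together, the environment needed before invoking the locally built mpirun looks like this (collected entirely from the steps above):

# entries appended to ~/.bashrc by the build steps
export PATH=$PATH:$HOME/opt/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/opt/openmpi/lib
# additionally required so mpirun resolves libraries from its own install tree
export OPAL_PREFIX=$HOME/opt/openmpi/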