I can train with a single GPU without any issues, but when I try to train with more than one GPU, I get the error:
cma_ep.c:81 process_vm_readv(pid=650 {0x7f7cf45dd3d0,16569}-->{0x7fe11c5efd28,16569}) returned -1: Operation not permitted
The error occurs immediately after the first batch of the first epoch.
The following TAO command runs fine with --gpus set to 1, but I get the above error when I set it to 2 or more:
tao classification_tf1 train -e /data/e1.cfg -r /results/e1 --gpus 1 -k nvidia_tlt
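For background: the failing call, process_vm_readv, is the CMA (cross-memory-attach) copy used between ranks on the same node, and inside a container it generally requires the SYS_PTRACE capability. I don't know whether that is the limiting factor here, but a hedged way to check would be to launch the training container directly with the capability added (mounts omitted) and retry the multi-GPU run inside it:
$ docker run --runtime=nvidia -it --rm --cap-add=SYS_PTRACE nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 /bin/bash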
• Hardware: DGX-2 (16x V100s)
• Network Type: classification_tf1
• TAO Version:
Configuration of the TAO Toolkit Instance
dockers:
  nvidia/tao/tao-toolkit:
    4.0.0-tf2.9.1:
      docker_registry: nvcr.io
      tasks:
        1. classification_tf2
        2. efficientdet_tf2
    4.0.0-tf1.15.5:
      docker_registry: nvcr.io
      tasks:
        1. augment
        2. bpnet
        3. classification_tf1
        4. detectnet_v2
        5. dssd
        6. emotionnet
        7. efficientdet_tf1
        8. faster_rcnn
        9. fpenet
        10. gazenet
        11. gesturenet
        12. heartratenet
        13. lprnet
        14. mask_rcnn
        15. multitask_classification
        16. retinanet
        17. ssd
        18. unet
        19. yolo_v3
        20. yolo_v4
        21. yolo_v4_tiny
        22. converter
    4.0.1-tf1.15.5:
      docker_registry: nvcr.io
      tasks:
        1. mask_rcnn
        2. unet
    4.0.0-pyt:
      docker_registry: nvcr.io
      tasks:
        1. action_recognition
        2. deformable_detr
        3. segformer
        4. re_identification
        5. pointpillars
        6. pose_classification
        7. n_gram
        8. speech_to_text
        9. speech_to_text_citrinet
        10. speech_to_text_conformer
        11. spectro_gen
        12. vocoder
        13. text_classification
        14. question_answering
        15. token_classification
        16. intent_slot_classification
        17. punctuation_and_capitalization
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023
Here is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM3... On | 00000000:34:00.0 Off | 0 |
| N/A 26C P0 49W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM3... On | 00000000:36:00.0 Off | 0 |
| N/A 26C P0 49W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM3... On | 00000000:39:00.0 Off | 0 |
| N/A 32C P0 50W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM3... On | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 49W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM3... On | 00000000:57:00.0 Off | 0 |
| N/A 26C P0 47W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM3... On | 00000000:59:00.0 Off | 0 |
| N/A 32C P0 49W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM3... On | 00000000:5C:00.0 Off | 0 |
| N/A 27C P0 47W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM3... On | 00000000:5E:00.0 Off | 0 |
| N/A 31C P0 50W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 8 Tesla V100-SXM3... On | 00000000:B7:00.0 Off | 0 |
| N/A 31C P0 49W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 9 Tesla V100-SXM3... On | 00000000:B9:00.0 Off | 0 |
| N/A 28C P0 48W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 10 Tesla V100-SXM3... On | 00000000:BC:00.0 Off | 0 |
| N/A 34C P0 48W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 11 Tesla V100-SXM3... On | 00000000:BE:00.0 Off | 0 |
| N/A 32C P0 49W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 12 Tesla V100-SXM3... On | 00000000:E0:00.0 Off | 0 |
| N/A 31C P0 48W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 13 Tesla V100-SXM3... On | 00000000:E2:00.0 Off | 0 |
| N/A 30C P0 48W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 14 Tesla V100-SXM3... On | 00000000:E5:00.0 Off | 0 |
| N/A 35C P0 48W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 15 Tesla V100-SXM3... On | 00000000:E7:00.0 Off | 0 |
| N/A 36C P0 50W / 350W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
To narrow this down, please log in to the 4.0.1 docker directly and install an older version of NCCL as shown below, then run the training again. Thanks.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt install libnccl2=2.11.4-1+cuda11.6 libnccl-dev=2.11.4-1+cuda11.6
ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
Step 4 “add-apt-repository” returned a Python error: “ModuleNotFoundError: No module named ‘apt_pkg’”. Please let me know if I should retry with some modification of the first 4 steps.
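For reference, the same repository entry can also be added without add-apt-repository; a sketch, assuming the standard /etc/apt/sources.list.d layout:
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" | sudo tee /etc/apt/sources.list.d/cuda.list
sudo apt-get update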
After restarting my docker container and running only the last 3 commands (skipping the first 4), it appears to have worked, and here’s the output of the last step:
# ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
/sbin/ldconfig.real: Path `/usr/local/cuda-11/targets/x86_64-linux/lib' given more than once
/sbin/ldconfig.real: Path `/usr/local/cuda/lib64' given more than once
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib64: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib' given more than once
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/ld-2.31.so is the dynamic linker, ignoring
2.11.4
Unfortunately I am still getting the same error when I try to run my training.
Can you run the experiments below to check whether multiple GPUs work?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tensorrt:22.11-py3 /bin/bash
Then inside the docker
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
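If either run fails, repeating it with NCCL debug logging enabled can show where the setup breaks down, e.g.:
$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4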
Both tests ran successfully with similar output.
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1018 on 110db7276d2a device 0 [0x34] Tesla V100-SXM3-32GB
# Rank 1 Group 0 Pid 1018 on 110db7276d2a device 1 [0x36] Tesla V100-SXM3-32GB
# Rank 2 Group 0 Pid 1018 on 110db7276d2a device 2 [0x39] Tesla V100-SXM3-32GB
# Rank 3 Group 0 Pid 1018 on 110db7276d2a device 3 [0x3b] Tesla V100-SXM3-32GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 15.65 0.00 0.00 0 15.51 0.00 0.00 0
16 4 float sum -1 17.06 0.00 0.00 0 15.37 0.00 0.00 0
32 8 float sum -1 18.00 0.00 0.00 0 17.26 0.00 0.00 0
64 16 float sum -1 16.32 0.00 0.01 0 16.30 0.00 0.01 0
128 32 float sum -1 17.82 0.01 0.01 0 15.52 0.01 0.01 0
256 64 float sum -1 17.12 0.01 0.02 0 15.50 0.02 0.02 0
512 128 float sum -1 17.70 0.03 0.04 0 15.39 0.03 0.05 0
1024 256 float sum -1 17.39 0.06 0.09 0 16.08 0.06 0.10 0
2048 512 float sum -1 17.70 0.12 0.17 0 15.55 0.13 0.20 0
4096 1024 float sum -1 17.15 0.24 0.36 0 15.43 0.27 0.40 0
8192 2048 float sum -1 17.06 0.48 0.72 0 16.56 0.49 0.74 0
16384 4096 float sum -1 17.45 0.94 1.41 0 16.01 1.02 1.54 0
32768 8192 float sum -1 18.42 1.78 2.67 0 16.63 1.97 2.96 0
65536 16384 float sum -1 20.16 3.25 4.88 0 18.30 3.58 5.37 0
131072 32768 float sum -1 23.59 5.56 8.33 0 21.62 6.06 9.09 0
262144 65536 float sum -1 27.80 9.43 14.14 0 25.18 10.41 15.61 0
524288 131072 float sum -1 48.59 10.79 16.19 0 48.13 10.89 16.34 0
1048576 262144 float sum -1 60.36 17.37 26.06 0 59.31 17.68 26.52 0
2097152 524288 float sum -1 80.19 26.15 39.23 0 78.58 26.69 40.03 0
4194304 1048576 float sum -1 119.5 35.10 52.65 0 118.5 35.40 53.09 0
8388608 2097152 float sum -1 138.7 60.48 90.72 0 138.1 60.75 91.13 0
16777216 4194304 float sum -1 237.7 70.59 105.89 0 237.6 70.62 105.93 0
33554432 8388608 float sum -1 441.8 75.95 113.93 0 442.0 75.91 113.87 0
67108864 16777216 float sum -1 849.0 79.04 118.57 0 849.5 79.00 118.50 0
134217728 33554432 float sum -1 1667.4 80.49 120.74 0 1667.7 80.48 120.72 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 28.7815
Similarly, please run the same tests in the TAO docker as well.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash
Then inside the docker
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
Both tests in the TAO container ran successfully; however, there are additional log messages compared to the output from the tensorrt container. I’ve included all of the log messages in case they are significant. The one that concerns me most is NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol, but it seems to recover fine.
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1034 on 4120a457223a device 0 [0x34] Tesla V100-SXM3-32GB
# Rank 1 Group 0 Pid 1034 on 4120a457223a device 1 [0x36] Tesla V100-SXM3-32GB
# Rank 2 Group 0 Pid 1034 on 4120a457223a device 2 [0x39] Tesla V100-SXM3-32GB
# Rank 3 Group 0 Pid 1034 on 4120a457223a device 3 [0x3b] Tesla V100-SXM3-32GB
4120a457223a:1034:1034 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
4120a457223a:1034:1034 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
4120a457223a:1034:1034 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
4120a457223a:1034:1034 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
4120a457223a:1034:1034 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
4120a457223a:1034:1034 [3] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
4120a457223a:1034:1043 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
4120a457223a:1034:1043 [0] NCCL INFO P2P plugin IBext
4120a457223a:1034:1043 [0] NCCL INFO NET/IB : No device found.
4120a457223a:1034:1043 [0] NCCL INFO NET/IB : No device found.
4120a457223a:1034:1043 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
4120a457223a:1034:1043 [0] NCCL INFO Using network Socket
4120a457223a:1034:1044 [1] NCCL INFO Using network Socket
4120a457223a:1034:1045 [2] NCCL INFO Using network Socket
4120a457223a:1034:1046 [3] NCCL INFO Using network Socket
4120a457223a:1034:1043 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
4120a457223a:1034:1045 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
4120a457223a:1034:1044 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
4120a457223a:1034:1046 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
4120a457223a:1034:1045 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1
4120a457223a:1034:1046 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->2 [5] -1/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] -1/-1/-1->3->2 [8] -1/-1/-1->3->2 [9] -1/-1/-1->3->2 [10] -1/-1/-1->3->2 [11] -1/-1/-1->3->2
4120a457223a:1034:1043 [0] NCCL INFO Channel 00/12 : 0 1 2 3
4120a457223a:1034:1044 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0
4120a457223a:1034:1043 [0] NCCL INFO Channel 01/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 02/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 03/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 04/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 05/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 06/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 07/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 08/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 09/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 10/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Channel 11/12 : 0 1 2 3
4120a457223a:1034:1043 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1
4120a457223a:1034:1045 [2] NCCL INFO Channel 00/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 00/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 00/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 00/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 01/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 01/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 01/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 01/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 02/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 02/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 02/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 02/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 03/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 03/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 03/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 03/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 04/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 04/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 04/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 04/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 05/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 05/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 05/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 05/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 06/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 06/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 06/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 06/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 07/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 07/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 07/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 07/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 08/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 08/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 08/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 08/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 09/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 09/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 09/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 09/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 10/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 10/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 10/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 10/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 11/0 : 2[39000] -> 3[3b000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 11/0 : 3[3b000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 11/0 : 1[36000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Channel 11/0 : 0[34000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Connected all rings
4120a457223a:1034:1044 [1] NCCL INFO Connected all rings
4120a457223a:1034:1046 [3] NCCL INFO Connected all rings
4120a457223a:1034:1046 [3] NCCL INFO Channel 00/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Connected all rings
4120a457223a:1034:1046 [3] NCCL INFO Channel 01/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 02/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 03/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 04/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 05/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 06/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 07/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 08/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 09/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 10/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1046 [3] NCCL INFO Channel 11/0 : 3[3b000] -> 2[39000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 00/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 00/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 01/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 01/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 02/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 02/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 03/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 03/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 04/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 04/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 05/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 05/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 06/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 06/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 07/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 07/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 08/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 08/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 09/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 09/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 10/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 10/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1044 [1] NCCL INFO Channel 11/0 : 1[36000] -> 0[34000] via P2P/direct pointer
4120a457223a:1034:1045 [2] NCCL INFO Channel 11/0 : 2[39000] -> 1[36000] via P2P/direct pointer
4120a457223a:1034:1043 [0] NCCL INFO Connected all trees
4120a457223a:1034:1043 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
4120a457223a:1034:1043 [0] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
4120a457223a:1034:1044 [1] NCCL INFO Connected all trees
4120a457223a:1034:1044 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
4120a457223a:1034:1046 [3] NCCL INFO Connected all trees
4120a457223a:1034:1046 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
4120a457223a:1034:1045 [2] NCCL INFO Connected all trees
4120a457223a:1034:1045 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
4120a457223a:1034:1044 [1] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
4120a457223a:1034:1046 [3] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
4120a457223a:1034:1045 [2] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
4120a457223a:1034:1044 [1] NCCL INFO comm 0x55d0915427b0 rank 1 nranks 4 cudaDev 1 busId 36000 - Init COMPLETE
4120a457223a:1034:1046 [3] NCCL INFO comm 0x55d08c2527a0 rank 3 nranks 4 cudaDev 3 busId 3b000 - Init COMPLETE
4120a457223a:1034:1045 [2] NCCL INFO comm 0x55d08c24fd10 rank 2 nranks 4 cudaDev 2 busId 39000 - Init COMPLETE
4120a457223a:1034:1043 [0] NCCL INFO comm 0x55d09153fd20 rank 0 nranks 4 cudaDev 0 busId 34000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 15.42 0.00 0.00 0 15.56 0.00 0.00 0
16 4 float sum -1 16.38 0.00 0.00 0 15.30 0.00 0.00 0
32 8 float sum -1 17.12 0.00 0.00 0 15.54 0.00 0.00 0
64 16 float sum -1 16.29 0.00 0.01 0 16.34 0.00 0.01 0
128 32 float sum -1 16.94 0.01 0.01 0 16.40 0.01 0.01 0
256 64 float sum -1 17.78 0.01 0.02 0 15.43 0.02 0.02 0
512 128 float sum -1 17.77 0.03 0.04 0 15.36 0.03 0.05 0
1024 256 float sum -1 18.00 0.06 0.09 0 16.56 0.06 0.09 0
2048 512 float sum -1 17.12 0.12 0.18 0 16.39 0.12 0.19 0
4096 1024 float sum -1 17.28 0.24 0.36 0 15.51 0.26 0.40 0
8192 2048 float sum -1 17.24 0.48 0.71 0 15.81 0.52 0.78 0
16384 4096 float sum -1 17.43 0.94 1.41 0 16.79 0.98 1.46 0
32768 8192 float sum -1 18.44 1.78 2.66 0 16.72 1.96 2.94 0
65536 16384 float sum -1 19.95 3.29 4.93 0 18.21 3.60 5.40 0
131072 32768 float sum -1 23.24 5.64 8.46 0 21.67 6.05 9.07 0
262144 65536 float sum -1 26.36 9.94 14.92 0 24.93 10.51 15.77 0
524288 131072 float sum -1 48.72 10.76 16.14 0 48.27 10.86 16.29 0
1048576 262144 float sum -1 60.53 17.32 25.99 0 59.17 17.72 26.58 0
2097152 524288 float sum -1 80.30 26.12 39.17 0 78.44 26.73 40.10 0
4194304 1048576 float sum -1 119.7 35.05 52.58 0 118.1 35.50 53.25 0
8388608 2097152 float sum -1 138.4 60.63 90.94 0 137.6 60.94 91.41 0
16777216 4194304 float sum -1 237.3 70.70 106.06 0 237.4 70.68 106.03 0
33554432 8388608 float sum -1 441.1 76.07 114.11 0 441.4 76.01 114.02 0
67108864 16777216 float sum -1 848.3 79.11 118.67 0 848.6 79.08 118.63 0
134217728 33554432 float sum -1 1669.7 80.38 120.57 0 1667.3 80.50 120.75 0
4120a457223a:1034:1034 [3] NCCL INFO comm 0x55d09153fd20 rank 0 nranks 4 cudaDev 0 busId 34000 - Destroy COMPLETE
4120a457223a:1034:1034 [3] NCCL INFO comm 0x55d0915427b0 rank 1 nranks 4 cudaDev 1 busId 36000 - Destroy COMPLETE
4120a457223a:1034:1034 [3] NCCL INFO comm 0x55d08c24fd10 rank 2 nranks 4 cudaDev 2 busId 39000 - Destroy COMPLETE
4120a457223a:1034:1034 [3] NCCL INFO comm 0x55d08c2527a0 rank 3 nranks 4 cudaDev 3 busId 3b000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 28.8257
OK, could you please log in to the TAO docker and run the classification training again?
$ docker run --runtime=nvidia -it --rm -v your/local/path:docker/path nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash
Then, run the training without the “tao” launcher prefix:
# classification train -e /data/e1.cfg -r /results/e1 --gpus 2 -k nvidia_tlt
If the issue still happens, please share the full log with us. Thanks.
Hi Morganh, thanks for continuing to look into this. Here’s the log.
log.txt (59.0 KB)
To narrow this down, could you please run the experiments below?
- Continuing with the above environment, run the MPI hello-world test following the guide below (a rough sketch of the steps follows after these two items):
MPI Hello World · MPI Tutorial
- Set up a new environment using an older version of the TAO container:
$ docker run --runtime=nvidia -it --rm -v your/local/path:docker/path nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash
Then, run the training without the “tao” launcher prefix:
# classification train -e /data/e1.cfg -r /results/e1 --gpus 2 -k nvidia_tlt
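For the first item, the hello-world test boils down to roughly the following; the repository URL and layout are assumed from the mpitutorial site and have not been verified inside this container:
git clone https://github.com/mpitutorial/mpitutorial.git
cd mpitutorial/tutorials/mpi-hello-world/code
make
mpirun --allow-run-as-root -n 4 ./mpi_hello_world   # --allow-run-as-root is only needed when running as root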
- Here’s the output of the MPI Hello World, which seems to be as expected. I needed to use --allow-run-as-root because I was running as root inside the docker container. Alternatively, I was able to run the tutorials without --allow-run-as-root by setting the environment variables -e OMPI_ALLOW_RUN_AS_ROOT=1 -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 when starting the docker container; however, these flags had no impact on my attempt to train with multiple GPUs.
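In other words, starting the container roughly like this (volume mounts omitted):
$ docker run --runtime=nvidia -it --rm -e OMPI_ALLOW_RUN_AS_ROOT=1 -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash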
# mpirun --allow-run-as-root -n 4 ./mpi-hello-world/code/mpi_hello_world
Hello world from processor c5288c2f1e10, rank 1 out of 4 processors
Hello world from processor c5288c2f1e10, rank 0 out of 4 processors
Hello world from processor c5288c2f1e10, rank 2 out of 4 processors
Hello world from processor c5288c2f1e10, rank 3 out of 4 processors
- I need more guidance in order to run this test.
When I tried the docker run ... bash for the container you provided, the docker entrypoint failed with chmod: cannot access '/opt/ngccli/ngc': No such file or directory.
It seems that this is a known issue, so I added --entrypoint "" to the docker run command. Unfortunately, this means that the container didn’t have classification_tf1 installed.
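So the command ended up looking roughly like this:
$ docker run --runtime=nvidia -it --rm --entrypoint "" -v your/local/path:docker/path nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash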
I tried to pip install nvidia-tao-tf1, however pip could only find nvidia_tao_tf1-4.0.0.657.dev0, which isn’t the same TAO version as the container and seems to be incompatible with it. Trying to run classification_tf1 -h results in AttributeError: module 'third_party.keras' has no attribute 'mixed_precision'.
Please let me know how best to proceed with running this second test.
You can run classification instead.
Thanks for the clarification.
I was able to run classification train ... with multiple GPUs using the old TAO container. (Yay!)
Does this help us understand how to run it on the latest TAO container?
Similar to the above, could you please share the log from the successful multi-GPU training run with the old TAO docker? Thanks a lot.
Sure, here’s the log.
log_20230517.txt (46.4 KB)
If possible, please kindly run an experiment to help narrow this down.
Uninstall 525:
sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
Install the old version:
sudo apt install nvidia-driver-510
Then run the 4.0.1 docker to check whether it works.
After running the commands you provided to switch to 510, I couldn’t even run tao classification_tf1 -h; I was getting pycuda._driver.LogicError: cuInit failed: system not yet initialized. After trying lots of things, it turns out that fabricmanager-510 was installing 515, and the version mismatch between 510 and 515 was causing problems:
$ systemctl status nvidia-fabricmanager.service
...
sdgx-server nv-fabricmanager[5048]: fabric manager NVIDIA GPU driver interface version 515.105.01 don't match with driver version 510.108.03. Please update with matching NVIDIA driver package.
...
sdgx-server systemd[1]: Failed to start NVIDIA fabric manager service.
As a result, I followed your instructions to install 515 instead, and I still have the same error. For what it’s worth, I was experiencing this same error when this system was running 470, before I upgraded the system to latest and submitted this help request.
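For anyone else who hits this: the general fix seems to be keeping the driver and the fabric manager on the same branch. A sketch, with package names assumed from the NVIDIA apt repositories:
sudo apt install nvidia-driver-515 nvidia-fabricmanager-515
sudo systemctl restart nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service   # should no longer report a version mismatch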
Is there a different way for me to get the system running with 510, or should I try an earlier version?
It is fine to verify with 515.
There is no need to check earlier versions for now.
Since we cannot reproduce this on our internal V100 machines, we will need to investigate the gap further.
Hi,
Since the mpirun version differs between the 22.05 docker and the 4.0.1 docker, could you run the experiments below if you have the bandwidth? Thanks a lot.
- In the 4.0.1 docker, build the code and run it.
code: mpitutorial/tutorials/mpi-broadcast-and-collective-communication/code at gh-pages · mpitutorial/mpitutorial · GitHub
Run: mpirun -n 4 ./my_bcast
Reference: MPI Broadcast and Collective Communication · MPI Tutorial
- In the 4.0.1 docker, install the mpirun version used in the 22.05 docker.
# from https://edu.itp.phys.ethz.ch/hs12/programming_techniques/openmpi.pdf and https://www.open-mpi.org/software/ompi/v4.1/
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.bz2
mkdir src
mv openmpi-4.1.2.tar.bz2 src/
cd src/
tar -jxf openmpi-4.1.2.tar.bz2
cd openmpi-4.1.2
./configure --prefix=$HOME/opt/openmpi
make -j128 all
make install
mpirun --version
echo "export PATH=\$PATH:\$HOME/opt/openmpi/bin" >> $HOME/.bashrc
echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:\$HOME/opt/openmpi/lib" >> $HOME/.bashrc
. ~/.bashrc
Then, use this 4.1.2 version of mpirun to run training again.
mpirun --allow-run-as-root -np 4 python /usr/local/lib/python3.6/dist-packages/iva/makenet/scripts/train.py -e spec.txt -r result -k key
- I was able to run my_bcast.
- Running the training using the mpirun 4.1.2 command worked. Any idea why training works for me with MPI 4.1.2 but not with MPI 4.1.5a1?
Note: in addition to the above steps, I also needed to run export OPAL_PREFIX=$HOME/opt/openmpi/, otherwise mpirun was trying to pull in the wrong version of some libraries (mpirun: symbol lookup error...).
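For completeness, the working multi-GPU invocation inside the 4.0.1 container ended up looking roughly like this:
# point everything at the locally built Open MPI 4.1.2 (paths from the build steps above)
export OPAL_PREFIX=$HOME/opt/openmpi
export PATH=$HOME/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/openmpi/lib:$LD_LIBRARY_PATH
mpirun --version   # should report 4.1.2
# launch the training script directly, as suggested above
mpirun --allow-run-as-root -np 4 python /usr/local/lib/python3.6/dist-packages/iva/makenet/scripts/train.py -e spec.txt -r result -k key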