Hi,
To narrow this down, could you please run the training directly with Docker under k8s?
Steps:
Generate debug.yaml. An example is below. It will pull the Docker image nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 and also mount my local folder /localhome/local-morganh/ into the pod.
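I don't have the original attachment handy, so here is a minimal sketch of what such a debug.yaml could look like (the pod name, GPU count, container command, and mount target are assumptions; only the image tag and host path come from above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  containers:
  - name: debug
    image: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
    command: ["sleep", "infinity"]   # keep the pod alive for interactive debugging
    resources:
      limits:
        nvidia.com/gpu: 2            # assumption: expose both GPUs to the pod
    volumeMounts:
    - name: local-folder
      mountPath: /workspace/tao-experiments
  volumes:
  - name: local-folder
    hostPath:
      path: /localhome/local-morganh/
```

After `kubectl apply -f debug.yaml`, you can enter it with `kubectl exec -it debug -- /bin/bash`.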
Currently, we are considering another solution.
Could you please exit the debug pod and run the commands below on your machine? Please share the results with us. (These check whether PCIe ACS and the IOMMU are active; both can interfere with GPU peer-to-peer communication.)
$ sudo lspci -vvv | grep ACSCtl
$ dmesg | grep IOMMU
But I get stuck at the same point in the AutoML training with 2 GPUs:
INFO:tensorflow:Graph was finalized.
2023-08-07 10:10:07,733 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-07 10:10:10,388 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-07 10:10:11,018 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-07 10:10:21,599 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.
[2023-08-07 10:11:28. 10180: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond_142/HorovodAllreduce_mul_333_0, DistributedAdamOptimizer_Allreduce/cond_143/HorovodAllreduce_mul_334_0, DistributedAdamOptimizer_Allreduce/cond_144/HorovodAllreduce_mul_335_0, DistributedAdamOptimizer_Allreduce/cond_145/HorovodAllreduce_mul_336_0, DistributedAdamOptimizer_Allreduce/cond_146/HorovodAllreduce_mul_337_0, DistributedAdamOptimizer_Allreduce/cond_147/HorovodAllreduce_mul_338_0 ...]
[2023-08-07 10:12:28. 10561: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_191_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_192_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_201_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_291_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_292_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_293_0 ...]
1: [DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_355_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_356_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_357_0, DistributedAdamOptimizer_Allreduce/cond_103/HorovodAllreduce_mul_358_0, DistributedAdamOptimizer_Allreduce/cond_104/HorovodAllreduce_mul_359_0, DistributedAdamOptimizer_Allreduce/cond_105/HorovodAllreduce_mul_360_0 ...]
Also, I tried the nccl-tests method again:
root@debug:/workspace/nccl-tests# export NCCL_DEBUG=TRACE
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1153 on debug device 0 [0x21] NVIDIA RTX 6000 Ada Generation
# Rank 1 Group 0 Pid 1153 on debug device 1 [0x22] NVIDIA RTX 6000 Ada Generation
debug:1153:1153 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.104<0>
debug:1153:1153 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug:1153:1153 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug:1153:1153 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug:1153:1153 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug:1153:1153 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.16.5+cuda12.0
debug:1153:1162 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug:1153:1162 [0] NCCL INFO P2P plugin IBext
debug:1153:1162 [0] NCCL INFO NET/IB : No device found.
debug:1153:1162 [0] NCCL INFO NET/IB : No device found.
debug:1153:1162 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.104<0>
debug:1153:1162 [0] NCCL INFO Using network Socket
debug:1153:1163 [1] NCCL INFO Using network Socket
debug:1153:1162 [0] NCCL INFO Channel 00/04 : 0 1
debug:1153:1162 [0] NCCL INFO Channel 01/04 : 0 1
debug:1153:1163 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
debug:1153:1163 [1] NCCL INFO P2P Chunksize set to 131072
debug:1153:1162 [0] NCCL INFO Channel 02/04 : 0 1
debug:1153:1162 [0] NCCL INFO Channel 03/04 : 0 1
debug:1153:1162 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
debug:1153:1162 [0] NCCL INFO P2P Chunksize set to 131072
debug:1153:1162 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1153:1163 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1153:1162 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1153:1163 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1153:1162 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1153:1163 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1153:1162 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1153:1163 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1153:1162 [0] NCCL INFO Connected all rings
debug:1153:1162 [0] NCCL INFO Connected all trees
debug:1153:1163 [1] NCCL INFO Connected all rings
debug:1153:1163 [1] NCCL INFO Connected all trees
debug:1153:1163 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
debug:1153:1163 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1153:1162 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
debug:1153:1162 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1153:1162 [0] NCCL INFO comm 0x55b033df9cd0 rank 0 nranks 2 cudaDev 0 busId 21000 commId 0x78df2eb5ffee4e8d - Init COMPLETE
debug:1153:1163 [1] NCCL INFO comm 0x55b033dfe420 rank 1 nranks 2 cudaDev 1 busId 22000 commId 0x78df2eb5ffee4e8d - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 7.42 0.00 0.00 0 6.18 0.00 0.00 0
16 4 float sum -1 6.71 0.00 0.00 0 6.06 0.00 0.00 0
32 8 float sum -1 6.19 0.01 0.01 0 6.09 0.01 0.01 0
64 16 float sum -1 6.14 0.01 0.01 0 6.13 0.01 0.01 0
128 32 float sum -1 6.12 0.02 0.02 0 6.08 0.02 0.02 0
256 64 float sum -1 6.12 0.04 0.04 0 6.07 0.04 0.04 0
512 128 float sum -1 9.43 0.05 0.05 0 6.15 0.08 0.08 0
1024 256 float sum -1 6.15 0.17 0.17 0 6.10 0.17 0.17 0
2048 512 float sum -1 6.21 0.33 0.33 0 6.10 0.34 0.34 0
4096 1024 float sum -1 6.33 0.65 0.65 0 6.22 0.66 0.66 0
8192 2048 float sum -1 6.71 1.22 1.22 0 6.67 1.23 1.23 0
16384 4096 float sum -1 7.47 2.19 2.19 0 7.37 2.22 2.22 0
32768 8192 float sum -1 8.85 3.70 3.70 0 8.79 3.73 3.73 0
65536 16384 float sum -1 12.06 5.43 5.43 0 12.03 5.45 5.45 0
131072 32768 float sum -1 25.03 5.24 5.24 0 24.88 5.27 5.27 0
262144 65536 float sum -1 34.98 7.49 7.49 0 38.05 6.89 6.89 0
524288 131072 float sum -1 49.06 10.69 10.69 0 49.49 10.59 10.59 0
1048576 262144 float sum -1 81.77 12.82 12.82 0 64.79 16.18 16.18 0
2097152 524288 float sum -1 108.2 19.39 19.39 0 107.2 19.56 19.56 0
4194304 1048576 float sum -1 197.9 21.19 21.19 0 197.6 21.22 21.22 0
8388608 2097152 float sum -1 386.3 21.72 21.72 0 384.2 21.84 21.84 0
16777216 4194304 float sum -1 762.6 22.00 22.00 0 760.0 22.08 22.08 0
33554432 8388608 float sum -1 1513.3 22.17 22.17 0 1512.6 22.18 22.18 0
67108864 16777216 float sum -1 3012.9 22.27 22.27 0 3011.5 22.28 22.28 0
134217728 33554432 float sum -1 6018.4 22.30 22.30 0 6014.8 22.31 22.31 0
debug:1153:1153 [1] NCCL INFO comm 0x55b033df9cd0 rank 0 nranks 2 cudaDev 0 busId 21000 - Destroy COMPLETE
debug:1153:1153 [1] NCCL INFO comm 0x55b033dfe420 rank 1 nranks 2 cudaDev 1 busId 22000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.10962
#
I think the result is good. No extra environment variables: just open the pod, clone the repo, compile, and launch.
OK, so nccl-tests is fine.
For detectnet_v2 training, to narrow this down, could you please generate a new debug pod for 4.0.1 or 22.05 and test inside it?
nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
Argh, all this time lost because of the I/O virtualization…
LOG
root@debug-tao4:/workspace/nccl-tests# export NCCL_DEBUG=TRACE
root@debug-tao4:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 983 on debug-tao4 device 0 [0x21] NVIDIA RTX 6000 Ada Generation
# Rank 1 Group 0 Pid 983 on debug-tao4 device 1 [0x22] NVIDIA RTX 6000 Ada Generation
debug-tao4:983:983 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.103<0>
debug-tao4:983:983 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug-tao4:983:983 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug-tao4:983:983 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug-tao4:983:983 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug-tao4:983:983 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
debug-tao4:983:992 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug-tao4:983:992 [0] NCCL INFO P2P plugin IBext
debug-tao4:983:992 [0] NCCL INFO NET/IB : No device found.
debug-tao4:983:992 [0] NCCL INFO NET/IB : No device found.
debug-tao4:983:992 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.103<0>
debug-tao4:983:992 [0] NCCL INFO Using network Socket
debug-tao4:983:993 [1] NCCL INFO Using network Socket
debug-tao4:983:992 [0] NCCL INFO Channel 00/04 : 0 1
debug-tao4:983:992 [0] NCCL INFO Channel 01/04 : 0 1
debug-tao4:983:992 [0] NCCL INFO Channel 02/04 : 0 1
debug-tao4:983:992 [0] NCCL INFO Channel 03/04 : 0 1
debug-tao4:983:993 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
debug-tao4:983:992 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
debug-tao4:983:992 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug-tao4:983:993 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug-tao4:983:992 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug-tao4:983:993 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug-tao4:983:992 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug-tao4:983:993 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug-tao4:983:992 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug-tao4:983:993 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug-tao4:983:992 [0] NCCL INFO Connected all rings
debug-tao4:983:993 [1] NCCL INFO Connected all rings
debug-tao4:983:992 [0] NCCL INFO Connected all trees
debug-tao4:983:993 [1] NCCL INFO Connected all trees
debug-tao4:983:993 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
debug-tao4:983:993 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug-tao4:983:992 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
debug-tao4:983:992 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug-tao4:983:993 [1] NCCL INFO comm 0x561be0c3db50 rank 1 nranks 2 cudaDev 1 busId 22000 - Init COMPLETE
debug-tao4:983:992 [0] NCCL INFO comm 0x561be0c3b0c0 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 6.05 0.00 0.00 0 6.10 0.00 0.00 0
16 4 float sum -1 7.36 0.00 0.00 0 6.08 0.00 0.00 0
32 8 float sum -1 7.50 0.00 0.00 0 6.06 0.01 0.01 0
64 16 float sum -1 7.60 0.01 0.01 0 6.03 0.01 0.01 0
128 32 float sum -1 7.36 0.02 0.02 0 6.03 0.02 0.02 0
256 64 float sum -1 7.72 0.03 0.03 0 6.02 0.04 0.04 0
512 128 float sum -1 7.37 0.07 0.07 0 6.07 0.08 0.08 0
1024 256 float sum -1 7.68 0.13 0.13 0 6.06 0.17 0.17 0
2048 512 float sum -1 7.40 0.28 0.28 0 6.21 0.33 0.33 0
4096 1024 float sum -1 7.85 0.52 0.52 0 6.24 0.66 0.66 0
8192 2048 float sum -1 7.62 1.08 1.08 0 6.62 1.24 1.24 0
16384 4096 float sum -1 8.51 1.93 1.93 0 7.31 2.24 2.24 0
32768 8192 float sum -1 9.29 3.53 3.53 0 8.71 3.76 3.76 0
65536 16384 float sum -1 11.79 5.56 5.56 0 11.91 5.50 5.50 0
131072 32768 float sum -1 24.35 5.38 5.38 0 24.33 5.39 5.39 0
262144 65536 float sum -1 37.78 6.94 6.94 0 37.59 6.97 6.97 0
524288 131072 float sum -1 51.90 10.10 10.10 0 54.64 9.59 9.59 0
1048576 262144 float sum -1 78.71 13.32 13.32 0 65.04 16.12 16.12 0
2097152 524288 float sum -1 108.0 19.41 19.41 0 107.8 19.46 19.46 0
4194304 1048576 float sum -1 198.3 21.15 21.15 0 197.3 21.26 21.26 0
8388608 2097152 float sum -1 383.6 21.87 21.87 0 383.1 21.90 21.90 0
16777216 4194304 float sum -1 760.2 22.07 22.07 0 759.4 22.09 22.09 0
33554432 8388608 float sum -1 1510.3 22.22 22.22 0 1507.3 22.26 22.26 0
67108864 16777216 float sum -1 3014.7 22.26 22.26 0 3004.5 22.34 22.34 0
134217728 33554432 float sum -1 6008.4 22.34 22.34 0 6000.4 22.37 22.37 0
debug-tao4:983:983 [1] NCCL INFO comm 0x561be0c3b0c0 rank 0 nranks 2 cudaDev 0 busId 21000 - Destroy COMPLETE
debug-tao4:983:983 [1] NCCL INFO comm 0x561be0c3db50 rank 1 nranks 2 cudaDev 1 busId 22000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.0803
#
I need more time to create the TFRecords with the container, export the train specs, and test a training run with that…
NOTE: I will edit the earlier post with the export NCCL_DEBUG=TRACE output to compare both TAO versions.
NOTE2: I also tried the normal train in TAO 5 without AutoML, and the Horovod errors are the same. No progress in the training.
TAO5.
Deploy the pod.
Generate the TFRecords.
Launch the train.
WAIT, IT'S WORKING. In this edit, I modified the image_extension from image_extension: ".jpg" to image_extension: "jpg", and now the training starts.
Attaching a new log: log_tao5_train_working.txt (138.5 KB)
OLD
FAIL → It tries to look up pictures that do exist, but adds an extra dot to the file extension.
(1) Not found: /workspace/tao-experiments/GLOBAL_DATASET/training_v04/training/images/synt_seq2_10203_resize_bright..jpg; No such file or directory
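My reading of where the extra dot comes from (a sketch; I am assuming the data loader builds paths as stem + "." + image_extension, and the file stem here is taken from the error above):

```python
# Sketch of how the doubled dot can arise, assuming the loader joins
# "<stem>" + "." + image_extension (this joining rule is an assumption).
stem = "synt_seq2_10203_resize_bright"

bad_ext = ".jpg"    # extension configured WITH a leading dot
good_ext = "jpg"    # extension configured without the dot

bad_path = stem + "." + bad_ext    # -> "synt_seq2_10203_resize_bright..jpg"
good_path = stem + "." + good_ext  # -> "synt_seq2_10203_resize_bright.jpg"

print(bad_path)
print(good_path)
```

With the leading dot in the spec, the loader produces the "..jpg" path from the error message, which explains both the failure and why dropping the dot fixes it.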
Glad to know the 2-GPU training without the TAO API can work.
To narrow this down, for 2-GPU training with the TAO API, could you change specs["evaluation_config"]["first_validation_epoch"] = 10
to specs["evaluation_config"]["first_validation_epoch"] = 99?
This is a generic Horovod error which usually occurs because one of the ranks is still running a process while the other ranks are waiting for it. It typically happens during evaluation or model checkpointing, which runs only on rank 0 and may take a little while.
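In the API notebook's spec-dict notation, the suggested change is just the following (the dict here is a minimal sketch; in the real notebook the specs come from the TAO API with many more fields):

```python
# Hypothetical minimal specs dict mirroring the notebook's structure.
specs = {"evaluation_config": {"first_validation_epoch": 10}}

# Push the first validation far past the early epochs, so rank 0 is not
# busy evaluating while the other ranks wait on the Horovod allreduce.
specs["evaluation_config"]["first_validation_epoch"] = 99
```

If the stall disappears with validation pushed out, that points at rank-0-only work (evaluation/checkpointing) as the trigger.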
$ kubectl delete pod debug
$ kubectl apply -f debug-1.yaml
Then, check the log via
$ kubectl logs -f debug-1
Exp2
Please run the yolo_v4 network. Generate yolo_v4_spec.txt; you can refer to the specs in the notebook. Don't worry about parameters such as big_anchor_shape. We just want to make sure yolo_v4 can also work with 2 GPUs.
$ kubectl delete pod debug-1
$ kubectl apply -f debug-yolov4.yaml
Then, check the log via
$ kubectl logs -f debug-yolov4
Exp3
If experiment 2 works, please run the yolo_v4 network with notebooks/tao_api_starter_kit/api/object_detection.ipynb.
I recall you ran some other notebooks with 2 GPUs. Besides detectnet_v2, can you confirm which networks or notebooks can run with 2 GPUs?
Exp4
In notebooks/tao_api_starter_kit/api/object_detection.ipynb, please run the detectnet_v2 network again. When you reach the cell that runs "detectnet_v2 train", a new pod will usually appear in "kubectl get pods". Please describe it via "kubectl describe pod this-pod-name" and share the log with us. We need to check what the exact commands are.
Later, I just killed the pod to start another experiment.
So, I think I found where the issue comes from. Comparing the specs between the manual detectnet_v2 train and the automatic pod train, the only difference is specs["training_config"]["visualizer"]["enabled"] = True.
I am a little confused. Last Friday, you could run successfully with the debug pod (applying my debug.yaml). So, what was specs["training_config"]["visualizer"]["enabled"] set to when you ran the debug pod?
Yes, this setting is specific to the spec file.
The point is that the spec file used in this debug pod has this parameter OFF.
Attached is an extract from the spec file used in the pod:
This is the only important difference between both iterations.
In the API pod I have this parameter enabled because I want to start using ClearML and wandb to analyse the trainings.
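To summarise the difference in the notebook's spec-dict notation (the dicts here are a minimal sketch, not the full specs, which come from the TAO API):

```python
# What the manual debug-pod spec effectively contained:
debug_pod_specs = {"training_config": {"visualizer": {"enabled": False}}}

# What the API-driven training used (enabled for ClearML / wandb logging):
api_specs = {"training_config": {"visualizer": {"enabled": True}}}
```

This one flag is the only important difference between the run that completed and the run that hung in the Horovod allreduce.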