TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck

Hi,
To narrow down, could you please run training directly in the docker container via k8s?
Steps:

  1. Generate debug.yaml. An example is below. It will launch the docker image nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 and also mount my local folder /localhome/local-morganh/ into the pod.
$ cat debug.yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  restartPolicy: OnFailure
  containers:
  - name: "detectnetv2"
    image: "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5"
    command: ["sleep"]
    args: ["3600000"]
    resources:
      limits:
         nvidia.com/gpu: 2
    volumeMounts:
    - name: "my-workdir"
      mountPath: "/my-workdir"
  volumes:
  - name: my-workdir
    hostPath:
      path: /localhome/local-morganh/
  2. Then, launch the debug pod.
    $ kubectl apply -f debug.yaml

  3. Then enter the pod and run training, etc.

     $ kubectl exec -it debug -- /bin/bash
    
     root@debug:/workspace# nvidia-smi
     root@debug:/workspace# ls /my-workdir
     root@debug:/workspace# detectnet_v2 train -e <path_to_spec_file>   -r result    -k key    --gpus 2
    

That’s what I wanted to read.

I’ll try it and ping you back.

root@debug:/workspace# nvidia-smi
Fri Aug  4 09:35:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   32C    P8    24W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   36C    P8    28W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

First surprise: I suspect the bug is similar to the TAO4 one…

I run the same steps inside the debug pod.

root@debug:/workspace# git clone https://github.com/NVIDIA/nccl-tests.git
Cloning into 'nccl-tests'...
remote: Enumerating objects: 324, done.
remote: Counting objects: 100% (202/202), done.
remote: Compressing objects: 100% (75/75), done.
remote: Total 324 (delta 174), reused 129 (delta 127), pack-reused 122
Receiving objects: 100% (324/324), 117.10 KiB | 718.00 KiB/s, done.
Resolving deltas: 100% (213/213), done.

root@debug:/workspace# cd nccl-tests/

root@debug:/workspace/nccl-tests# make
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling  timer.cc                            > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/nccl-tests/build/common.o
Linking  /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Compiling  all_gather.cu                       > /workspace/nccl-tests/build/all_gather.o
Linking  /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Compiling  broadcast.cu                        > /workspace/nccl-tests/build/broadcast.o
Linking  /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Compiling  reduce_scatter.cu                   > /workspace/nccl-tests/build/reduce_scatter.o
Linking  /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Compiling  reduce.cu                           > /workspace/nccl-tests/build/reduce.o
Linking  /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Compiling  alltoall.cu                         > /workspace/nccl-tests/build/alltoall.o
Linking  /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Compiling  scatter.cu                          > /workspace/nccl-tests/build/scatter.o
Linking  /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Compiling  gather.cu                           > /workspace/nccl-tests/build/gather.o
Linking  /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Compiling  sendrecv.cu                         > /workspace/nccl-tests/build/sendrecv.o
Linking  /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Compiling  hypercube.cu                        > /workspace/nccl-tests/build/hypercube.o
Linking  /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'

root@debug:/workspace/nccl-tests# export NCCL_DEBUG=DEBUG

root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1170 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1170 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

It gets stuck here…

I don’t know whether to convert the API training spec to a local spec, or wait until this is solved first.

Hi,
In TAO API - Detectnet_v2 - Multi GPU Stuck, you could not run the nccl-test successfully under either
nvcr.io/nvidia/tensorrt:22.11-py3 or nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5. The solution there was to export NCCL_P2P_LEVEL=NVL.

We are now considering another solution.
Could you please exit the debug pod and run the commands below on your machine? Please share the results with us.
$ sudo lspci -vvv | grep ACSCtl
$ dmesg | grep IOMMU

I tried export NCCL_P2P_LEVEL=NVL, and got the same result.

root@debug:/workspace# export NCCL_P2P_LEVEL=NVL
root@debug:/workspace# export NCCL_DEBUG=TRACE
root@debug:/workspace/nccl-tests#  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  12348 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid  12348 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
debug:12348:12348 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.98<0>
debug:12348:12348 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug:12348:12348 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug:12348:12348 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug:12348:12348 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug:12348:12348 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.16.5+cuda12.0
debug:12348:12358 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug:12348:12358 [1] NCCL INFO P2P plugin IBext
debug:12348:12358 [1] NCCL INFO NET/IB : No device found.
debug:12348:12358 [1] NCCL INFO NET/IB : No device found.
debug:12348:12358 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.98<0>
debug:12348:12358 [1] NCCL INFO Using network Socket
debug:12348:12357 [0] NCCL INFO Using network Socket
debug:12348:12357 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
debug:12348:12357 [0] NCCL INFO Channel 00/04 :    0   1
debug:12348:12357 [0] NCCL INFO Channel 01/04 :    0   1
debug:12348:12357 [0] NCCL INFO Channel 02/04 :    0   1
debug:12348:12358 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
debug:12348:12358 [1] NCCL INFO P2P Chunksize set to 131072
debug:12348:12357 [0] NCCL INFO Channel 03/04 :    0   1
debug:12348:12357 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
debug:12348:12357 [0] NCCL INFO P2P Chunksize set to 131072

debug:12348:12358 [1] misc/shmutils.cc:103 NCCL WARN Cuda failure 'invalid argument'
[debug:12348:0:12357] Caught signal 7 (Bus error: nonexistent physical address)

debug:12348:12358 [1] misc/shmutils.cc:114 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-QUqg7T (size 9637888)
Bus error (core dumped)

$ sudo lspci -vvv | grep ACSCtl
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
$ dmesg | grep IOMMU
[    2.512483] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
[    2.512519] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    2.512540] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
[    2.512558] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    2.542771] pci 0000:60:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.542776] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.542780] pci 0000:20:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.542782] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.545858] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    2.545870] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    2.545879] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    2.545888] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).

Hi,
The hang is due to the PCI switch Access Control Services (ACS), which block the P2P calls on the system.

Troubleshooting:

  • PCI ACS is enabled, which blocks the NCCL P2P calls.
  • IOMMU is not in pass-through mode:
    • Check: dmesg | grep IOMMU
      • (If the board and CPU have an IOMMU.) If there is no message about IOMMU, then IOMMU is not enabled.
    • Solution (a quick verification sketch follows these steps):
      1. Edit /etc/default/grub:
        # AMD CPU
        GRUB_CMDLINE_LINUX_DEFAULT="<Other options> amd_iommu=on iommu=pt"
        # Intel CPU
        GRUB_CMDLINE_LINUX_DEFAULT="<Other options> intel_iommu=on iommu=pt"
      2. Update grub:
        $ sudo update-grub
      3. Reboot the machine:
        $ sudo shutdown -r now
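
After the reboot, a minimal verification sketch (assumed commands, re-using the same checks as above; the nvidia-smi line is optional):

$ dmesg | grep -i iommu          # expect pass-through related messages (or nothing if IOMMU is disabled in the BIOS)
$ sudo lspci -vvv | grep ACSCtl  # the redirect bits (ReqRedir/CmpltRedir/UpstreamFwd) should all show "-"
$ nvidia-smi topo -m             # optional: confirm the GPU-to-GPU path reported by the driver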

Thanks for the instructions. I am still blocked on the multi-GPU training.

I disabled IOMMU and ACS in the BIOS.
The logs are now the following:

tkeic@azken:~$ sudo dmesg | grep IOMMU
tkeic@azken:~$ 

tkeic@azken:~$ sudo lspci -vvv | grep ACSCtl
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

BUT… I still get stuck at the same point in the AutoML training with 2 GPUs:

INFO:tensorflow:Graph was finalized.
2023-08-07 10:10:07,733 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-07 10:10:10,388 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-07 10:10:11,018 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-07 10:10:21,599 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.
[2023-08-07 10:11:28. 10180: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond_142/HorovodAllreduce_mul_333_0, DistributedAdamOptimizer_Allreduce/cond_143/HorovodAllreduce_mul_334_0, DistributedAdamOptimizer_Allreduce/cond_144/HorovodAllreduce_mul_335_0, DistributedAdamOptimizer_Allreduce/cond_145/HorovodAllreduce_mul_336_0, DistributedAdamOptimizer_Allreduce/cond_146/HorovodAllreduce_mul_337_0, DistributedAdamOptimizer_Allreduce/cond_147/HorovodAllreduce_mul_338_0 ...]
[2023-08-07 10:12:28. 10561: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_191_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_192_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_201_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_291_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_292_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_293_0 ...]
1: [DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_355_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_356_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_357_0, DistributedAdamOptimizer_Allreduce/cond_103/HorovodAllreduce_mul_358_0, DistributedAdamOptimizer_Allreduce/cond_104/HorovodAllreduce_mul_359_0, DistributedAdamOptimizer_Allreduce/cond_105/HorovodAllreduce_mul_360_0 ...]

I also tried the nccl-tests method again:

root@debug:/workspace/nccl-tests# export NCCL_DEBUG=TRACE
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1153 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1153 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
debug:1153:1153 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.104<0>
debug:1153:1153 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug:1153:1153 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug:1153:1153 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug:1153:1153 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug:1153:1153 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.16.5+cuda12.0
debug:1153:1162 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug:1153:1162 [0] NCCL INFO P2P plugin IBext
debug:1153:1162 [0] NCCL INFO NET/IB : No device found.
debug:1153:1162 [0] NCCL INFO NET/IB : No device found.
debug:1153:1162 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.104<0>
debug:1153:1162 [0] NCCL INFO Using network Socket
debug:1153:1163 [1] NCCL INFO Using network Socket
debug:1153:1162 [0] NCCL INFO Channel 00/04 :    0   1
debug:1153:1162 [0] NCCL INFO Channel 01/04 :    0   1
debug:1153:1163 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
debug:1153:1163 [1] NCCL INFO P2P Chunksize set to 131072
debug:1153:1162 [0] NCCL INFO Channel 02/04 :    0   1
debug:1153:1162 [0] NCCL INFO Channel 03/04 :    0   1
debug:1153:1162 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
debug:1153:1162 [0] NCCL INFO P2P Chunksize set to 131072
debug:1153:1162 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1153:1163 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1153:1162 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1153:1163 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1153:1162 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1153:1163 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1153:1162 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1153:1163 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1153:1162 [0] NCCL INFO Connected all rings
debug:1153:1162 [0] NCCL INFO Connected all trees
debug:1153:1163 [1] NCCL INFO Connected all rings
debug:1153:1163 [1] NCCL INFO Connected all trees
debug:1153:1163 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
debug:1153:1163 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1153:1162 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
debug:1153:1162 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1153:1162 [0] NCCL INFO comm 0x55b033df9cd0 rank 0 nranks 2 cudaDev 0 busId 21000 commId 0x78df2eb5ffee4e8d - Init COMPLETE
debug:1153:1163 [1] NCCL INFO comm 0x55b033dfe420 rank 1 nranks 2 cudaDev 1 busId 22000 commId 0x78df2eb5ffee4e8d - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     7.42    0.00    0.00      0     6.18    0.00    0.00      0
          16             4     float     sum      -1     6.71    0.00    0.00      0     6.06    0.00    0.00      0
          32             8     float     sum      -1     6.19    0.01    0.01      0     6.09    0.01    0.01      0
          64            16     float     sum      -1     6.14    0.01    0.01      0     6.13    0.01    0.01      0
         128            32     float     sum      -1     6.12    0.02    0.02      0     6.08    0.02    0.02      0
         256            64     float     sum      -1     6.12    0.04    0.04      0     6.07    0.04    0.04      0
         512           128     float     sum      -1     9.43    0.05    0.05      0     6.15    0.08    0.08      0
        1024           256     float     sum      -1     6.15    0.17    0.17      0     6.10    0.17    0.17      0
        2048           512     float     sum      -1     6.21    0.33    0.33      0     6.10    0.34    0.34      0
        4096          1024     float     sum      -1     6.33    0.65    0.65      0     6.22    0.66    0.66      0
        8192          2048     float     sum      -1     6.71    1.22    1.22      0     6.67    1.23    1.23      0
       16384          4096     float     sum      -1     7.47    2.19    2.19      0     7.37    2.22    2.22      0
       32768          8192     float     sum      -1     8.85    3.70    3.70      0     8.79    3.73    3.73      0
       65536         16384     float     sum      -1    12.06    5.43    5.43      0    12.03    5.45    5.45      0
      131072         32768     float     sum      -1    25.03    5.24    5.24      0    24.88    5.27    5.27      0
      262144         65536     float     sum      -1    34.98    7.49    7.49      0    38.05    6.89    6.89      0
      524288        131072     float     sum      -1    49.06   10.69   10.69      0    49.49   10.59   10.59      0
     1048576        262144     float     sum      -1    81.77   12.82   12.82      0    64.79   16.18   16.18      0
     2097152        524288     float     sum      -1    108.2   19.39   19.39      0    107.2   19.56   19.56      0
     4194304       1048576     float     sum      -1    197.9   21.19   21.19      0    197.6   21.22   21.22      0
     8388608       2097152     float     sum      -1    386.3   21.72   21.72      0    384.2   21.84   21.84      0
    16777216       4194304     float     sum      -1    762.6   22.00   22.00      0    760.0   22.08   22.08      0
    33554432       8388608     float     sum      -1   1513.3   22.17   22.17      0   1512.6   22.18   22.18      0
    67108864      16777216     float     sum      -1   3012.9   22.27   22.27      0   3011.5   22.28   22.28      0
   134217728      33554432     float     sum      -1   6018.4   22.30   22.30      0   6014.8   22.31   22.31      0
debug:1153:1153 [1] NCCL INFO comm 0x55b033df9cd0 rank 0 nranks 2 cudaDev 0 busId 21000 - Destroy COMPLETE
debug:1153:1153 [1] NCCL INFO comm 0x55b033dfe420 rank 1 nranks 2 cudaDev 1 busId 22000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.10962 
#

I think it looks good. No exported variables: just open the pod, clone the repo, compile, and launch.

OK, so nccl-test is fine.
For detectnet_v2 training, to narrow down, could you please generate a new debug pod for 4.0.1 or 22.05 and test in it?
nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3

For example, you can set the following (a sketch of the full flow follows the fragment below):

  ...
metadata:
  name: debug-4
spec:
  restartPolicy: OnFailure
  containers:
  - name: "detectnetv2"
    image: "nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5"
  ...
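
As a minimal sketch (assuming the debug.yaml from earlier is in the current directory), you can also create this second pod by copying the original yaml and swapping only the pod name and the image tag:

$ sed -e 's/name: debug/name: debug-4/' \
      -e 's#tao-toolkit:5.0.0-tf1.15.5#tao-toolkit:4.0.1-tf1.15.5#' \
      debug.yaml > debug-4.yaml
$ kubectl apply -f debug-4.yaml
$ kubectl exec -it debug-4 -- /bin/bash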

Argh, all this time lost because of the I/O virtualization…

LOG
root@debug-tao4:/workspace/nccl-tests# export NCCL_DEBUG=TRACE
root@debug-tao4:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    983 on debug-tao4 device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid    983 on debug-tao4 device  1 [0x22] NVIDIA RTX 6000 Ada Generation
debug-tao4:983:983 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.103<0>
debug-tao4:983:983 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug-tao4:983:983 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug-tao4:983:983 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug-tao4:983:983 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug-tao4:983:983 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
debug-tao4:983:992 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug-tao4:983:992 [0] NCCL INFO P2P plugin IBext
debug-tao4:983:992 [0] NCCL INFO NET/IB : No device found.
debug-tao4:983:992 [0] NCCL INFO NET/IB : No device found.
debug-tao4:983:992 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.103<0>
debug-tao4:983:992 [0] NCCL INFO Using network Socket
debug-tao4:983:993 [1] NCCL INFO Using network Socket
debug-tao4:983:992 [0] NCCL INFO Channel 00/04 :    0   1
debug-tao4:983:992 [0] NCCL INFO Channel 01/04 :    0   1
debug-tao4:983:992 [0] NCCL INFO Channel 02/04 :    0   1
debug-tao4:983:992 [0] NCCL INFO Channel 03/04 :    0   1
debug-tao4:983:993 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
debug-tao4:983:992 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
debug-tao4:983:992 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug-tao4:983:993 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug-tao4:983:992 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug-tao4:983:993 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug-tao4:983:992 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug-tao4:983:993 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug-tao4:983:992 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug-tao4:983:993 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug-tao4:983:992 [0] NCCL INFO Connected all rings
debug-tao4:983:993 [1] NCCL INFO Connected all rings
debug-tao4:983:992 [0] NCCL INFO Connected all trees
debug-tao4:983:993 [1] NCCL INFO Connected all trees
debug-tao4:983:993 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
debug-tao4:983:993 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug-tao4:983:992 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
debug-tao4:983:992 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug-tao4:983:993 [1] NCCL INFO comm 0x561be0c3db50 rank 1 nranks 2 cudaDev 1 busId 22000 - Init COMPLETE
debug-tao4:983:992 [0] NCCL INFO comm 0x561be0c3b0c0 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     6.05    0.00    0.00      0     6.10    0.00    0.00      0
          16             4     float     sum      -1     7.36    0.00    0.00      0     6.08    0.00    0.00      0
          32             8     float     sum      -1     7.50    0.00    0.00      0     6.06    0.01    0.01      0
          64            16     float     sum      -1     7.60    0.01    0.01      0     6.03    0.01    0.01      0
         128            32     float     sum      -1     7.36    0.02    0.02      0     6.03    0.02    0.02      0
         256            64     float     sum      -1     7.72    0.03    0.03      0     6.02    0.04    0.04      0
         512           128     float     sum      -1     7.37    0.07    0.07      0     6.07    0.08    0.08      0
        1024           256     float     sum      -1     7.68    0.13    0.13      0     6.06    0.17    0.17      0
        2048           512     float     sum      -1     7.40    0.28    0.28      0     6.21    0.33    0.33      0
        4096          1024     float     sum      -1     7.85    0.52    0.52      0     6.24    0.66    0.66      0
        8192          2048     float     sum      -1     7.62    1.08    1.08      0     6.62    1.24    1.24      0
       16384          4096     float     sum      -1     8.51    1.93    1.93      0     7.31    2.24    2.24      0
       32768          8192     float     sum      -1     9.29    3.53    3.53      0     8.71    3.76    3.76      0
       65536         16384     float     sum      -1    11.79    5.56    5.56      0    11.91    5.50    5.50      0
      131072         32768     float     sum      -1    24.35    5.38    5.38      0    24.33    5.39    5.39      0
      262144         65536     float     sum      -1    37.78    6.94    6.94      0    37.59    6.97    6.97      0
      524288        131072     float     sum      -1    51.90   10.10   10.10      0    54.64    9.59    9.59      0
     1048576        262144     float     sum      -1    78.71   13.32   13.32      0    65.04   16.12   16.12      0
     2097152        524288     float     sum      -1    108.0   19.41   19.41      0    107.8   19.46   19.46      0
     4194304       1048576     float     sum      -1    198.3   21.15   21.15      0    197.3   21.26   21.26      0
     8388608       2097152     float     sum      -1    383.6   21.87   21.87      0    383.1   21.90   21.90      0
    16777216       4194304     float     sum      -1    760.2   22.07   22.07      0    759.4   22.09   22.09      0
    33554432       8388608     float     sum      -1   1510.3   22.22   22.22      0   1507.3   22.26   22.26      0
    67108864      16777216     float     sum      -1   3014.7   22.26   22.26      0   3004.5   22.34   22.34      0
   134217728      33554432     float     sum      -1   6008.4   22.34   22.34      0   6000.4   22.37   22.37      0
debug-tao4:983:983 [1] NCCL INFO comm 0x561be0c3b0c0 rank 0 nranks 2 cudaDev 0 busId 21000 - Destroy COMPLETE
debug-tao4:983:983 [1] NCCL INFO comm 0x561be0c3db50 rank 1 nranks 2 cudaDev 1 busId 22000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.0803 
#

I need more time to create the TFRecords with the container, export the training specs, and test a training run with that…

NOTE: I will edit the post above to include the export NCCL_DEBUG=TRACE output, to compare both TAO versions.
NOTE 2: I also tried the normal training (without AutoML) in TAO5 and the Horovod errors are the same. No progress in the training.

I don’t understand anything.

TAO5.
Deploy the pod.
Generate the TFRecords.
Launch the training.

WAIT, IT'S WORKING. In this edit, I changed the image_extension from image_extension: ".jpg" to image_extension: "jpg", and now the training starts.
I attach the new log:
log_tao5_train_working.txt (138.5 KB)

OLD

FAIL → It looks for images that do exist, but adds an extra dot to the file extension.

 (1) Not found:  /workspace/tao-experiments/GLOBAL_DATASET/training_v04/training/images/synt_seq2_10203_resize_bright..jpg; No such file or directory

I attach the full log:
log.txt (98.2 KB)

TAO4.
Deploy the pod.
Generate the TFRecords.
Launch the training.

FAIL → Operation not permitted. Horovod appears again.
I attach the full log:
log_tao4.txt (98.2 KB)

Could you upload the TAO5 log again? It seems you uploaded the same file as for TAO4.

I updated the post above.

TAO5 is working. I don’t understand why the process doesn’t start with the API.

Glad to know the 2-GPU training without the TAO API can work.

To narrow down, for 2-GPU training with the TAO API, could you change
specs["evaluation_config"]["first_validation_epoch"] = 10
to
specs["evaluation_config"]["first_validation_epoch"] = 99

The message below is a generic Horovod error which usually appears because one of the ranks is still running a process and the other ones are waiting for it. It usually happens when running evaluation or model checkpointing, which run only on rank 0 and may take a little while.

Hi,
Please help run the experiments below to narrow down.
Exp1
Instead of “sleep 3600000”, we apply the new yaml below to run the commands directly.

$ cat debug-1.yaml
apiVersion: v1
kind: Pod
metadata:
 name: debug-1
spec:
 restartPolicy: OnFailure
 containers:
 - name: "detectnetv2"
   image: "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5"
   command: ["detectnet_v2"]
   args: ["train", "-e", "/my-workdir/detectnet_v2_spec.txt", "-r", "/my-workdir/result", "-k", "key", "--gpus", "2"]
   resources:
     limits:
        nvidia.com/gpu: 2
   volumeMounts:
   - name: "my-workdir"
     mountPath: "/my-workdir"
     #subPath: "<file1>"
 volumes:
 - name: my-workdir
   hostPath:
     path: /localhome/local-morganh/

$ kubectl delete pod debug
$ kubectl apply -f debug-1.yaml
Then, check the log via
$ kubectl logs -f debug-1

Exp2
Please run the yolo_v4 network. Generate yolo_v4_spec.txt; you can refer to the specs in the notebook. Do not worry about parameters such as big_anchor_shape. We just want to make sure that yolo_v4 can also work with 2 GPUs.

$ cat debug-yolov4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-yolov4
spec:
  restartPolicy: OnFailure
  containers:
  - name: "yolov4"
    image: "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5"
    command: ["yolo_v4"]
    args: ["train", "-e", "/my-workdir/yolo_v4_spec.txt", "-r", "/my-workdir/result", "-k", "key", "--gpus", "2"]
    resources:
      limits:
         nvidia.com/gpu: 2
    volumeMounts:
    - name: "my-workdir"
      mountPath: "/my-workdir"
      #subPath: "<file1>"
  volumes:
  - name: my-workdir
    hostPath:
      path: /localhome/local-morganh/

$ kubectl delete pod debug-1
$ kubectl apply -f debug-yolov4.yaml
Then, check the log via
$ kubectl logs -f debug-yolov4

Exp3
If Exp2 works, please run the yolo_v4 network with notebooks/tao_api_starter_kit/api/object_detection.ipynb.
I recall you ran some other notebooks with 2 GPUs. Besides detectnet_v2, can you confirm which networks or notebooks can run with 2 GPUs?

Exp4
In notebooks/tao_api_starter_kit/api/object_detection.ipynb, please run the detectnet_v2 network again. When you reach the cell that runs “detectnet_v2 train”, a new pod will usually appear in “kubectl get pods”. Please describe it via “kubectl describe pod this-pod-name” and share the output with us. We need to check what the exact commands are.
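
For reference, a minimal sketch of the commands for this step (the pod name is a placeholder; use the name reported by “kubectl get pods” for the training job):

$ kubectl get pods
$ kubectl describe pod <training-pod-name> > describe.txt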

Too good to be true.

Same behaviour. The pod is created, starts loading things and doing work, and gets stuck at the same point with the Horovod message.
I attach the GPU utilization/memory.

I attach the log generated by the pod:
08c4de03-79c0-47e9-b96f-c7cceb8ef163.txt (130.7 KB)
Train.json and protobuf
08c4de03-79c0-47e9-b96f-c7cceb8ef163.protobuf (10.5 KB)
train.json (16.0 KB)

I also attach the pod description, related to Exp4:

tkeic@azken:~$ kubectl describe pod 08c4de03-79c0-47e9-b96f-c7cceb8ef163-98nws
Name:         08c4de03-79c0-47e9-b96f-c7cceb8ef163-98nws
Namespace:    default
Priority:     0
Node:         azken/10.1.1.10
Start Time:   Wed, 09 Aug 2023 09:07:29 +0200
Labels:       controller-uid=a1f5c025-9d77-486c-8056-16a2b737e44d
              job-name=08c4de03-79c0-47e9-b96f-c7cceb8ef163
              purpose=tao-toolkit-job
Annotations:  cni.projectcalico.org/containerID: ed39e2b2accc67112f10b17cf62298b34bdeeddd9384c507056e22c85c7d6846
              cni.projectcalico.org/podIP: 192.168.35.109/32
              cni.projectcalico.org/podIPs: 192.168.35.109/32
Status:       Running
IP:           192.168.35.109
IPs:
  IP:           192.168.35.109
Controlled By:  Job/08c4de03-79c0-47e9-b96f-c7cceb8ef163
Containers:
  container:
    Container ID:  containerd://2f5d7c3ed9b073ee295654d856961c3cc72f73991989ea2e308ea098f32a1606
    Image:         nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
    Image ID:      nvcr.io/nvidia/tao/tao-toolkit@sha256:17edbefc6428c656e0d8ae50e9460d22cb18e37e2b90d6640da4d33c203aacfe
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      umask 0 && detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/40e607aa-28fa-46ec-9474-13da19bb10b6/specs/08c4de03-79c0-47e9-b96f-c7cceb8ef163.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/40e607aa-28fa-46ec-9474-13da19bb10b6/08c4de03-79c0-47e9-b96f-c7cceb8ef163/ --verbose --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/40e607aa-28fa-46ec-9474-13da19bb10b6/logs/08c4de03-79c0-47e9-b96f-c7cceb8ef163.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/40e607aa-28fa-46ec-9474-13da19bb10b6/logs/08c4de03-79c0-47e9-b96f-c7cceb8ef163.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/40e607aa-28fa-46ec-9474-13da19bb10b6/08c4de03-79c0-47e9-b96f-c7cceb8ef163/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/40e607aa-28fa-46ec-9474-13da19bb10b6/08c4de03-79c0-47e9-b96f-c7cceb8ef163/ -type f | xargs chmod 666
    State:          Running
      Started:      Wed, 09 Aug 2023 09:07:31 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  2
    Requests:
      nvidia.com/gpu:  2
    Environment:
      NUM_GPU_PER_NODE:        2
      TELEMETRY_OPT_OUT:       no
      WANDB_API_KEY:           XXXX
      CLEARML_WEB_HOST:        https://app.clear.ml
      CLEARML_API_HOST:        https://api.clear.ml
      CLEARML_FILES_HOST:      https://files.clear.ml
      CLEARML_API_ACCESS_KEY:  XXXX
      CLEARML_API_SECRET_KEY:  XXXX
    Mounts:
      /dev/shm from dshm (rw)
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pr2d2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-pr2d2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  43s   default-scheduler  Successfully assigned default/08c4de03-79c0-47e9-b96f-c7cceb8ef163-98nws to azken
  Normal  Pulled     41s   kubelet            Container image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5" already present on machine
  Normal  Created    40s   kubelet            Created container container
  Normal  Started    40s   kubelet            Started container container

We will now start on the other experiments.

Quick request. Could you dump the yaml for the training pod you are running?
$ kubectl get pods the-training-pod-name -o yaml > training.yaml

Too late, I had just killed the pod to start another experiment.

So, I think I found where the issue comes from. Comparing the specs between the manual detectnet_v2 training and the automatic pod training, the only difference is specs["training_config"]["visualizer"]["enabled"] = True.

Just changing it to false, and the training starts…

I attach the log:
384c23a8-bc88-42a9-9105-47dd189d131a.txt (189.9 KB)

I attach a picture:

I am a little confused. Last Friday, you could run successfully with the debug pod (applying my debug.yaml). So, what was specs["training_config"]["visualizer"]["enabled"] set to when you ran the debug pod?

I see now. specs["training_config"]["visualizer"]["enabled"] is false when you run the debug pod.

Yes, this comes from the spec file.
The spec file used in this debug pod has this parameter OFF.
Here is an extract from the spec file used in the pod:

training_config {
  batch_size_per_gpu: 24
  num_epochs: 100
  checkpoint_interval: 10
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.100000001
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-09
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.899999976
      beta2: 0.999000013
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  visualizer {
    enabled: false
  }
}

And this is the spec file used in the API pod:

training_config {
  batch_size_per_gpu: 24
  checkpoint_interval: 10
  cost_scaling {
    decrement: 1.0
    enabled: False
    increment: 0.005
    initial_exponent: 20.0
  }
  enable_qat: False
  learning_rate {
    soft_start_annealing_schedule {
      annealing: 0.699999988
      max_learning_rate: 0.0005
      min_learning_rate: 5e-06
      soft_start: 0.100000001
    }
  }
  num_epochs: 100
  optimizer {
    adam {
      beta1: 0.899999976
      beta2: 0.999000013
      epsilon: 1e-08
    }
  }
  regularizer {
    type: L1
    weight: 3e-09
  }
  visualizer {
    enabled: True
    infrequent_logging_frequency: 5
    num_images: 3
  }
}

This is the only important difference between the two runs.
In the API pod this option is enabled because I want to start using clear.ml and wandb to analyse the trainings.