TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck

Please provide the following information when requesting support.

• Hardware: 2x RTXA6000ADA
• Network Type: Detectnet_v2
• TLT Version: 5.0.0

After deploying, logging in, and launching the new TAO 5, I started the process with multiple GPUs.

Following the other post: [TAO5 - Detectnet_v2 - MultiGPU TAO-API Dead at train start](https://forums.developer.nvidia.com/t/tao5-detectnet-v2-multigpu-tao-api-dead-at-train-start)

After correctly starting the TAO training pod with ONE (1) GPU and confirming in the logs that training is running, I started a new training run using TWO (2) GPUs.

The TAO training pod with multiple GPUs starts correctly, sets up and loads the datasets, and interprets the specs correctly, but it does not perform the training. It gets stuck in the first steps of training. I attach the full log:
16f1496a-7eee-44f2-9eb4-1c93e4f9720c.txt (125.8 KB)

INFO:tensorflow:Graph was finalized.
2023-08-02 07:37:02,535 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-02 07:37:05,157 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-02 07:37:05,797 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-02 07:37:16,659 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.

I have been stuck at this step for at least 30 minutes or more, so I think the behavior is similar to my old TAO 4 issue: TAO API - Detectnet_v2 - Multi GPU Stuck

Suggestions?

Since I tried to follow all the “official” steps, the drivers live only inside the Kubernetes cluster, so I have no way to monitor GPU load (nvtop), but I suppose the behavior is the same as before.

Did you configure the GPUs?
Please follow the steps below.

After the bare-metal installation steps (bash setup.sh install), the default Helm values are used. If anything in the chart has to be changed, please run the following commands.

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.0.0.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-5.0.0.tgz -C tao-toolkit-api
# uninstall old tao-api
helm ls
helm delete tao-toolkit-api

# change tao-toolkit-api/values.yaml
maxNumGpuPerNode: 2   # set this to the maximum number of GPUs in your machine

# re install tao-api
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
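
As an optional sanity check (not part of the official steps; the release name and pod label are the ones that appear later in this thread), you can confirm the new value actually reached the deployed API pod:

# computed values of the installed release
helm get values tao-toolkit-api --all | grep -i maxNumGpuPerNode

# environment variable inside the running API pod
kubectl describe pod -l name=tao-toolkit-api-app-pod | grep NUM_GPU_PER_NODE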

Yes… this was done before posting the issue.

Here is the pod describe output:

$ kubectl describe pod tao-toolkit-api-app-pod-86f68f6547-mqq6j
Name:         tao-toolkit-api-app-pod-86f68f6547-mqq6j
Namespace:    default
Priority:     0
Node:         azken/10.1.1.10
Start Time:   Tue, 01 Aug 2023 12:21:30 +0200
Labels:       name=tao-toolkit-api-app-pod
              pod-template-hash=86f68f6547
Annotations:  cni.projectcalico.org/containerID: 5b259f26894a800e4678e2dafcffd48fb0f9ae73c5e7c1facf0cee3e343d7d7c
              cni.projectcalico.org/podIP: 192.168.35.73/32
              cni.projectcalico.org/podIPs: 192.168.35.73/32
Status:       Running
IP:           192.168.35.73
IPs:
  IP:           192.168.35.73
Controlled By:  ReplicaSet/tao-toolkit-api-app-pod-86f68f6547
Containers:
  tao-toolkit-api-app:
    Container ID:   containerd://774d47edd2483b160a382d81bf48e6bef4ae18ef9d87c0a32d84c0e4a973a0ef
    Image:          nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
    Image ID:       nvcr.io/nvidia/tao/tao-toolkit@sha256:45e93283d23a911477cc433ec43e233af1631e85ec0ba839e63780c30dd2d70b
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 01 Aug 2023 12:22:16 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 01 Aug 2023 12:21:39 +0200
      Finished:     Tue, 01 Aug 2023 12:22:13 +0200
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:8000/api/v1/health/liveness delay=3s timeout=3s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8000/api/v1/health/readiness delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      NAMESPACE:            default
      CLAIMNAME:            tao-toolkit-api-pvc
      IMAGEPULLSECRET:      imagepullsecret
      AUTH_CLIENT_ID:       xxxxxxx
      NUM_GPU_PER_NODE:     2
      BACKEND:              local-k8s
      IMAGE_TF1:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
      IMAGE_PYT:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
      IMAGE_TF2:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
      IMAGE_TAO_DEPLOY:     nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy
      IMAGE_DEFAULT:        nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
      IMAGE_API:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
      IMAGE_DATA_SERVICES:  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-data-services
      PYTHONIOENCODING:     utf-8
      LC_ALL:               C.UTF-8
    Mounts:
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v8vwg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  kube-api-access-v8vwg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

NUM_GPU_PER_NODE: 2

And in the spec file I modified the new parameter:

    specs["gpus"] = 2  

I repeated your steps, in case that was the cause… and got the same result.
Stuck at the same point.

INFO:tensorflow:Done running local_init_op.
2023-08-02 11:57:16,075 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-08-02 11:57:18,766 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-02 11:57:21,394 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-02 11:57:21,997 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-02 11:57:32,627 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.

How can I check that both GPUs are being used inside the pod?

Is there a procedure similar to TAO 4’s?

You can check with:
$ kubectl exec -n gpu-operator nvidia-smi-azken -- nvidia-smi
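
If that helper pod is not available, a couple of generic Kubernetes checks can also show whether the training job really received both GPUs. The pod name below is a placeholder, and running nvidia-smi in-pod assumes the training image ships it:

# find the TAO training pod spawned by the API
$ kubectl get pods | grep tao

# number of GPUs the pod requested from the device plugin
$ kubectl describe pod <training-pod-name> | grep -i nvidia.com/gpu

# utilization as seen from inside the pod
$ kubectl exec -it <training-pod-name> -- nvidia-smi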

In my latest back-and-forth with the Kubernetes cluster, this pod disappeared, so I used the gpu-operator driver pod to check it.

$ kubectl exec -n nvidia-gpu-operator -it nvidia-driver-daemonset-47qzh -- nvidia-smi
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
Thu Aug  3 08:34:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 51%   77C    P2   269W / 300W |   9483MiB / 49140MiB |     94%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   39C    P8    28W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1783765      C   /usr/bin/python                   424MiB |
|    0   N/A  N/A   1783942      C   python                           9052MiB |
+-----------------------------------------------------------------------------+

That is not expected. After installation, there is a pod named “nvidia-smi-xxxx”.

Yeah, I saw it at some point in the timeline.

You can check my messed-up Kubernetes node:

Try again…

I think you modified my original post. Never mind.

When the AutoML training finishes, I will test a clean installation again. I’m having nightmares with this application…

Oh, sorry… I clicked edit on your post by mistake.

To sum up…

From the above, there is currently no nvidia-smi pod.
I still suspect something is mismatched between your uninstall and reinstall of TAO-API.
When you have time, please re-install and share the full log with us.
After installation, there should be a pod named nvidia-smi-xxxx, and you can use it to run “nvidia-smi”.

Here is the output:

$ kubectl exec nvidia-smi-azken -- nvidia-smi
Thu Aug  3 12:37:27 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   37C    P8    25W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   43C    P8    29W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

During a node drain for maintenance, the nvidia-smi pod was lost.

Can’t the numGpu TAO parameter be added to the Ansible installation, so it only needs to be deployed once?

I tried to launch the AutoML training with multiple GPUs.

c4ee8702-784f-470d-bb7b-65158f76a1c9/experiment_0/
INFO:tensorflow:Graph was finalized.
2023-08-03 13:03:32,760 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-03 13:03:35,407 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-03 13:03:36,031 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-03 13:03:46,873 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.
[2023-08-03 13:05:40.195038: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_191_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_192_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_201_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_291_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_292_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_293_0 ...]
1: [DistributedAdamOptimizer_Allreduce/cond_134/HorovodAllreduce_mul_389_0, DistributedAdamOptimizer_Allreduce/cond_135/HorovodAllreduce_mul_390_0, DistributedAdamOptimizer_Allreduce/cond_136/HorovodAllreduce_mul_391_0, DistributedAdamOptimizer_Allreduce/cond_137/HorovodAllreduce_mul_392_0, DistributedAdamOptimizer_Allreduce/cond_138/HorovodAllreduce_mul_393_0, DistributedAdamOptimizer_Allreduce/cond_139/HorovodAllreduce_mul_394_0 ...]
[2023-08-03 13:06:40.195721: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_191_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_192_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_201_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_291_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_292_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_293_0 ...]
1: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_255_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_256_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_265_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_355_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_356_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_357_0 ...]
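
The Horovod stall-inspector warning above means the two ranks never complete the first allreduce and sit waiting on each other. A generic, TAO-independent check (reusing the driver daemonset pod from earlier in this thread) is to look at the GPU-to-GPU topology on the node, since peer-to-peer or NCCL transport problems can surface exactly like this:

# interconnect matrix between the two GPUs, as seen by the driver
$ kubectl exec -n nvidia-gpu-operator -it nvidia-driver-daemonset-47qzh -- nvidia-smi topo -m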

As mentioned above, for TAO 5.0 you need to set maxNumGpuPerNode.

Does 1 GPU work?

It is already set, for the second time now.

    Environment:
      NAMESPACE:            default
      CLAIMNAME:            tao-toolkit-api-pvc
      IMAGEPULLSECRET:      imagepullsecret
      AUTH_CLIENT_ID:       xxx
      NUM_GPU_PER_NODE:     2
      BACKEND:              local-k8s
      IMAGE_TF1:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
      IMAGE_PYT:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
      IMAGE_TF2:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
      IMAGE_TAO_DEPLOY:     nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy
      IMAGE_DEFAULT:        nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
      IMAGE_API:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
      IMAGE_DATA_SERVICES:  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-data-services
      PYTHONIOENCODING:     utf-8
      LC_ALL:               C.UTF-8

Yes.

Please stop stringing me along. If the developers are on vacation… tell me; don’t make me waste more time.

I will try to reproduce this with 2 GPUs. Not sure what is happening.

I get the same result with AutoML and without, and with use_amp and without.
Using multiple GPUs, it always gets stuck at the same point, at the beginning of the training process.
With only one (1) GPU, both situations work.
To clarify: I am using the TAO API, deployed in the Kubernetes cluster with setup.sh (Ansible) following all the suggested steps, re-installed from scratch, and modified to include multi-GPU.
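
For the next re-install report, a generic way to capture the full multi-GPU training log from the job pod (the pod name is a placeholder):

# identify the training pod created by the TAO API
$ kubectl get pods -n default

# stream its log and keep a copy to attach to the thread
$ kubectl logs -f <training-pod-name> | tee multigpu_train.log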