TAO5 - Detectnet_v2 - MultiGPU TAO-API Dead at train start

In the latest 5.0 notebooks, the default values in gpu-operator-values.yml are:

enable_mig: no
mig_profile: all-disabled
mig_strategy: single
nvidia_driver_version: "525.85.12"
install_driver: true

You can also find these values in the repo.

Also, could you please try another node_port?
On my side,

$ kubectl get endpoints
NAME                                            ENDPOINTS                            AGE
cluster.local-nfs-subdir-external-provisioner   <none>                               4d21h
ingress-nginx-controller                        192.168.34.82:443,192.168.34.82:80   4d21h
kubernetes                                      10.34.4.209:6443                     4d21h
tao-toolkit-api-jupyterlab-service              192.168.34.85:8888                   4d21h
tao-toolkit-api-service                         192.168.34.87:8000                   4d21h
$ kubectl get services
NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             NodePort    10.100.40.251   <none>        80:32080/TCP,443:32443/TCP   4d21h
kubernetes                           ClusterIP   10.96.0.1       <none>        443/TCP                      4d21h
tao-toolkit-api-jupyterlab-service   NodePort    10.98.32.241    <none>        8888:31952/TCP               4d21h
tao-toolkit-api-service              NodePort    10.96.101.73    <none>        8000:31951/TCP               4d21h

$ hostname -i
127.0.1.1
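
For example, a minimal sketch to pick up the NodePort of tao-toolkit-api-service and test it directly (service name as shown above; replace the NGC key placeholder with your own):

$ NODE_PORT=$(kubectl get service tao-toolkit-api-service -o jsonpath='{.spec.ports[0].nodePort}')
$ NODE_IP=$(hostname -i)
# a JSON response (rather than a connection error) means the API is reachable on this NodePort
$ curl http://${NODE_IP}:${NODE_PORT}/api/v1/login/<your_ngc_key>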

And then, I set it as below.

Same issue with this driver version.

setup.sh uninstall

Deleted all the drivers and rebooted.

setup.sh install with version 525.85.12, installed as a pod.
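
To verify that the driver really is running as a pod and reports the expected version, something like this can be checked (assuming the nvidia-gpu-operator namespace created by the setup script; adjust if yours differs):

$ kubectl get pods -n nvidia-gpu-operator | grep nvidia-driver-daemonset
# query the driver version from inside the driver daemonset pod
$ kubectl exec -n nvidia-gpu-operator <nvidia-driver-daemonset-pod> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader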

Dataset convert works:

detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/specs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.protobuf  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/logs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/logs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 06d77ab5-019f-4182-ac8d-e9dfebdb13e5
Post running
Toolkit status for 06d77ab5-019f-4182-ac8d-e9dfebdb13e5 is SUCCESS
Job Done: 06d77ab5-019f-4182-ac8d-e9dfebdb13e5 Final status: Done
detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/specs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.protobuf  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/logs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/logs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 6b3e21ed-6beb-48a1-985d-8a38890aa63d
Post running
Toolkit status for 6b3e21ed-6beb-48a1-985d-8a38890aa63d is SUCCESS
Job Done: 6b3e21ed-6beb-48a1-985d-8a38890aa63d Final status: Done

But the train does not launch a new train pod.

The request is, however, well received by the API:
172.16.1.2 - - [01/Aug/2023:07:36:45 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/a47ffabc-e471-46c6-a361-1dd34be9b001/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"

And after that, the API pod goes to Unhealthy status.

Could you please elaborate more on the above steps?
After “dataset_convert” finishes successfully, you run the “Run train” cell for AutoML training, right? Do you mean you cannot find a new pod when running “kubectl get pods”? And then you find the unhealthy info in “kubectl logs -f tao-toolkit-api-app-pod-xxxxx-xxxx”?
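
For reference, a minimal way to check both (the train pod is normally named after the job ID, and the API pod can be addressed through the label selector used by its deployment):

$ kubectl get pods                      # look for a pod named <job_id>-xxxxx
$ kubectl logs -f $(kubectl get pods -l name=tao-toolkit-api-app-pod -o name)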

It has never worked using the ingress-nginx port, not even with TAO 4.

$ curl http://127.0.1.1:32080/api/v1
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

You can run
$ curl http://127.0.1.1:32080/api/v1/login/your_ngc_key

The same as in the first post. NOW: avoiding the AutoML process, and avoiding multi-GPU.
1xGPU - Normal train - NO use_amp.

Log in from the pod → reload the datasets (previously loaded in other projects) → create the automatic specs and personalize them → run the dataset-convert process → watch how the new POD is created, read the log and verify that all the images are recognized and the TFRecords are generated → create the new “model_id” → create the automatic specs and personalize them → review that the generated file makes sense → launch the train JOB → the POD is NOT created with the train process → the API POD goes to Unhealthy and becomes unreachable.

Oh no, now I need to revive the TAO pod.

$ curl http://127.0.1.1:32080/api/v1/login/xxxxxxx
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx</center>
</body>
</html>

I don’t know how to revive the pod without uninstalling everything… I tried to drain the node, but it stays unhealthy.

How about the result of
$ kubectl describe service tao-toolkit-api-service
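
Also, instead of uninstalling everything, restarting just the API deployment (and making the node schedulable again after the drain) is usually enough. A sketch, with the deployment name inferred from the pod/ReplicaSet names in this thread:

$ kubectl uncordon <node_name>
$ kubectl rollout restart deployment/tao-toolkit-api-app-pod
$ kubectl rollout status deployment/tao-toolkit-api-app-pod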

$ kubectl describe service tao-toolkit-api-service
Name:                     tao-toolkit-api-service
Namespace:                default
Labels:                   app.kubernetes.io/managed-by=Helm
Annotations:              meta.helm.sh/release-name: tao-toolkit-api
                          meta.helm.sh/release-namespace: default
Selector:                 name=tao-toolkit-api-app-pod
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.104.52.184
IPs:                      10.104.52.184
Port:                     api  8000/TCP
TargetPort:               8000/TCP
NodePort:                 api  31951/TCP
Endpoints:                
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

And the latest describe of the API pod:

$ kubectl describe pod tao-toolkit-api-app-pod-55c5d88d86-2xbm7
Name:         tao-toolkit-api-app-pod-55c5d88d86-2xbm7
Namespace:    default
Priority:     0
Node:         azken/10.1.1.10
Start Time:   Tue, 01 Aug 2023 10:17:49 +0200
Labels:       name=tao-toolkit-api-app-pod
              pod-template-hash=55c5d88d86
Annotations:  cni.projectcalico.org/containerID: 4aac52a87cbc260611c0b3d1146e2d8f4ef30ee6c2ec948d09581207265c6fde
              cni.projectcalico.org/podIP: 192.168.35.124/32
              cni.projectcalico.org/podIPs: 192.168.35.124/32
Status:       Running
IP:           192.168.35.124
IPs:
  IP:           192.168.35.124
Controlled By:  ReplicaSet/tao-toolkit-api-app-pod-55c5d88d86
Containers:
  tao-toolkit-api-app:
    Container ID:   containerd://5a8f77c6358698d420a6db88ae57ff8d448d0f58c908acce43f6038cbf14cacc
    Image:          nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
    Image ID:       nvcr.io/nvidia/tao/tao-toolkit@sha256:45e93283d23a911477cc433ec43e233af1631e85ec0ba839e63780c30dd2d70b
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 01 Aug 2023 10:38:54 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 01 Aug 2023 10:19:05 +0200
      Finished:     Tue, 01 Aug 2023 10:36:06 +0200
    Ready:          False
    Restart Count:  3
    Liveness:       http-get http://:8000/api/v1/health/liveness delay=3s timeout=3s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8000/api/v1/health/readiness delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      NAMESPACE:            default
      CLAIMNAME:            tao-toolkit-api-pvc
      IMAGEPULLSECRET:      imagepullsecret
      AUTH_CLIENT_ID:       bnSePYullXlG-504nOZeNAXemGF6DhoCdYR8ysm088w
      NUM_GPU_PER_NODE:     1
      BACKEND:              local-k8s
      IMAGE_TF1:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
      IMAGE_PYT:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
      IMAGE_TF2:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
      IMAGE_TAO_DEPLOY:     nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy
      IMAGE_DEFAULT:        nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
      IMAGE_API:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
      IMAGE_DATA_SERVICES:  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-data-services
      PYTHONIOENCODING:     utf-8
      LC_ALL:               C.UTF-8
    Mounts:
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nw5ml (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  kube-api-access-nw5ml:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  58m (x2 over 60m)     default-scheduler  0/1 nodes are available: 1 node(s) were unschedulable.
  Normal   Scheduled         58m                   default-scheduler  Successfully assigned default/tao-toolkit-api-app-pod-55c5d88d86-2xbm7 to azken
  Normal   Pulled            57m                   kubelet            Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 8.632331063s
  Normal   Started           57m (x2 over 57m)     kubelet            Started container tao-toolkit-api-app
  Normal   Created           57m (x2 over 57m)     kubelet            Created container tao-toolkit-api-app
  Normal   Pulled            57m                   kubelet            Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 3.755364478s
  Warning  Unhealthy         57m (x6 over 57m)     kubelet            Liveness probe failed: Get "http://192.168.35.118:8000/api/v1/health/liveness": dial tcp 192.168.35.118:8000: connect: connection refused
  Warning  Unhealthy         57m (x8 over 57m)     kubelet            Readiness probe failed: Get "http://192.168.35.118:8000/api/v1/health/readiness": dial tcp 192.168.35.118:8000: connect: connection refused
  Normal   Killing           57m (x2 over 57m)     kubelet            Container tao-toolkit-api-app failed liveness probe, will be restarted
  Normal   Pulling           56m (x3 over 58m)     kubelet            Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api"
  Warning  Unhealthy         53m (x23 over 56m)    kubelet            Readiness probe failed: HTTP probe failed with statuscode: 400
  Normal   SandboxChanged    37m (x2 over 37m)     kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling           37m                   kubelet            Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api"
  Normal   Pulled            37m                   kubelet            Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 3.505803367s
  Normal   Created           37m                   kubelet            Created container tao-toolkit-api-app
  Normal   Started           37m                   kubelet            Started container tao-toolkit-api-app
  Warning  Unhealthy         36m                   kubelet            Readiness probe failed: Get "http://192.168.35.124:8000/api/v1/health/readiness": dial tcp 192.168.35.124:8000: connect: connection refused
  Warning  Unhealthy         36m                   kubelet            Liveness probe failed: Get "http://192.168.35.124:8000/api/v1/health/liveness": dial tcp 192.168.35.124:8000: connect: connection refused
  Warning  Unhealthy         115s (x229 over 35m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 400

How about
$ kubectl get endpoints

After reinstalling, I can log in using the nginx port.

But I can only: log in → try to get the dataset default specs → it dies.

After the last crash:

$ kubectl get endpoints
NAME                                            ENDPOINTS                            AGE
cluster.local-nfs-subdir-external-provisioner   <none>                               161m
ingress-nginx-controller                        192.168.35.78:443,192.168.35.78:80   161m
kubernetes                                      10.1.1.10:6443                       168m
tao-toolkit-api-jupyterlab-service              192.168.35.74:8888                   9m49s
tao-toolkit-api-service                                                              9m49s

It is not expected to get an empty endpoint for tao-toolkit-api-service.
The same unexpected result appears under Endpoints: when running $ kubectl describe service tao-toolkit-api-service.
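
The Endpoints list stays empty when no pod matching the service selector is Ready, so this usually just reflects the failing readiness probe. A quick check (selector taken from the describe output above):

$ kubectl get pods -l name=tao-toolkit-api-app-pod -o wide
# the endpoint is only populated once READY shows 1/1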

Yep… but…

This is after a TAO-API helm uninstall and install.

$ kubectl get endpoints
NAME                                            ENDPOINTS                            AGE
cluster.local-nfs-subdir-external-provisioner   <none>                               170m
ingress-nginx-controller                        192.168.35.78:443,192.168.35.78:80   170m
kubernetes                                      10.1.1.10:6443                       176m
tao-toolkit-api-jupyterlab-service              192.168.35.83:8888                   38s
tao-toolkit-api-service                         192.168.35.75:8000                   38s

How can I debug that?

Is there any way to enter the pod and watch what’s happening?
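
You can exec into the running API pod and also check the previous container’s log after a restart, for example (pod name from “kubectl get pods”):

$ kubectl exec -it tao-toolkit-api-app-pod-xxxxx-xxxx -- /bin/bash   # use /bin/sh if bash is not available in the image
$ kubectl logs --previous tao-toolkit-api-app-pod-xxxxx-xxxx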

Also, I suggest narrowing down which cell results in this behavior.
You can try to run another notebook to check whether the TAO-API works.
For example, notebooks/tao_api_starter_kit/client/data_services.ipynb

I will try to execute the entire notebook…

This notebook works well. It creates all the important pods when necessary.

I will repeat the detectnet process…

One question.
I’m seeing that in some notebooks, when you log in, the API URL points to the Kubernetes {namespace}, and in others it does not.

I’m testing and both forms are interpreted by the cluster, but which one is correct?
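
One way to see which form a notebook actually sends is to watch the API pod’s access log while logging in (same access-log format as the POST line quoted earlier):

$ kubectl logs -f $(kubectl get pods -l name=tao-toolkit-api-app-pod -o name) | grep api/v1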

WTF!!!

With the API notebook, the train process starts correctly.
So some bug is hidden in the TAO-CLIENT.

The API POST to start the job froze for about a minute, but then it started. So the api-client may have some timeout that causes the pod to die.

Note: I’m using the {namespace} in the TAO API URL.

detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/specs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ --verbose --key=tlt_encode --gpus=1  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/logs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/logs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 695e9ea9-fa1c-4349-977a-e185fcd6f4a3
$ kubectl get pods -A
NAMESPACE             NAME                                                              READY   STATUS      RESTARTS       AGE
default               695e9ea9-fa1c-4349-977a-e185fcd6f4a3-86f5x                        1/1     Running     0              56s

LOG:

INFO:tensorflow:Graph was finalized.
2023-08-02 07:06:47,473 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-02 07:06:49,919 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-02 07:06:50,496 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-02 07:06:58,853 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.
INFO:tensorflow:epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.107709154, step = 0
2023-08-02 07:07:54,302 [TAO Toolkit] [INFO] tensorflow 262: epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.107709154, step = 0
2023-08-02 07:07:54,311 [TAO Toolkit] [INFO] root 2102: None
2023-08-02 07:07:54,319 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.tfhooks.task_progress_monitor_hook 149: Epoch 0/100: loss: 0.10771 learning rate: 4.9999994e-06 Time taken: 0:00:00 ETA: 0:00:00
2023-08-02 07:07:54,319 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 1.003
INFO:tensorflow:epoch = 0.002008032128514056, learning_rate = 5.0046265e-06, loss = 0.107642695, step = 2 (8.946 sec)
2023-08-02 07:08:03,247 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.002008032128514056, learning_rate = 5.0046265e-06, loss = 0.107642695, step = 2 (8.946 sec)
INFO:tensorflow:epoch = 0.023092369477911646, learning_rate = 5.053455e-06, loss = 0.104957655, step = 23 (5.456 sec)
2023-08-02 07:08:08,703 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.023092369477911646, learning_rate = 5.053455e-06, loss = 0.104957655, step = 23 (5.456 sec)
2023-08-02 07:08:08,962 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 15.551
INFO:tensorflow:epoch = 0.04417670682730923, learning_rate = 5.1027596e-06, loss = 0.1001761, step = 44 (5.476 sec)
2023-08-02 07:08:14,178 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.04417670682730923, learning_rate = 5.1027596e-06, loss = 0.1001761, step = 44 (5.476 sec)
2023-08-02 07:08:15,494 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 91.857
INFO:tensorflow:epoch = 0.06526104417670682, learning_rate = 5.1525503e-06, loss = 0.0976929, step = 65 (5.520 sec)
2023-08-02 07:08:19,699 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.06526104417670682, learning_rate = 5.1525503e-06, loss = 0.0976929, step = 65 (5.520 sec)
2023-08-02 07:08:22,081 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 91.099

OK, glad to know it can run. I think you are running with notebooks/tao_api_starter_kit/api/object_detection.ipynb, right?

Did you install the latest tao-client? We will monitor feedback from other users, since I cannot reproduce this.