In the latest 5.0 notebooks,
the default values in gpu-operator-values.yml are
enable_mig: no
mig_profile: all-disabled
mig_strategy: single
nvidia_driver_version: "525.85.12"
install_driver: true
You can also find these values in the repo.
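If you need to override any of these defaults (for example, to pin a different driver version), one approach, just a sketch that assumes gpu-operator-values.yml sits next to setup.sh in the quickstart directory, is to edit the file before (re)running the setup script:
# Hypothetical path and placeholder version; adjust to your environment.
$ sed -i 's/^nvidia_driver_version:.*/nvidia_driver_version: "<desired_version>"/' gpu-operator-values.yml
$ bash setup.sh install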
Also, could you please try another node_port?
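If the current port conflicts with something else on the host, one way to move the API service to a different NodePort without reinstalling is a kubectl patch (a sketch; 32001 is only an example value, and the node_port used in the notebook must be updated to match):
# Replace the nodePort of the first (and only) port entry on the service.
$ kubectl patch svc tao-toolkit-api-service --type=json -p='[{"op":"replace","path":"/spec/ports/0/nodePort","value":32001}]'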
On my side,
$ kubectl get endpoints
NAME ENDPOINTS AGE
cluster.local-nfs-subdir-external-provisioner <none> 4d21h
ingress-nginx-controller 192.168.34.82:443,192.168.34.82:80 4d21h
kubernetes 10.34.4.209:6443 4d21h
tao-toolkit-api-jupyterlab-service 192.168.34.85:8888 4d21h
tao-toolkit-api-service 192.168.34.87:8000 4d21h
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller NodePort 10.100.40.251 <none> 80:32080/TCP,443:32443/TCP 4d21h
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 4d21h
tao-toolkit-api-jupyterlab-service NodePort 10.98.32.241 <none> 8888:31952/TCP 4d21h
tao-toolkit-api-service NodePort 10.96.101.73 <none> 8000:31951/TCP 4d21h
$ hostname -i
127.0.1.1
And then, I set it as below.
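Note that 127.0.1.1 is just the loopback alias from /etc/hosts; NodePort services usually listen on all host interfaces, so it works for local curl tests, but from another machine you would need the node's routable address, for example:
$ hostname -I
$ kubectl get nodes -o wide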
Same issue with this driver version.
I ran setup.sh uninstall,
deleted all the drivers and rebooted,
then ran setup.sh install with version 525.85.12 (driver installed as a pod).
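To double-check that the containerized driver came up with the expected version, something like this should work (the nvidia-gpu-operator namespace is an assumption; adjust it to wherever your GPU operator was deployed):
$ kubectl get pods -n nvidia-gpu-operator | grep nvidia-driver
# Run nvidia-smi inside the driver pod; <driver-pod-name> comes from the previous command.
$ kubectl exec -n nvidia-gpu-operator <driver-pod-name> -- nvidia-smi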
Dataset conversion works:
detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/specs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.protobuf > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/logs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/logs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 06d77ab5-019f-4182-ac8d-e9dfebdb13e5
Post running
Toolkit status for 06d77ab5-019f-4182-ac8d-e9dfebdb13e5 is SUCCESS
Job Done: 06d77ab5-019f-4182-ac8d-e9dfebdb13e5 Final status: Done
detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/specs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.protobuf > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/logs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/logs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 6b3e21ed-6beb-48a1-985d-8a38890aa63d
Post running
Toolkit status for 6b3e21ed-6beb-48a1-985d-8a38890aa63d is SUCCESS
Job Done: 6b3e21ed-6beb-48a1-985d-8a38890aa63d Final status: Done
But the train job does not launch a new training pod,
even though the request is received by the API:
172.16.1.2 - - [01/Aug/2023:07:36:45 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/a47ffabc-e471-46c6-a361-1dd34be9b001/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
And after that, the API pod goes into the Unhealthy status.
Could you please elaborate more on the above steps?
After "dataset_convert" succeeds, you run the "Run train" cell for AutoML training, right? Do you mean you cannot find a new pod when running "kubectl get pods"? And then you find the unhealthy info in "kubectl logs -f tao-toolkit-api-app-pod-xxxxx-xxxx"?
It never works using the ingress-nginx port, not even with TAO 4.
$ curl http://127.0.1.1:32080/api/v1
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
You can run
$ curl http://127.0.1.1:32080/api/v1/login/your_ngc_key
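Adding -i shows the HTTP status code as well, which helps distinguish a routing problem (a 404 from nginx, as above) from an authentication problem:
$ curl -i http://127.0.1.1:32080/api/v1/login/your_ngc_key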
The same as in the first post. NOW: avoiding the AutoML process, and avoiding multi-GPU.
1x GPU - normal train - no use_amp.
Log in → reload the datasets (previously loaded in other projects) → create the automatic specs and personalize them → run the dataset-convert process → watch the new pod get created, read the log and verify that all the images are recognized and the TFRecords are generated → create the new "model_id" → create the automatic specs and personalize them → review that the generated file makes sense → launch the train job → # The pod is NOT created with the train process → the API pod goes Unhealthy and unreachable.
Oh f**k, I need to revive the TAO pod.
$ curl http://127.0.1.1:32080/api/v1/login/xxxxxxx
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx</center>
</body>
</html>
I don’t know how to revive the pod without uninstalling everything… I tried draining the node, but it is still unhealthy.
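For reference, a gentler way to try reviving it (a sketch; the deployment name is inferred from the ReplicaSet name in the describe output below): uncordon the node if it is still cordoned from the drain, then let the Deployment recreate the pod.
$ kubectl uncordon <node-name>
# Deleting the pod is enough; its ReplicaSet will recreate it.
$ kubectl delete pod -l name=tao-toolkit-api-app-pod
# Or restart the whole deployment:
$ kubectl rollout restart deployment tao-toolkit-api-app-pod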
How about the result of
$ kubectl describe service tao-toolkit-api-service
$ kubectl describe service tao-toolkit-api-service
Name: tao-toolkit-api-service
Namespace: default
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: tao-toolkit-api
meta.helm.sh/release-namespace: default
Selector: name=tao-toolkit-api-app-pod
Type: NodePort
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.104.52.184
IPs: 10.104.52.184
Port: api 8000/TCP
TargetPort: 8000/TCP
NodePort: api 31951/TCP
Endpoints:
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
And here is the latest describe of the API pod:
$ kubectl describe pod tao-toolkit-api-app-pod-55c5d88d86-2xbm7
Name: tao-toolkit-api-app-pod-55c5d88d86-2xbm7
Namespace: default
Priority: 0
Node: azken/10.1.1.10
Start Time: Tue, 01 Aug 2023 10:17:49 +0200
Labels: name=tao-toolkit-api-app-pod
pod-template-hash=55c5d88d86
Annotations: cni.projectcalico.org/containerID: 4aac52a87cbc260611c0b3d1146e2d8f4ef30ee6c2ec948d09581207265c6fde
cni.projectcalico.org/podIP: 192.168.35.124/32
cni.projectcalico.org/podIPs: 192.168.35.124/32
Status: Running
IP: 192.168.35.124
IPs:
IP: 192.168.35.124
Controlled By: ReplicaSet/tao-toolkit-api-app-pod-55c5d88d86
Containers:
tao-toolkit-api-app:
Container ID: containerd://5a8f77c6358698d420a6db88ae57ff8d448d0f58c908acce43f6038cbf14cacc
Image: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
Image ID: nvcr.io/nvidia/tao/tao-toolkit@sha256:45e93283d23a911477cc433ec43e233af1631e85ec0ba839e63780c30dd2d70b
Port: 8000/TCP
Host Port: 0/TCP
State: Running
Started: Tue, 01 Aug 2023 10:38:54 +0200
Last State: Terminated
Reason: Unknown
Exit Code: 255
Started: Tue, 01 Aug 2023 10:19:05 +0200
Finished: Tue, 01 Aug 2023 10:36:06 +0200
Ready: False
Restart Count: 3
Liveness: http-get http://:8000/api/v1/health/liveness delay=3s timeout=3s period=10s #success=1 #failure=3
Readiness: http-get http://:8000/api/v1/health/readiness delay=3s timeout=3s period=10s #success=1 #failure=3
Environment:
NAMESPACE: default
CLAIMNAME: tao-toolkit-api-pvc
IMAGEPULLSECRET: imagepullsecret
AUTH_CLIENT_ID: bnSePYullXlG-504nOZeNAXemGF6DhoCdYR8ysm088w
NUM_GPU_PER_NODE: 1
BACKEND: local-k8s
IMAGE_TF1: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
IMAGE_PYT: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
IMAGE_TF2: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
IMAGE_TAO_DEPLOY: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy
IMAGE_DEFAULT: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
IMAGE_API: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
IMAGE_DATA_SERVICES: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-data-services
PYTHONIOENCODING: utf-8
LC_ALL: C.UTF-8
Mounts:
/shared from shared-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nw5ml (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
shared-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: tao-toolkit-api-pvc
ReadOnly: false
kube-api-access-nw5ml:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 58m (x2 over 60m) default-scheduler 0/1 nodes are available: 1 node(s) were unschedulable.
Normal Scheduled 58m default-scheduler Successfully assigned default/tao-toolkit-api-app-pod-55c5d88d86-2xbm7 to azken
Normal Pulled 57m kubelet Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 8.632331063s
Normal Started 57m (x2 over 57m) kubelet Started container tao-toolkit-api-app
Normal Created 57m (x2 over 57m) kubelet Created container tao-toolkit-api-app
Normal Pulled 57m kubelet Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 3.755364478s
Warning Unhealthy 57m (x6 over 57m) kubelet Liveness probe failed: Get "http://192.168.35.118:8000/api/v1/health/liveness": dial tcp 192.168.35.118:8000: connect: connection refused
Warning Unhealthy 57m (x8 over 57m) kubelet Readiness probe failed: Get "http://192.168.35.118:8000/api/v1/health/readiness": dial tcp 192.168.35.118:8000: connect: connection refused
Normal Killing 57m (x2 over 57m) kubelet Container tao-toolkit-api-app failed liveness probe, will be restarted
Normal Pulling 56m (x3 over 58m) kubelet Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api"
Warning Unhealthy 53m (x23 over 56m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 400
Normal SandboxChanged 37m (x2 over 37m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulling 37m kubelet Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api"
Normal Pulled 37m kubelet Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 3.505803367s
Normal Created 37m kubelet Created container tao-toolkit-api-app
Normal Started 37m kubelet Started container tao-toolkit-api-app
Warning Unhealthy 36m kubelet Readiness probe failed: Get "http://192.168.35.124:8000/api/v1/health/readiness": dial tcp 192.168.35.124:8000: connect: connection refused
Warning Unhealthy 36m kubelet Liveness probe failed: Get "http://192.168.35.124:8000/api/v1/health/liveness": dial tcp 192.168.35.124:8000: connect: connection refused
Warning Unhealthy 115s (x229 over 35m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 400
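The 400 from the readiness probe means the container is up and answering HTTP but rejecting the health check; one way to see the actual response (a sketch, run from the host) is to port-forward the pod and hit the probe endpoints directly:
$ kubectl port-forward pod/tao-toolkit-api-app-pod-55c5d88d86-2xbm7 8000:8000
# In another terminal:
$ curl -i http://localhost:8000/api/v1/health/readiness
$ curl -i http://localhost:8000/api/v1/health/liveness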
How about
$ kubectl get endpoints
After reinstalling, I can log in using the nginx port.
But I can only log in → try to get the dataset default specs → it dies.
After the last death:
$ kubectl get endpoints
NAME ENDPOINTS AGE
cluster.local-nfs-subdir-external-provisioner <none> 161m
ingress-nginx-controller 192.168.35.78:443,192.168.35.78:80 161m
kubernetes 10.1.1.10:6443 168m
tao-toolkit-api-jupyterlab-service 192.168.35.74:8888 9m49s
tao-toolkit-api-service 9m49s
It is not expected to get an empty endpoint for tao-toolkit-api-service.
The same unexpected result appears in Endpoints:
when running $ kubectl describe service tao-toolkit-api-service.
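An empty Endpoints list usually means either that no pod matches the service selector (name=tao-toolkit-api-app-pod, per the describe output above) or that the matching pod is not Ready, in which case its address sits under notReadyAddresses in the Endpoints object. Two quick checks:
$ kubectl get pods -l name=tao-toolkit-api-app-pod -o wide
$ kubectl get endpoints tao-toolkit-api-service -o yaml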
Yep… but…
This is after a TAO-API helm uninstall and install.
$ kubectl get endpoints
NAME ENDPOINTS AGE
cluster.local-nfs-subdir-external-provisioner <none> 170m
ingress-nginx-controller 192.168.35.78:443,192.168.35.78:80 170m
kubernetes 10.1.1.10:6443 176m
tao-toolkit-api-jupyterlab-service 192.168.35.83:8888 38s
tao-toolkit-api-service 192.168.35.75:8000 38s
How can I debug that?
Is there any way to get into the pod and watch what's happening?
I suggest narrowing down which cell results in this behavior.
You can try to run another notebook to check if TAO-API works.
For example, notebooks/tao_api_starter_kit/client/data_services.ipynb
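To look inside the pod itself, the usual ways are to follow its log and to open a shell in it (pod name taken from the earlier describe output; substitute the current pod name, and a shell in the image is assumed):
$ kubectl logs -f tao-toolkit-api-app-pod-55c5d88d86-2xbm7
$ kubectl exec -it tao-toolkit-api-app-pod-55c5d88d86-2xbm7 -- /bin/bash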
I will try to execute the entire notebook…
That notebook works well. It creates all the important pods when necessary.
I will repeat the detectnet process…
One question.
I have noticed that in some notebooks, when you do the login, the API URL points to the Kubernetes {namespace}, and in others it does not.
I have tested both and both forms are accepted by the cluster. But which is the correct one?
WTF!!!
With the API notebook, the training process starts correctly.
So some bug is hidden in the TAO-CLIENT.
The API POST to start the job froze for about a minute, but then it started. So the api-client may have a timeout that causes the pod to die.
Note: I’m using the {namespace} in the TAO API url.
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/specs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ --verbose --key=tlt_encode --gpus=1 > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/logs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/logs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 695e9ea9-fa1c-4349-977a-e185fcd6f4a3
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default 695e9ea9-fa1c-4349-977a-e185fcd6f4a3-86f5x 1/1 Running 0 56s
LOG:
INFO:tensorflow:Graph was finalized.
2023-08-02 07:06:47,473 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-02 07:06:49,919 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-02 07:06:50,496 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-02 07:06:58,853 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.
INFO:tensorflow:epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.107709154, step = 0
2023-08-02 07:07:54,302 [TAO Toolkit] [INFO] tensorflow 262: epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.107709154, step = 0
2023-08-02 07:07:54,311 [TAO Toolkit] [INFO] root 2102: None
2023-08-02 07:07:54,319 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.tfhooks.task_progress_monitor_hook 149: Epoch 0/100: loss: 0.10771 learning rate: 4.9999994e-06 Time taken: 0:00:00 ETA: 0:00:00
2023-08-02 07:07:54,319 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 1.003
INFO:tensorflow:epoch = 0.002008032128514056, learning_rate = 5.0046265e-06, loss = 0.107642695, step = 2 (8.946 sec)
2023-08-02 07:08:03,247 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.002008032128514056, learning_rate = 5.0046265e-06, loss = 0.107642695, step = 2 (8.946 sec)
INFO:tensorflow:epoch = 0.023092369477911646, learning_rate = 5.053455e-06, loss = 0.104957655, step = 23 (5.456 sec)
2023-08-02 07:08:08,703 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.023092369477911646, learning_rate = 5.053455e-06, loss = 0.104957655, step = 23 (5.456 sec)
2023-08-02 07:08:08,962 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 15.551
INFO:tensorflow:epoch = 0.04417670682730923, learning_rate = 5.1027596e-06, loss = 0.1001761, step = 44 (5.476 sec)
2023-08-02 07:08:14,178 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.04417670682730923, learning_rate = 5.1027596e-06, loss = 0.1001761, step = 44 (5.476 sec)
2023-08-02 07:08:15,494 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 91.857
INFO:tensorflow:epoch = 0.06526104417670682, learning_rate = 5.1525503e-06, loss = 0.0976929, step = 65 (5.520 sec)
2023-08-02 07:08:19,699 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.06526104417670682, learning_rate = 5.1525503e-06, loss = 0.0976929, step = 65 (5.520 sec)
2023-08-02 07:08:22,081 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 91.099
OK, glad to know it can run. I think you are running notebooks/tao_api_starter_kit/api/object_detection.ipynb, right?
Did you install the latest tao-client? I will monitor feedback from other users since I cannot reproduce this.