TAO API (Kubernetes pod) troubleshooting: TAO API jobs stuck in "Pending" state indefinitely


Please provide the following information when requesting support.

• Hardware: DGX Station A100 with K8s
• Network Type: SSD
• TLT Version: 4.0.0 | kubectl version: v1.24.14

I created a K8s cluster with the Helm chart (one master node on a CPU server, DGX Station as the GPU worker with gpu-operator) and followed the instructions in the ssd notebook (with TAO API) up to model creation and retraining.

Optionally, I've used a local ClearML server for visualisation.

Everything runs well for a while, but eventually (maybe after 2 runs) jobs get stuck in the Pending state. I've tried creating new models, but the problem doesn't go away.

I can set a task from Pending to Stopped, but then the status goes to Error (instead of Stopped). I can then delete the task, but after that all new tasks get stuck in the Pending state.
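For reference, the cancel/delete calls look roughly like this (a minimal sketch; the /cancel path and the DELETE verb follow the getting-started notebook, so treat them as assumptions; base_url, model_ID, headers and rootca are set up earlier in the notebook, as in the listing snippet below):

import requests

job_ID = "..."  # id of the stuck job

# ask the API to cancel the job (in my case it then shows up as Error rather than Stopped)
endpoint = f"{base_url}/model/{model_ID}/job/{job_ID}/cancel"
response = requests.post(endpoint, headers=headers, verify=rootca)
print(response.status_code)

# delete the job record afterwards
endpoint = f"{base_url}/model/{model_ID}/job/{job_ID}"
response = requests.delete(endpoint, headers=headers, verify=rootca)
print(response.status_code)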

Is there a way to check what is happening? (The `kubectl logs` command doesn't give enough info.)

Right now the solution is to remove and redeploy the Helm charts. Is this a known problem? Is there a workaround?

# list the current jobs for this model and print a short status table
# base_url, model_ID, headers and rootca are defined earlier in the notebook
import json

import requests
from tabulate import tabulate

endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.get(endpoint, headers=headers, verify=rootca)
# print(response)

for job in response.json():
    if "detailed_status" in job['result']:
        # print(json.dumps(job, sort_keys=True, indent=4))
        table_headers = ["job id", "action", "status", "result", "epoch", "t_epoch", "message", "date"]
        data = [(job['id'],
                 job['action'],
                 job['status'],
                 job['result']['detailed_status']['status'],
                 f"{job['result']['epoch']}/{job['result']['max_epoch']}",
                 job['result']['time_per_epoch'],
                 job['result']['detailed_status']['message'],
                 f"[{job['result']['detailed_status']['date']}][{job['result']['detailed_status']['time']}]")]
    else:
        table_headers = ["job id", "action", "status"]
        data = [(job['id'], job['action'], job['status'])]
    print(tabulate(data, headers=table_headers, tablefmt="grid"))

output:


+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+-----------------------+
| job id                               | action   | status   | result   | epoch   | t_epoch        | message                         | date                  |
+======================================+==========+==========+==========+=========+================+=================================+=======================+
| 675518bf-86eb-47ee-ba2e-b1088f3ba08e | train    | Done     | SUCCESS  | 6/6     | 0:02:05.933211 | Training finished successfully. | [6/12/2023][12:45:55] |
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+-----------------------+
+--------------------------------------+----------+----------+
| job id                               | action   | status   |
+======================================+==========+==========+
| 6aa7b64e-7b7d-4eaf-b036-9b42365846d7 | train    | Pending  |
+--------------------------------------+----------+----------+
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+----------------------+
| job id                               | action   | status   | result   | epoch   | t_epoch        | message                         | date                 |
+======================================+==========+==========+==========+=========+================+=================================+======================+
| 2b827b58-cc53-4141-a6fb-50dc4787b784 | train    | Done     | SUCCESS  | 6/6     | 0:02:05.295144 | Training finished successfully. | [5/4/2023][17:47:29] |
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+----------------------+
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+-----------------------+
| job id                               | action   | status   | result   | epoch   | t_epoch        | message                         | date                  |
+======================================+==========+==========+==========+=========+================+=================================+=======================+
| 310a19dc-1925-4cf0-bd4f-ad9dae50efb3 | train    | Done     | SUCCESS  | 20/20   | 0:02:13.025302 | Training finished successfully. | [6/13/2023][13:51:41] |
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+-----------------------+
+--------------------------------------+----------+----------+
| job id                               | action   | status   |
+======================================+==========+==========+
| ee58cad8-7cc6-47ae-8c61-c38ba08b93f4 | train    | Pending  |
+--------------------------------------+----------+----------+

After beginning from a new model, everything is now in the Pending state:

+--------------------------------------+----------+----------+
| job id                               | action   | status   |
+======================================+==========+==========+
| 109422cc-c8df-4231-be36-17be63116e74 | train    | Pending  |
+--------------------------------------+----------+----------+
+--------------------------------------+----------+----------+
| job id                               | action   | status   |
+======================================+==========+==========+
| 5a2c225a-1923-4ecb-895a-999e6e1e2d29 | train    | Pending  |
+--------------------------------------+----------+----------+
+--------------------------------------+----------+----------+
| job id                               | action   | status   |
+======================================+==========+==========+
| 0c4f97c9-93f0-422e-9ebb-837f5119f6d1 | train    | Pending  |
+--------------------------------------+----------+----------+
+--------------------------------------+----------+----------+
| job id                               | action   | status   |
+======================================+==========+==========+
| 971fe00a-5110-4d91-b2a6-5fb4e550b179 | train    | Error    |
+--------------------------------------+----------+----------+

I double-checked and made sure that nothing GPU-intensive (such as training) is running in the background:

Tue Jun 13 19:22:30 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   31C    P0    51W / 275W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    52W / 275W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   31C    P0    53W / 275W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA DGX Display  On   | 00000000:C1:00.0 Off |                  N/A |
| 34%   39C    P8    N/A /  50W |      6MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:C2:00.0 Off |                    0 |
| N/A   31C    P0    51W / 275W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      6099      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      6099      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      6099      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      6099      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      6099      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
g@dgx:~$ 

No

Cheers

Did you use the latest 4.0.2 notebook?

For logs, how about
kubectl logs -f tao-toolkit-api-workflow-pod-xxxxxxxx-yyyy

I’ve used the helm chart

apiVersion: v1
appVersion: 4.0.2
description: TAO Toolkit API
name: tao-toolkit-api
version: 4.0.2

my values.yaml

# TAO Toolkit API container info
image: nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api
imagePullSecret: imagepullsecret
imagePullPolicy: Always 

# Optional proxy settings
#httpsProxy: http://10.194.54.59:3128
#myCertConfig: my-cert

# Optional HTTPS settings for ingress controller
#host: mydomain.com
#tlsSecret: tls-secret
#corsOrigin: https://mydomain.com

host: aisrv.gnet.lan
tlsSecret: tao-aisrv-gnet-secret


# Shared storage info
#storageClassName: nfs-client
storageClassName: local-storage-dgx
storageAccessMode: ReadWriteMany
storageSize: 200Gi
ephemeral-storage: 8Gi
limits.ephemeral-storage: 50Gi
requests.ephemeral-storage: 4Gi

# Optional NVIDIA Starfleet authentication
#authClientId: bnSePYullXlG-50.....

# Starting TAO Toolkit jobs info
backend: local-k8s
numGpus: 4
imageTf: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
imagePyt: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
imageDnv2: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
imageDefault: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5

# To opt out of providing anonymous telemetry data to NVIDIA
#telemetryOptOut: no

# Optional MLOPS setting for Weights And Biases
#wandbApiKey: cf23df2207d9...

# Optional MLOPS setting for ClearML
clearMlWebHost: http://clearml.gnet.lan:30080
clearMlApiHost: http://clearml.gnet.lan:30008
clearMlFilesHost: http://clearml.gnet.lan:30081
clearMlApiAccessKey: PHJ..
clearMlApiSecretKey: uojoj....

Here is another interesting note (maybe not relevant to this case): I can't use the nfs-client storage class with my NFS share, even though all our other Helm charts use it without a problem. I've tried changing kubectl versions, but the liveness probe passes while the readiness probe for the toolkit app fails; somehow, when I use local storage it magically works. The DGX and the NFS server are in two subnets, but there are pods on the DGX from other Helm charts that use the NFS fine with no issues. I've also installed the charts in different orders and in isolation (e.g. installing the TAO toolkit first) with no success.

I haven't used a 4.0.2 notebook per se, but this is the general format I've been using. (The datasets were created and serialised with the 4.0.0 version; basically I mounted the old volume to the new Helm chart, but the models are from the new 4.0.2 chart.)

Here is a sample notebook for testing the code.
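In outline, the submit-and-poll pattern in it is roughly the following (a condensed sketch; the body fields "job"/"actions" and the "Running" status string are assumptions based on the 4.0.x notebooks; base_url, model_ID, headers and rootca come from the notebook setup as in the listing snippet above):

import json
import time

import requests

endpoint = f"{base_url}/model/{model_ID}/job"

# request a train action (no parent job)
data = json.dumps({"job": None, "actions": ["train"]})
response = requests.post(endpoint, data=data, headers=headers, verify=rootca)
print(response.json())  # newly created job id(s)

# poll the job list until nothing is left Pending/Running
# (this is where it hangs forever once a job is stuck in Pending)
while any(job["status"] in ("Pending", "Running")
          for job in requests.get(endpoint, headers=headers, verify=rootca).json()):
    time.sleep(15)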

The workflow pod produces no real debugging output (unless there is a way to change the verbosity level that I am unaware of).

 kubectl logs -n tao-gnet tao-toolkit-api-workflow-pod-58bc86fc9-ngxz2 -f

output

NGC CLI 3.19.0
                                                    

The app pod, however, has some logs, but they don't seem to say much; below is the command and output.

command

 kubectl logs -n tao-gnet tao-toolkit-api-app-pod-6bf85c898-6qvfk  -f

some of the output (I didn't include all the probe requests and repetitions)


...
URL: https://aisrv.gnet.lan:30904/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/model/0c671979-5f80-4d61-96ed-ae0847a37e68/job
Token: ...H59cto0CMI
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.152 - - [14/Jun/2023:09:17:38 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.152 - - [14/Jun/2023:09:17:38 +0000] "POST /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/model/0c671979-5f80-4d61-96ed-ae0847a37e68/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
172.16.3.2 - - [14/Jun/2023:09:17:39 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:17:39 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:30904/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/model/0c671979-5f80-4d61-96ed-ae0847a37e68/job/9d53bf6b-11cf-4ad1-8995-fb0769a83032
Token: ...H59cto0CMI
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.152 - - [14/Jun/2023:09:17:47 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.152 - - [14/Jun/2023:09:17:47 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/model/0c671979-5f80-4d61-96ed-ae0847a37e68/job/9d53bf6b-11cf-4ad1-8995-fb0769a83032 HTTP/1.1" 200 270 "-" "python-requests/2.28.2"
172.16.3.2 - - [14/Jun/2023:09:17:49 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:17:49 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:30904/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/model/0c671979-5f80-4d61-96ed-ae0847a37e68/job/9d53bf6b-11cf-4ad1-8995-fb0769a83032
Token: ...H59cto0CMI
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.152 - - [14/Jun/2023:09:17:50 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.152 - - [14/Jun/2023:09:17:51 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/model/0c671979-5f80-4d61-96ed-ae0847a37e68/job/9d53bf6b-11cf-4ad1-8995-fb0769a83032 HTTP/1.1" 200 270 "-" "python-requests/2.28.2"
172.16.3.2 - - [14/Jun/2023:09:17:59 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:17:59 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:30904/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/model/0c671979-5f80-4d61-96ed-ae0847a37e68/job
Token: ...H59cto0CMI
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.152 - - [14/Jun/2023:09:17:59 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.152 - - [14/Jun/2023:09:17:59 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/model/0c671979-5f80-4d61-96ed-ae0847a37e68/job HTTP/1.1" 200 1085 "-" "python-requests/2.28.2"
172.16.3.2 - - [14/Jun/2023:09:18:09 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:09 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:19 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:19 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:29 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:29 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:39 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:39 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:49 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:49 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:59 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [14/Jun/2023:09:18:59 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"

...

I also re-ran ssd.ipynb to demonstrate the jobs getting stuck further; it seems that once it happens, the only solution is reinstalling from scratch. I wonder if there is a way to get more debugging information about this Pending state and to reset it without having to reinstall the Helm charts.

I think you can delete the unnecessary pod via kubectl delete pod xx after checking with $ kubectl get pods, and then run the cell again.
If it is still pending, please check the log in tao-toolkit-api-workflow-pod.

What do you mean by “unnecessary pod”?

I've already tried deleting the toolkit and api-workflow pods yesterday, hoping that would fix it (as you can see, they are only 24 hrs old).

g@gsrv:~$ kubectl get pods -n tao-gnet
NAME                                           READY   STATUS      RESTARTS   AGE
310a19dc-1925-4cf0-bd4f-ad9dae50efb3-ctqp7     0/1     Completed   0          27h
ingress-nginx-controller-5cdbcc9966-b4nnj      1/1     Running     0          3d22h
tao-toolkit-api-app-pod-6bf85c898-6qvfk        1/1     Running     0          24h
tao-toolkit-api-workflow-pod-58bc86fc9-ngxz2   1/1     Running     0          24h

It is the pod which results in the “pending” status.
OK, from your screenshot above, there is no other pod now.
So, could you trigger it again to check whether it is still pending?
If yes, please check the log for this new pod and also for tao-toolkit-api-workflow-pod.

No, the pod is never in Pending (if it were, I'd treat it as a K8s problem before narrowing it down to a TAO-chart-related issue, tbh); it's just the task status that gets stuck in Pending, for example training, or in the second case, dataset conversion. The API is responsive (as we can see in the gists I've attached above) and is not throwing any errors in the logs, as I've shown previously. (I'm not sure if there is a way to invoke more logging.)

To be clear, if I interrogate the task list for a model, the tasks are still stuck in the Pending state while the pod is running. (The pods are scheduled on the DGX Station A100 node, so I believe they have more than enough memory and CPU; I've been keeping an eye on the resources nevertheless.)

In the fresh notebook, the task has been stuck for 90+ minutes (there was no change in the pod running state).
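For reference, this is roughly how I'm watching it (a small poll-with-timeout sketch reusing the single-job endpoint visible in the app-pod log; the "Running" status string is an assumption; base_url, model_ID, headers and rootca come from the notebook setup):

import time

import requests

def wait_for_job(job_id, timeout_s=90 * 60, poll_s=30):
    """Poll one job until it leaves Pending/Running, or give up after timeout_s seconds."""
    endpoint = f"{base_url}/model/{model_ID}/job/{job_id}"
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(endpoint, headers=headers, verify=rootca).json()["status"]
        if status not in ("Pending", "Running"):
            return status
        time.sleep(poll_s)
    return "Pending"  # still stuck after the timeout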

Is there anything I can do by opening a shell in the api pod or the workflow pod? For example, using:

kubectl exec -it --namespace tao-gnet tao-toolkit-api-app-pod-6bf85c898-6qvfk  -- /bin/bash

Could you share the latest result of
$ kubectl get pods

More, is it possible to open another notebook, for example retinanet.ipynb, to check whether there is the same “pending” issue?

Sure,

g@gsrv:~/Workspace/sandbox/TAO3$ kubectl get pods -A 
NAMESPACE              NAME                                                              READY   STATUS      RESTARTS        AGE
calico-apiserver       calico-apiserver-64bf4f7f44-bnfk8                                 1/1     Running     2 (3d4h ago)    4d23h
calico-apiserver       calico-apiserver-64bf4f7f44-m8mc2                                 1/1     Running     2 (3d4h ago)    4d23h
calico-system          calico-kube-controllers-55f86d5fdf-4jfk6                          1/1     Running     2 (3d4h ago)    4d23h
calico-system          calico-node-59pj2                                                 1/1     Running     0               4d23h
calico-system          calico-node-h8w6v                                                 1/1     Running     2 (3d4h ago)    4d23h
calico-system          calico-typha-5b58978b9f-cqrnd                                     1/1     Running     3 (3d4h ago)    4d23h
calico-system          csi-node-driver-8r2b8                                             2/2     Running     0               4d23h
calico-system          csi-node-driver-xkcx4                                             2/2     Running     4 (3d4h ago)    4d23h
clearml                clearml-apiserver-76ff97d7f7-q4qbn                                1/1     Running     0               4d19h
clearml                clearml-elastic-master-0                                          1/1     Running     0               4d19h
clearml                clearml-fileserver-ff756c4b8-8f4nd                                1/1     Running     0               4d19h
clearml                clearml-mongodb-5f9468969b-zgbxf                                  1/1     Running     2 (3d4h ago)    4d19h
clearml                clearml-redis-master-0                                            1/1     Running     0               4d19h
clearml                clearml-webserver-7f5fb5df5d-g5kxm                                1/1     Running     0               4d19h
gpu-operator           gpu-feature-discovery-qddx5                                       1/1     Running     0               4d23h
gpu-operator           gpu-operator-1686418997-node-feature-discovery-master-5779828t8   1/1     Running     2 (3d4h ago)    4d23h
gpu-operator           gpu-operator-1686418997-node-feature-discovery-worker-jhpjz       1/1     Running     6 (3d4h ago)    4d23h
gpu-operator           gpu-operator-6dff6b976c-5nsxq                                     1/1     Running     10 (3d4h ago)   4d23h
gpu-operator           nvidia-cuda-validator-q6msm                                       0/1     Completed   0               4d23h
gpu-operator           nvidia-dcgm-exporter-x5hn7                                        1/1     Running     0               4d23h
gpu-operator           nvidia-device-plugin-daemonset-r9gmq                              1/1     Running     0               4d23h
gpu-operator           nvidia-device-plugin-validator-l6p9q                              0/1     Completed   0               4d23h
gpu-operator           nvidia-mig-manager-fvn2x                                          1/1     Running     0               4d23h
gpu-operator           nvidia-operator-validator-vdk78                                   1/1     Running     0               4d23h
k8-storage             nfs-subdir-external-provisioner-5669cc5b6-jzhrj                   1/1     Running     0               6h19m
kube-system            coredns-57575c5f89-bfclk                                          1/1     Running     2 (3d4h ago)    4d23h
kube-system            coredns-57575c5f89-vnm8v                                          1/1     Running     2 (3d4h ago)    4d23h
kube-system            etcd-gsrv                                                         1/1     Running     2 (3d4h ago)    4d23h
kube-system            kube-apiserver-gsrv                                               1/1     Running     2 (3d4h ago)    4d23h
kube-system            kube-controller-manager-gsrv                                      1/1     Running     2 (3d4h ago)    4d23h
kube-system            kube-proxy-65t82                                                  1/1     Running     0               4d23h
kube-system            kube-proxy-n8ncf                                                  1/1     Running     2 (3d4h ago)    4d23h
kube-system            kube-scheduler-gsrv                                               1/1     Running     2 (3d4h ago)    4d23h
kubernetes-dashboard   kubernetes-dashboard-64bd57954b-n4fbx                             1/1     Running     4 (3d4h ago)    4d20h
tao-gnet               ingress-nginx-admission-create-666nh                              0/1     Completed   0               86s
tao-gnet               ingress-nginx-admission-patch-79wkc                               0/1     Completed   0               81s
tao-gnet               ingress-nginx-controller-5cdbcc9966-7bqvh                         1/1     Running     0               81s
tao-gnet               tao-toolkit-api-app-pod-6bf85c898-pmt56                           1/1     Running     0               28s
tao-gnet               tao-toolkit-api-workflow-pod-58bc86fc9-kvfbv                      1/1     Running     0               28s
tigera-operator        tigera-operator-649dd7bf97-k2tz7                                  1/1     Running     2 (3d4h ago)    4d23h

This is a two-node cluster:
master: Fujitsu Primergy
worker: DGX Station A100

I will include the output of ‘get services’ here as well (in case it is helpful).

g@gsrv:~/Workspace/sandbox/TAO3$ kubectl get svc -A 
NAMESPACE              NAME                                                    TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
calico-apiserver       calico-api                                              ClusterIP      10.103.34.102    <none>        443/TCP                      4d23h
calico-system          calico-kube-controllers-metrics                         ClusterIP      None             <none>        9094/TCP                     4d23h
calico-system          calico-typha                                            ClusterIP      10.109.64.46     <none>        5473/TCP                     4d23h
clearml                clearml-apiserver                                       NodePort       10.96.186.212    <none>        8008:30008/TCP               4d19h
clearml                clearml-elastic-master                                  ClusterIP      10.101.153.101   <none>        9200/TCP,9300/TCP            4d19h
clearml                clearml-elastic-master-headless                         ClusterIP      None             <none>        9200/TCP,9300/TCP            4d19h
clearml                clearml-fileserver                                      NodePort       10.101.179.43    <none>        8081:30081/TCP               4d19h
clearml                clearml-mongodb                                         ClusterIP      10.109.28.109    <none>        27017/TCP                    4d19h
clearml                clearml-redis-headless                                  ClusterIP      None             <none>        6379/TCP                     4d19h
clearml                clearml-redis-master                                    ClusterIP      10.98.86.172     <none>        6379/TCP                     4d19h
clearml                clearml-webserver                                       NodePort       10.103.248.172   <none>        8080:30080/TCP               4d19h
default                kubernetes                                              ClusterIP      10.96.0.1        <none>        443/TCP                      4d23h
gpu-operator           gpu-operator                                            ClusterIP      10.97.172.237    <none>        8080/TCP                     4d23h
gpu-operator           gpu-operator-1686418997-node-feature-discovery-master   ClusterIP      10.99.197.123    <none>        8080/TCP                     4d23h
gpu-operator           nvidia-dcgm-exporter                                    ClusterIP      10.105.229.228   <none>        9400/TCP                     4d23h
kube-system            kube-dns                                                ClusterIP      10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP       4d23h
kubernetes-dashboard   kubernetes-dashboard                                    NodePort       10.110.122.192   <none>        443:30001/TCP                4d20h
tao-gnet               ingress-nginx-controller                                LoadBalancer   10.107.48.176    <pending>     80:30027/TCP,443:32091/TCP   3m42s
tao-gnet               ingress-nginx-controller-admission                      ClusterIP      10.109.239.201   <none>        443/TCP                      3m42s
tao-gnet               tao-toolkit-api-service                                 ClusterIP      10.103.158.223   <none>        8000/TCP                     2m49s

Thanks a lot for looking into this!
Cheers,
Ganindu.

I recall that you could run the TAO API successfully with your existing environment several months ago, so I still need to understand more about the new problem you are meeting now. Why did it work previously but not currently? In other words, how can I reproduce this issue, or what was done on your side before hitting this issue in ssd.ipynb?
To narrow down, is it possible to open another notebook, for example retinanet.ipynb, to check whether there is the same “pending” issue?

Update: It is a random issue with the old chart.
Please use the latest 4.0.2 tao-api.tgz.

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.2.tgz -C tao-toolkit-api

Thanks a lot!! I will try this. Is there a Kubernetes version that is recommended for this?

Hi,
It worked until I came to the dataset convert action for the eval dataset, then it went back to the “pending” state (I have attached the retinanet notebook).

Cheers,
Ganindu.

retinanet.ipynb (63 KB)

Sorry to hear that there is still a “pending” state.
Could you please share the logs for the “pending” pod 690f86dc-5e77-44c1-8612-ecffa64dea06?
Other logs are also appreciated, such as:
$ kubectl logs -f tao-toolkit-api-workflow-xxx
$ kubectl logs -f tao-toolkit-api-app-pod-xxx
$ kubectl describe pod 690f86dc-5e77-44c1-8612-ecffa64dea06
$ kubectl describe pod tao-toolkit-api-workflow-xxx
$ kubectl describe pod tao-toolkit-api-app-pod-xxx

Thanks.

Thanks a lot for the quick response!!

Just to confirm: the pods are not in the Pending state; it is just the TAO jobs that are pending.
Confirming below that the pods are not pending and that I'm using chart version 4.0.2:

g@gsrv:~$ kubectl get pods -n tao-gnet
NAME                                           READY   STATUS      RESTARTS       AGE
e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8     0/1     Completed   0              42m
ingress-nginx-controller-5cdbcc9966-7bqvh      1/1     Running     1 (4d5h ago)   6d20h
tao-toolkit-api-app-pod-6bf85c898-pmt56        1/1     Running     1 (4d5h ago)   6d20h
tao-toolkit-api-workflow-pod-58bc86fc9-kvfbv   1/1     Running     1 (4d5h ago)   6d20h
g@gsrv:~$ helm list -n tao-gnet
NAME           	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART                	APP VERSION
ingress-nginx  	tao-gnet 	1       	2023-06-15 17:03:15.735616432 +0000 UTC	deployed	ingress-nginx-4.7.0  	1.8.0      
tao-toolkit-api	tao-gnet 	2       	2023-06-22 11:53:32.29973565 +0000 UTC 	deployed	tao-toolkit-api-4.0.2	4.0.2 

I can't see a pod 690f86dc-5e77-44c1-8612-ecffa64dea06; I think you are referring to the job ID of the dataset conversion task for the eval dataset.

here is a list of all the pods in the TAO test k8 cluster

g@gsrv:~$ kubectl get pods -A 
NAMESPACE              NAME                                                              READY   STATUS      RESTARTS        AGE
calico-apiserver       calico-apiserver-64bf4f7f44-bnfk8                                 1/1     Running     4 (3d2h ago)    11d
calico-apiserver       calico-apiserver-64bf4f7f44-m8mc2                                 1/1     Running     4 (3d2h ago)    11d
calico-system          calico-kube-controllers-55f86d5fdf-4jfk6                          1/1     Running     4 (3d2h ago)    11d
calico-system          calico-node-59pj2                                                 1/1     Running     1 (4d5h ago)    11d
calico-system          calico-node-h8w6v                                                 1/1     Running     4 (3d2h ago)    11d
calico-system          calico-typha-5b58978b9f-cqrnd                                     1/1     Running     6 (3d2h ago)    11d
calico-system          csi-node-driver-8r2b8                                             2/2     Running     2 (4d5h ago)    11d
calico-system          csi-node-driver-xkcx4                                             2/2     Running     8 (3d2h ago)    11d
clearml                clearml-apiserver-76ff97d7f7-bmjkj                                1/1     Running     0               2d23h
clearml                clearml-elastic-master-0                                          1/1     Running     0               2d23h
clearml                clearml-fileserver-ff756c4b8-tnslk                                1/1     Running     0               2d23h
clearml                clearml-mongodb-5f9468969b-96llz                                  1/1     Running     0               2d23h
clearml                clearml-redis-master-0                                            1/1     Running     0               2d23h
clearml                clearml-webserver-7f5fb5df5d-kttc7                                1/1     Running     0               2d23h
gpu-operator           gpu-feature-discovery-qddx5                                       1/1     Running     1 (4d5h ago)    11d
gpu-operator           gpu-operator-1686418997-node-feature-discovery-master-5779828t8   1/1     Running     4 (3d2h ago)    11d
gpu-operator           gpu-operator-1686418997-node-feature-discovery-worker-jhpjz       1/1     Running     10 (3d2h ago)   11d
gpu-operator           gpu-operator-6dff6b976c-5nsxq                                     1/1     Running     16 (3d2h ago)   11d
gpu-operator           nvidia-cuda-validator-pjvpw                                       0/1     Completed   0               4d5h
gpu-operator           nvidia-dcgm-exporter-x5hn7                                        1/1     Running     1 (4d5h ago)    11d
gpu-operator           nvidia-device-plugin-daemonset-r9gmq                              1/1     Running     1 (4d5h ago)    11d
gpu-operator           nvidia-device-plugin-validator-gxbp2                              0/1     Completed   0               4d5h
gpu-operator           nvidia-mig-manager-fvn2x                                          1/1     Running     1 (4d5h ago)    11d
gpu-operator           nvidia-operator-validator-vdk78                                   1/1     Running     1 (4d5h ago)    11d
k8-storage             nfs-subdir-external-provisioner-5669cc5b6-jzhrj                   1/1     Running     5 (3d2h ago)    7d3h
kube-system            coredns-57575c5f89-bfclk                                          1/1     Running     4 (3d2h ago)    11d
kube-system            coredns-57575c5f89-vnm8v                                          1/1     Running     4 (3d2h ago)    11d
kube-system            etcd-gsrv                                                         1/1     Running     7 (3d2h ago)    11d
kube-system            kube-apiserver-gsrv                                               1/1     Running     7 (3d2h ago)    11d
kube-system            kube-controller-manager-gsrv                                      1/1     Running     4 (3d2h ago)    11d
kube-system            kube-proxy-65t82                                                  1/1     Running     1 (4d5h ago)    11d
kube-system            kube-proxy-n8ncf                                                  1/1     Running     4 (3d2h ago)    11d
kube-system            kube-scheduler-gsrv                                               1/1     Running     4 (3d2h ago)    11d
kubernetes-dashboard   kubernetes-dashboard-64bd57954b-n4fbx                             1/1     Running     8 (3d2h ago)    11d
tao-gnet               e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8                        0/1     Completed   0               51m
tao-gnet               ingress-nginx-controller-5cdbcc9966-7bqvh                         1/1     Running     1 (4d5h ago)    6d20h
tao-gnet               tao-toolkit-api-app-pod-6bf85c898-pmt56                           1/1     Running     1 (4d5h ago)    6d20h
tao-gnet               tao-toolkit-api-workflow-pod-58bc86fc9-kvfbv                      1/1     Running     1 (4d5h ago)    6d20h
tigera-operator        tigera-operator-649dd7bf97-k2tz7                                  1/1     Running     6 (3d2h ago)    11d

last 100 lines of the toolkit api pod (scheduled in tao-toolkit-api-pod-877*)

 kubectl logs --tail=100 -n tao-gnet tao-toolkit-api-app-pod-6bf85c898-pmt56
192.168.251.141 - - [22/Jun/2023:13:37:03 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:37:03 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:37:06 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:37:06 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:37:16 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:37:16 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:37:18 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:37:18 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:37:26 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:37:26 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:37:33 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:37:33 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:37:36 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:37:36 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:37:46 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:37:46 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:37:48 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:37:48 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:37:56 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:37:56 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:38:04 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:38:04 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:38:06 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:38:06 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:38:16 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:38:16 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:38:19 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:38:19 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:38:26 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:38:26 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:38:34 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:38:34 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:38:36 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:38:36 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:38:46 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:38:46 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:38:49 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:38:49 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:38:56 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:38:56 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:39:04 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:39:04 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:39:06 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:39:06 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:39:16 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:39:16 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:39:19 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:39:19 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:39:26 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:39:26 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:39:34 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:39:34 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:39:36 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:39:36 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:39:46 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:39:46 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:39:49 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:39:49 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:39:56 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:39:56 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
URL: https://aisrv.gnet.lan:32091/tao-gnet/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06
Token: ...EE0SS5UnTk
Found trusted user: f2d3c55a-f3dd-5dff-badc-851e27460122
192.168.251.141 - - [22/Jun/2023:13:40:04 +0000] "GET /api/v1/auth HTTP/1.1" 200 122 "-" "python-requests/2.28.2"
192.168.251.141 - - [22/Jun/2023:13:40:04 +0000] "GET /api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/444045d0-73cf-44df-bca4-ae8d6fc1c638/job/690f86dc-5e77-44c1-8612-ecffa64dea06 HTTP/1.1" 200 282 "-" "python-requests/2.28.2"
172.16.3.2 - - [22/Jun/2023:13:40:06 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.24"
172.16.3.2 - - [22/Jun/2023:13:40:06 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.24"

logs for the workflow pod

kubectl logs -f -n tao-gnet tao-toolkit-api-workflow-pod-58bc86fc9-kvfbv
NGC CLI 3.19.0
ssd dataset_convert --results_dir /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 --output_filename /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/specs/e1bf0379-105e-4353-a026-6bf405b2cb62.yaml  > /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/logs/e1bf0379-105e-4353-a026-6bf405b2cb62.txt 2>&1 >> /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/logs/e1bf0379-105e-4353-a026-6bf405b2cb62.txt; find /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 -type d | xargs chmod 777; find /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 -type f | xargs chmod 666 /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created e1bf0379-105e-4353-a026-6bf405b2cb62
Post running
Job Done: e1bf0379-105e-4353-a026-6bf405b2cb62 Final status: Done

Note: Here we can see the job completion for the dataset conversion job (e1bf0379-105e-4353-a026-6bf405b2cb62) for the training dataset (see the screenshot below).


job completion confirmation (matching the workflow log)

After that, there is no log entry in the workflow log for the dataset conversion job for the eval dataset (690f86dc-5e77-44c1-8612-ecffa64dea06), nor the POST call for its execution. I think this is interesting (maybe a symptom).
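For completeness, the stuck job itself is still queryable over the API; a quick sketch (reusing base_url, headers and rootca from the earlier listing snippet; the dataset-job path mirrors the URLs in the app-pod log above, and the IDs are copied from that log):

import json

import requests

dataset_ID = "444045d0-73cf-44df-bca4-ae8d6fc1c638"   # eval dataset (from the app-pod log)
job_ID = "690f86dc-5e77-44c1-8612-ecffa64dea06"       # stuck dataset_convert job
endpoint = f"{base_url}/dataset/{dataset_ID}/job/{job_ID}"
response = requests.get(endpoint, headers=headers, verify=rootca)
print(json.dumps(response.json(), sort_keys=True, indent=4))   # status remains "Pending"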

describing the toolkit api pod

g@gsrv:~$ kubectl describe pod  -n tao-gnet tao-toolkit-api-app-pod-6bf85c898-pmt56
Name:         tao-toolkit-api-app-pod-6bf85c898-pmt56
Namespace:    tao-gnet
Priority:     0
Node:         dgx/172.16.3.2
Start Time:   Thu, 15 Jun 2023 17:04:14 +0000
Labels:       name=tao-toolkit-api-app-pod
              pod-template-hash=6bf85c898
Annotations:  cni.projectcalico.org/containerID: f8fea9237384f3534989a07d9394214f80cf18c126d3b49883cf1ab7ca401df1
              cni.projectcalico.org/podIP: 192.168.251.151/32
              cni.projectcalico.org/podIPs: 192.168.251.151/32
Status:       Running
IP:           192.168.251.151
IPs:
  IP:           192.168.251.151
Controlled By:  ReplicaSet/tao-toolkit-api-app-pod-6bf85c898
Containers:
  tao-toolkit-api-app:
    Container ID:   containerd://18ce16ef81ec5313ae6752435fa0265aa99e38305e63aa0be99de38abb5df683
    Image:          nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api
    Image ID:       nvcr.io/nvidia/tao/tao-toolkit@sha256:44ee1bd26dd9b0122c83f3cc5bb2d82c105490a4c8b5a72de95ae0310dad3efe
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sun, 18 Jun 2023 08:32:29 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Thu, 15 Jun 2023 17:04:17 +0000
      Finished:     Sun, 18 Jun 2023 08:30:46 +0000
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:8000/api/v1/health/liveness delay=3s timeout=3s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8000/api/v1/health/readiness delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      NAMESPACE:        tao-gnet
      CLAIMNAME:        tao-toolkit-api-pvc
      IMAGEPULLSECRET:  imagepullsecret
      AUTH_CLIENT_ID:   bnSePYullXlG-504nOZeNAXemGF6DhoCdYR8ysm088w
      NUM_GPUS:         4
      BACKEND:          local-k8s
      IMAGE_TF:         nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_PYT:        nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
      IMAGE_DNV2:       nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_DEFAULT:    nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_API:        nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api
    Mounts:
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4lzzt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  kube-api-access-4lzzt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

describing the workflow pod

kubectl describe pod  -n tao-gnet tao-toolkit-api-workflow-pod-58bc86fc9-kvfbv
Name:         tao-toolkit-api-workflow-pod-58bc86fc9-kvfbv
Namespace:    tao-gnet
Priority:     0
Node:         dgx/172.16.3.2
Start Time:   Thu, 15 Jun 2023 17:04:14 +0000
Labels:       name=tao-toolkit-api-workflow-pod
              pod-template-hash=58bc86fc9
Annotations:  cni.projectcalico.org/containerID: 738c3d0312acd3cafcfb0df57d4ec8044ecd6fa5e2d648a5fa0a551c2042ba59
              cni.projectcalico.org/podIP: 192.168.251.170/32
              cni.projectcalico.org/podIPs: 192.168.251.170/32
Status:       Running
IP:           192.168.251.170
IPs:
  IP:           192.168.251.170
Controlled By:  ReplicaSet/tao-toolkit-api-workflow-pod-58bc86fc9
Containers:
  tao-toolkit-api-workflow:
    Container ID:  containerd://6b246c5d2ec7370e5b85963798d19ee0cc51467224df314e882cfddec9cb7036
    Image:         nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api
    Image ID:      nvcr.io/nvidia/tao/tao-toolkit@sha256:44ee1bd26dd9b0122c83f3cc5bb2d82c105490a4c8b5a72de95ae0310dad3efe
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      workflow_start.sh
    State:          Running
      Started:      Sun, 18 Jun 2023 08:32:45 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Thu, 15 Jun 2023 17:04:18 +0000
      Finished:     Sun, 18 Jun 2023 08:30:46 +0000
    Ready:          True
    Restart Count:  1
    Environment:
      NAMESPACE:               tao-gnet
      CLAIMNAME:               tao-toolkit-api-pvc
      IMAGEPULLSECRET:         imagepullsecret
      NUM_GPUS:                4
      TELEMETRY_OPT_OUT:       no
      BACKEND:                 local-k8s
      IMAGE_TF:                nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_PYT:               nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
      IMAGE_DNV2:              nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_DEFAULT:           nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_API:               nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api
      WANDB_API_KEY:           
      CLEARML_WEB_HOST:        http://clearml.gnet.lan:30080
      CLEARML_API_HOST:        http://clearml.gnet.lan:30008
      CLEARML_FILES_HOST:      http://clearml.gnet.lan:30081
      CLEARML_API_ACCESS_KEY:  PG1LOUANR46HUDPF2YXS
      CLEARML_API_SECRET_KEY:  o05lPvE48ayokDD8pyGgOudw1ZD5YMkDYCxMdtMb3U9DoX6vxJ
    Mounts:
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5lh46 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  kube-api-access-5lh46:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Thanks again for all the help; I hope this information is useful.

Cheers,
Ganindu.

May I know what the above pod is? Could you describe it and also get the log?

It looks like it was created for the dataset conversion job for the training set (e1bf0379-105e-4353-a026-6bf405b2cb62), which completed successfully, as I described above with the two screenshots and the log excerpt from the workflow pod. It looks like it did the job and terminated successfully.

I reckon there should've been another pod like this for the second job, but it never got created?
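As a quick cross-check, the TAO job pods can be enumerated by the purpose=tao-toolkit-job label that shows up in the describe output below. A minimal sketch using the kubernetes Python client (assuming the kubernetes package is installed and the same kubeconfig that kubectl uses is readable):

from kubernetes import client, config

# List TAO job pods in the tao-gnet namespace together with their phases,
# to see whether a pod for the second dataset-convert job was ever created.
config.load_kube_config()   # same credentials kubectl uses
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("tao-gnet", label_selector="purpose=tao-toolkit-job")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)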

g@gsrv:~$ kubectl describe pod -n tao-gnet e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8
Name:         e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8
Namespace:    tao-gnet
Priority:     0
Node:         dgx/172.16.3.2
Start Time:   Thu, 22 Jun 2023 12:53:20 +0000
Labels:       controller-uid=ff3ae2e6-f63d-4529-a790-8b20e8ee181e
              job-name=e1bf0379-105e-4353-a026-6bf405b2cb62
              purpose=tao-toolkit-job
Annotations:  cni.projectcalico.org/containerID: dfb6eed26d478721c70cbb1419a0d0da87ca4e785716cb6c6803bf38492742fa
              cni.projectcalico.org/podIP: 
              cni.projectcalico.org/podIPs: 
Status:       Succeeded
IP:           192.168.251.191
IPs:
  IP:           192.168.251.191
Controlled By:  Job/e1bf0379-105e-4353-a026-6bf405b2cb62
Containers:
  container:
    Container ID:  containerd://bf8e2606576f391e3284194dc91cd2e941e2172d25af3d930cb4bf9c12869495
    Image:         nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
    Image ID:      nvcr.io/nvidia/tao/tao-toolkit@sha256:6282b5b09220942e321a452109ad40cde47e5e490480c405c92b930fff2b0574
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      umask 0 && ssd dataset_convert --results_dir /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 --output_filename /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/specs/e1bf0379-105e-4353-a026-6bf405b2cb62.yaml  > /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/logs/e1bf0379-105e-4353-a026-6bf405b2cb62.txt 2>&1 >> /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/logs/e1bf0379-105e-4353-a026-6bf405b2cb62.txt; find /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 -type d | xargs chmod 777; find /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 -type f | xargs chmod 666
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 22 Jun 2023 12:53:21 +0000
      Finished:     Thu, 22 Jun 2023 12:53:39 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      NUM_GPUS:                1
      TELEMETRY_OPT_OUT:       no
      WANDB_API_KEY:           
      CLEARML_WEB_HOST:        http://clearml.gnet.lan:30080
      CLEARML_API_HOST:        http://clearml.gnet.lan:30008
      CLEARML_FILES_HOST:      http://clearml.gnet.lan:30081
      CLEARML_API_ACCESS_KEY:  PG1LOUANR46HUDPF2YXS
      CLEARML_API_SECRET_KEY:  o05lPvE48ayokDD8pyGgOudw1ZD5YMkDYCxMdtMb3U9DoX6vxJ
    Mounts:
      /dev/shm from dshm (rw)
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c24x9 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-c24x9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

No logs:

g@gsrv:~$ kubectl logs -f  -n tao-gnet e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8

Could you delete the pod e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8 and then try to run the two cells below for the eval dataset? Before that, please click the ■ "stop" button to interrupt the cell execution that is stuck in the "Pending" status.
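For reference, the delete can also be scripted rather than typed by hand; a minimal sketch with the kubernetes Python client, equivalent to kubectl delete pod -n tao-gnet e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8:

from kubernetes import client, config

# Delete the completed dataset-convert job pod named in the thread above.
config.load_kube_config()
v1 = client.CoreV1Api()
v1.delete_namespaced_pod(
    name="e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8",
    namespace="tao-gnet",
)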


Yup, that worked!!! As soon as I deleted that pod (e1bf0379-105e-4353-a026-6bf405b2cb62-jzcg8), another pod (690f86dc-5e77-44c1-8612-ecffa64dea06-7phqk) got created automagically and finished the job!! :D

g@gsrv:~$ kubectl get pods -n tao-gnet 
NAME                                           READY   STATUS      RESTARTS       AGE
690f86dc-5e77-44c1-8612-ecffa64dea06-7phqk     0/1     Completed   0              3m11s
ingress-nginx-controller-5cdbcc9966-7bqvh      1/1     Running     1 (4d6h ago)   6d21h
tao-toolkit-api-app-pod-6bf85c898-pmt56        1/1     Running     1 (4d6h ago)   6d21h
tao-toolkit-api-workflow-pod-58bc86fc9-kvfbv   1/1     Running     1 (4d6h ago)   6d21h

The workflow logs also confirm this:

g@gsrv:~$ kubectl logs -n tao-gnet tao-toolkit-api-workflow-pod-58bc86fc9-kvfbv
NGC CLI 3.19.0
ssd dataset_convert --results_dir /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 --output_filename /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/specs/e1bf0379-105e-4353-a026-6bf405b2cb62.yaml  > /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/logs/e1bf0379-105e-4353-a026-6bf405b2cb62.txt 2>&1 >> /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/logs/e1bf0379-105e-4353-a026-6bf405b2cb62.txt; find /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 -type d | xargs chmod 777; find /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62 -type f | xargs chmod 666 /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/3eaaacc8-82c6-4e7a-bdd3-aab43caeb39a/e1bf0379-105e-4353-a026-6bf405b2cb62/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created e1bf0379-105e-4353-a026-6bf405b2cb62
Post running
Job Done: e1bf0379-105e-4353-a026-6bf405b2cb62 Final status: Done
ssd dataset_convert --results_dir /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/444045d0-73cf-44df-bca4-ae8d6fc1c638/690f86dc-5e77-44c1-8612-ecffa64dea06 --output_filename /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/444045d0-73cf-44df-bca4-ae8d6fc1c638/tfrecords/tfrecords --verbose --dataset_export_spec /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/444045d0-73cf-44df-bca4-ae8d6fc1c638/specs/690f86dc-5e77-44c1-8612-ecffa64dea06.yaml  > /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/444045d0-73cf-44df-bca4-ae8d6fc1c638/logs/690f86dc-5e77-44c1-8612-ecffa64dea06.txt 2>&1 >> /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/444045d0-73cf-44df-bca4-ae8d6fc1c638/logs/690f86dc-5e77-44c1-8612-ecffa64dea06.txt; find /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/444045d0-73cf-44df-bca4-ae8d6fc1c638/690f86dc-5e77-44c1-8612-ecffa64dea06 -type d | xargs chmod 777; find /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/444045d0-73cf-44df-bca4-ae8d6fc1c638/690f86dc-5e77-44c1-8612-ecffa64dea06 -type f | xargs chmod 666 /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/444045d0-73cf-44df-bca4-ae8d6fc1c638/690f86dc-5e77-44c1-8612-ecffa64dea06/status.json
nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Job created 690f86dc-5e77-44c1-8612-ecffa64dea06
Post running
Job Done: 690f86dc-5e77-44c1-8612-ecffa64dea06 Final status: Done

Hopefully this means that as long as I delete the completed pod I can keep using the tao-toolkit!!
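If deleting the completed pod by hand keeps being necessary, a small cleanup loop could automate the workaround. A sketch, assuming the kubernetes Python client and the purpose=tao-toolkit-job label seen on the job pods above (the 60-second interval is an arbitrary choice):

import time

from kubernetes import client, config

# Workaround sketch: periodically remove Succeeded TAO job pods so the next
# queued job can be scheduled, mirroring the manual deletion that worked above.
config.load_kube_config()
v1 = client.CoreV1Api()

while True:
    pods = v1.list_namespaced_pod("tao-gnet", label_selector="purpose=tao-toolkit-job")
    for pod in pods.items:
        if pod.status.phase == "Succeeded":
            print(f"Deleting completed job pod {pod.metadata.name}")
            v1.delete_namespaced_pod(pod.metadata.name, "tao-gnet")
    time.sleep(60)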

However, is this happening because of something I did? I changed the values.yaml slightly because the toolkit never worked with my NFS storage:

# TAO Toolkit API container info
image: nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api
imagePullSecret: imagepullsecret
imagePullPolicy: Always 

# Optional proxy settings
#httpsProxy: http://10.194.54.59:3128
#myCertConfig: my-cert

# Optional HTTPS settings for ingress controller
#host: mydomain.com
#tlsSecret: tls-secret
#corsOrigin: https://mydomain.com

host: aisrv.gnet.lan
tlsSecret: tao-aisrv-gnet-secret


# Shared storage info
#storageClassName: nfs-client
storageClassName: tao-nfs-storage
storageAccessMode: ReadWriteMany
storageSize: 100Gi
ephemeral-storage: 8Gi
limits.ephemeral-storage: 8Gi
requests.ephemeral-storage: 4Gi

# Optional NVIDIA Starfleet authentication
#authClientId: bnSePYullXlG-504nOZeNAXemGF6DhoCdYR8ysm088w

# Starting TAO Toolkit jobs info
backend: local-k8s
numGpus: 4
imageTf: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
imagePyt: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
imageDnv2: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
imageDefault: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5

# To opt out of providing anonymous telemetry data to NVIDIA
#telemetryOptOut: no

# Optional MLOPS setting for Weights And Biases
#wandbApiKey: cf23df2207d99a74fbe169e3eba035e633b63d13

# Optional MLOPS setting for ClearML
clearMlWebHost: http://clearml.gnet.lan:30080
clearMlApiHost: http://clearml.gnet.lan:30008
clearMlFilesHost: http://clearml.gnet.lan:30081
clearMlApiAccessKey: ..
clearMlApiSecretKey: ...
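These values should surface as environment variables on the running workflow pod (compare the describe output further up). A quick sketch to confirm the deployed pod actually picked them up, assuming the kubernetes Python client and the name=tao-toolkit-api-workflow-pod label from that output:

from kubernetes import client, config

# Print a few of the chart-driven environment variables from the workflow pod
# to confirm the values.yaml changes were actually deployed.
config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("tao-gnet", label_selector="name=tao-toolkit-api-workflow-pod")
for pod in pods.items:
    for var in pod.spec.containers[0].env or []:
        if var.name in ("BACKEND", "NUM_GPUS", "CLAIMNAME", "IMAGE_DEFAULT", "IMAGE_TF", "IMAGE_PYT"):
            print(f"{var.name}={var.value}")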

My PV:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: dgx-local-tao-pv
spec:
  storageClassName: tao-nfs-storage
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /mnt/tao-local-dgx
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - dgx
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem

The helm command for the NFS storage (which works for everything except the TAO Toolkit):

helm upgrade --install  -n tao-gnet --create-namespace nfs-subdir-external-provisioner-tao nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=172.16.1.22 --set nfs.path=/taopv --set storageClass.name=tao-nfs-storage --set storageClass.accessModes=ReadWriteMany

The storage class (for the above chart):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: tao-nfs-storage
provisioner: cluster.local/gnet-tao-nfs-client-provisioner
parameters:
  archiveOnDelete: "false"
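To confirm that tao-toolkit-api-pvc really bound to the hostPath PV above (with the tao-nfs-storage class) rather than to a dynamically provisioned NFS volume, the binding can be inspected; a sketch with the kubernetes Python client:

from kubernetes import client, config

# Inspect which PV the TAO PVC bound to, plus its storage class and reclaim policy.
config.load_kube_config()
v1 = client.CoreV1Api()

pvc = v1.read_namespaced_persistent_volume_claim("tao-toolkit-api-pvc", "tao-gnet")
print("PVC phase:     ", pvc.status.phase)
print("Storage class: ", pvc.spec.storage_class_name)
print("Bound volume:  ", pvc.spec.volume_name)

if pvc.spec.volume_name:
    pv = v1.read_persistent_volume(pvc.spec.volume_name)
    print("Reclaim policy:", pv.spec.persistent_volume_reclaim_policy)
    print("Host path:     ", pv.spec.host_path.path if pv.spec.host_path else "<dynamically provisioned>")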

TL;DR: is there a way to make things work without having to delete the completed pod manually all the time? (e.g. maybe persistentVolumeReclaimPolicy?)

Besides values.yaml, is there any other change? I am not sure about the status of this kind of corner case yet.