TAO API (Kubernetes pod) troubleshooting: TAO API jobs stuck in "Pending" state indefinitely


The only thing is that I did not add the bits for making the k8s secret (above). Earlier I had custom ingress classes etc., but to make the problem simpler I got rid of all that. So the answer is "nothing that I'm aware of" :) (I wonder if it is a cluster thing that prevents the API/workflow workers from getting rid of the completed container.)

If this solution keeps working, I'm happy to delete the completed pod manually for the time being. Hopefully this will get fixed in a later release?

Shall I accept the solution as it is and let you guys look into it in your own time?

Just to confirm: when you continue the notebook, if the cell does not get stuck in "Pending", you need not delete the completed pod, right?

Yes! I don't need to delete the "job-pod" manually unless the job is pending; however, from what I've seen, the next job will get stuck in Pending if the previous "job-pod" is not deleted.

Having said that, since I've gotten to the point of model re-training, the previous job-pods for preceding tasks such as TFRecords creation should have been successfully removed as intended (I don't remember seeing more than one "Completed" pod in the TAO Toolkit namespace).


OK, so once you hit the "Pending" cell, please use this temporary workaround to unblock the notebook.
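For reference, the workaround (finding and deleting the completed job-pod) can be done with two kubectl commands. This is a minimal sketch; the namespace `default` is an assumption, so adjust `-n` to wherever your TAO Toolkit pods actually run:

```shell
# Assumption: TAO Toolkit job pods run in the "default" namespace;
# change -n to match your deployment.

# List pods that have finished (shown as "Completed", phase Succeeded):
kubectl get pods -n default --field-selector=status.phase=Succeeded

# Delete all completed job pods in one go so the next TAO job can be scheduled:
kubectl delete pods -n default --field-selector=status.phase=Succeeded
```

Using the `status.phase=Succeeded` field selector avoids having to copy the pod name by hand, and it is safe in the sense that it only touches pods that have already run to completion.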


Thanks! I will accept the answer (suggestion to delete the completed job-pod)!!
Cheers for the fantastic support!!

