The only thing is that I did not add the bits for creating the k8s secret (above). Earlier I had custom ingress classes etc., but to make the problem simpler I got rid of all that. So the answer is “nothing that I’m aware of” :) (I do wonder if it is a cluster-level thing that prevents the API/workflow workers from removing the completed container.)
If this solution keeps working, I’m happy to delete the completed pod manually for the time being. Hopefully this will get fixed in a later release?
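For reference, something along these lines should cover the manual cleanup (assuming the finished job-pod ends up in the Succeeded phase; `<tao-namespace>` is just a placeholder for wherever the TAO Toolkit API is deployed):

```bash
# Delete all pods that have finished successfully (shown as "Completed" by kubectl)
kubectl delete pod --field-selector=status.phase=Succeeded -n <tao-namespace>
```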
Shall I accept the solution as it is and let you guys look into it in your own time?
Yes! I don’t need to delete the “job-pod” manually unless the job is pending; however, from what I’ve seen, the next job will get stuck in Pending if the previous “job-pod” is not deleted.
Having said that, since I’ve gotten to the point of model re-training, the previous job-pods for preceding tasks such as TFRecords creation must have been removed successfully as intended (I don’t remember seeing more than one “Completed” pod in the TAO Toolkit namespace).
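To check whether a leftover completed pod is hanging around before the next job runs, something like this should work (again, `<tao-namespace>` is a placeholder):

```bash
# List any pods in the namespace that have already finished
kubectl get pods -n <tao-namespace> --field-selector=status.phase=Succeeded
```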