TAO 5.0 job stuck in "pending" and job_id stuck in "ContainerCreating"

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) AutoML
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) v5.0.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I am running the classification notebook in AutoML. When I run the cell that runs the action and starts the job,

the cell creates a pod with the same name as the job_id, and then another pod [e4eca54c-e2bb-4feb-a476-0de3ea24db47] is created, which stays stuck in ContainerCreating status indefinitely.

Here are the logs of all the pods

[Another thing you can see is that even though I stopped the execution, it still keeps relaunching the pod from the previous attempt as well. I delete the pod, but it keeps relaunching again. Pod af82df1b-e7cb-4e2f-b36d-1c54c24bb79e]

I have looked at this issue but that didn’t help.

How many GPUs did you set?

Did you uninstall previous TAO API completely?
$ bash setup.sh uninstall

Can you run below?
$ kubectl describe pod e4eca54c-e2bb-4feb-a476-0de3ea24db47

Please note that when a pod is in “ContainerCreating”, it may still be pulling a Docker image. For example, the above command may show an event like the one below:


Normal Pulling 81s kubelet Pulling image “nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5”

In this case, please wait until the pod reaches the “complete” state, and then wait until its task is finished. In short, please wait.
You can check on its task by running "$ kubectl describe pod xxx" and looking at the log mentioned in its output.
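
To make the check above a bit quicker, here is a minimal sketch that filters the `kubectl describe pod` output for image-pull events and for failure events. The function names `pulling_events` and `pull_failures` are only illustrative (they are not part of TAO or kubectl), and the match patterns are an assumption about which event text is worth looking at.

```shell
# Sketch: tell "still pulling the image" apart from "really stuck".
# Both functions read `kubectl describe pod` output on stdin.

# Print every event line that reports an image pull in progress.
pulling_events() {
  grep -E 'Pulling image' || true
}

# Print event lines that suggest the pull or the container is failing
# (for example ErrImagePull, ImagePullBackOff, or a back-off restart).
pull_failures() {
  grep -E 'Failed|ErrImagePull|BackOff' || true
}

# Usage against a live cluster (pod name taken from this thread):
#   POD=e4eca54c-e2bb-4feb-a476-0de3ea24db47
#   kubectl describe pod "$POD" | pulling_events
#   kubectl describe pod "$POD" | pull_failures
#   kubectl get pod "$POD" --watch    # follow status changes while waiting
```

If `pulling_events` prints a line and `pull_failures` prints nothing, the pod is most likely just waiting on the image download.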

The pod image is nearly 10 GB, so it appears to be stopped.

Also, after a while the pod always returns CrashLoopBackOff.

Found the solution: including the NGC credentials in the “containerd” configuration.
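
The post above does not show the actual change, so the following is only a sketch of what such a configuration might look like. It assumes the CRI plugin's registry auth section in the containerd config (commonly `/etc/containerd/config.toml`) and the usual NGC login convention where the username is the literal string `$oauthtoken` and the password is the NGC API key. The function name `print_ngc_auth` is hypothetical, and depending on the containerd version this section may be deprecated, so please check the containerd documentation for your release.

```shell
# Sketch: print a registry-auth fragment for nvcr.io so it can be reviewed
# and merged by hand into the containerd config.
# ASSUMPTION: the CRI plugin's registry.configs auth section is used;
# the thread does not confirm the exact setting.
print_ngc_auth() {
  ngc_key=$1   # NGC API key, supplied by the caller
  cat <<EOF
[plugins."io.containerd.grpc.v1.cri".registry.configs."nvcr.io".auth]
  username = "\$oauthtoken"
  password = "${ngc_key}"
EOF
}

# Usage (placeholder key; containerd must be restarted after the edit):
#   print_ngc_auth "YOUR_NGC_API_KEY"
#   sudo systemctl restart containerd
```

The function only prints the fragment instead of editing the file, so nothing on the node changes until it is merged manually.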

Thanks guys
For now it seems to be working
I just killed the pods and started again the next day and it worked
I’ll accept this answer as the solution for now
