TAO 5.0 job stuck in "pending" and job_id stuck in "ContainerCreating"

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) AutoML
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) v5.0.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I am running the classification notebook in AutoML. When I run the cell that starts the action and launches the job,

the cell creates a pod with the same name as the job_id, and then another pod [e4eca54c-e2bb-4feb-a476-0de3ea24db47] is created, which stays stuck in ContainerCreating status indefinitely.

Here are the logs of all the pods

[Another thing you can see is that even though I stopped the execution, the pod from the previous attempt keeps being relaunched as well. I delete the pod, but it is recreated every time. Pod af82df1b-e7cb-4e2f-b36d-1c54c24bb79e]

I have looked at this issue but that didn’t help.

How many GPUs did you set?

Did you uninstall previous TAO API completely?
$ bash setup.sh uninstall

Can you run below?
$ kubectl describe pod e4eca54c-e2bb-4feb-a476-0de3ea24db47

Please note that when a pod is in “ContainerCreating”, it may still be pulling a docker image. For example, the above command may show a log like the one below:

Normal Pulling 81s kubelet Pulling image “nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5”

So, in this case, please wait until the pod reaches the “complete” state, and then wait for its task to finish. In short, please be patient.
You can check on its task by running "$ kubectl describe pod xxx" and looking at the log it reports.
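
As a rough sketch of the check described above, the small helper below reads the text of "kubectl describe pod" on stdin and reports whether the image is still being pulled. The function name pull_state is made up for illustration and is not part of TAO or kubectl; it only matches the usual kubelet event messages ("Pulling image", "Successfully pulled image", "ErrImagePull", "ImagePullBackOff", "Failed to pull image"), so treat the result as a hint rather than a definitive status.

```shell
# Hypothetical helper (not part of TAO or kubectl): classify a pod's image-pull
# progress from the text of `kubectl describe pod <name>` read on stdin.
pull_state() {
  events=$(cat)
  case $events in
    # The image finished downloading at some point.
    *"Successfully pulled image"*) echo "pulled" ;;
    # The kubelet reported a pull error (often missing registry credentials).
    *ErrImagePull*|*ImagePullBackOff*|*"Failed to pull image"*) echo "pull-failed" ;;
    # A pull was started and neither a success nor a failure is recorded yet.
    *"Pulling image"*) echo "pulling" ;;
    *) echo "unknown" ;;
  esac
}

# Example usage (needs access to the cluster):
#   kubectl describe pod e4eca54c-e2bb-4feb-a476-0de3ea24db47 | pull_state
```

If it prints "pulling", waiting is probably the right move; "pull-failed" suggests looking at the registry credentials instead.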

The pod image is close to 10 GB, so the pod appears to be stalled while it downloads.

Also, after a while the pod always ends up in CrashLoopBackOff.

I found the solution: including the NGC credentials in the “containerd” configuration.
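
The post does not say exactly what was added, so the following is only a guess at what such a configuration might look like. It assumes containerd 1.x with a version 2 config file at /etc/containerd/config.toml and uses the CRI plugin's registry auth table for nvcr.io, with the standard NGC login name $oauthtoken and the NGC API key as the password. The function name ngc_containerd_auth is made up for illustration; newer containerd releases use a different plugin section, so check the documentation for your version.

```shell
# Hypothetical helper: print a containerd registry auth fragment for nvcr.io.
# Assumption: containerd 1.x, config version 2, CRI plugin registry settings.
# $1 is the NGC API key; the NGC user name is the literal string $oauthtoken.
ngc_containerd_auth() {
  cat <<EOF
[plugins."io.containerd.grpc.v1.cri".registry.configs."nvcr.io".auth]
  username = "\$oauthtoken"
  password = "$1"
EOF
}

# Example usage (review the file first so the table is not defined twice):
#   ngc_containerd_auth "$NGC_API_KEY" | sudo tee -a /etc/containerd/config.toml
#   sudo systemctl restart containerd
```

After containerd restarts, deleting the stuck pod should let the kubelet retry the pull with the credentials in place.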

Thanks guys
For now it seems to be working
I just killed the pods and started again the next day and it worked
I’ll accept this answer as the solution for now

