Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): AutoML
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): v5.0.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I am running the classification notebook in AutoML.
When I run the cell that runs the action and starts the job,
it creates a pod whose name is the same as the job_id, and then another pod [e4eca54c-e2bb-4feb-a476-0de3ea24db47] is created, which stays stuck in ContainerCreating status indefinitely.
[Another thing you can see is that even though I stopped the execution, the pod from the previous attempt keeps being relaunched as well. I delete the pod, but it gets relaunched again. Pod af82df1b-e7cb-4e2f-b36d-1c54c24bb79e]
In this case, please wait until the pod reaches the “complete” state, and then wait for its task to finish. In short, please wait.
You can check its task by running "kubectl describe pod xxx" and looking at the log mentioned in its output.
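If it helps, here is a rough sketch of how the same check could be scripted instead of running describe on each pod by hand. It is only an illustration, not part of the TAO notebook: the helper names `stuck_pods` and `fetch_pods` are made up for this post, and the `default` namespace is an assumption you may need to change. It reads the output of `kubectl get pods -o json` and lists every pod that has a container in a waiting state, together with the reported reason (for example ContainerCreating).

```python
import json
import subprocess


def stuck_pods(pod_list):
    """Return (pod name, phase, waiting reason) for each waiting container.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    Hypothetical helper, written for illustration only.
    """
    results = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting")
            if waiting:
                results.append((name, phase, waiting.get("reason", "")))
    return results


def fetch_pods(namespace="default"):
    """Call kubectl and parse its JSON output.

    The namespace is an assumption; use the one your pods run in.
    """
    completed = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


# Example usage (requires kubectl access to the cluster):
# for name, phase, reason in stuck_pods(fetch_pods()):
#     print(f"{name}: phase={phase}, waiting reason={reason}")
```

For any pod this reports, "kubectl describe pod xxx" is still the place to look for the events explaining why the container has not started.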
Thanks guys
For now it seems to be working.
I just killed the pods, started again the next day, and it worked.
I’ll accept this answer as the solution for now.