AutoML job is stuck halfway through and GPU status check is non-responsive

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) AutoML
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) v4.0.2
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

I ran the AutoML Bayesian experiments with Efficientnet_b1_relu overnight, and when I checked this morning it was still stuck at experiment 10. I tried checking the status of the GPU, but kubectl exec nvidia-smi-mlrsh-riverplate -- nvidia-smi is non-responsive.
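For reference, a rough sketch of the Kubernetes-side checks that can be run in this situation (the pod name is the one from the command above; namespaces and exact resource names depend on the deployment):

kubectl get pods                                          # confirm the GPU pod is still listed as Running
kubectl describe pod nvidia-smi-mlrsh-riverplate          # look for restarts or warning events on the pod
kubectl exec nvidia-smi-mlrsh-riverplate -- nvidia-smi    # the call that hangs in this case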

Here are the logs:

Controller.json says the next experiment is running, but nothing is being updated in log.txt, which has been stuck at the 11th epoch for multiple hours now. The status has said "running" for a very long time.
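To double-check whether training is actually progressing, one option is to watch the log file and its modification time (the path below is only a placeholder for wherever the AutoML experiment directory is mounted):

tail -f /path/to/experiment_dir/log.txt    # watch whether new epoch lines appear
stat /path/to/experiment_dir/log.txt       # check the last-modified timestamp of the log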

Could you please share the logs of tao-toolkit-api-app-pod and tao-toolkit-api-workflow-pod?
You can upload them via the upload button.
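For example, the logs can be captured with something like the following (the full pod names carry generated suffixes, so list the pods first):

kubectl get pods | grep tao-toolkit-api              # find the full pod names
kubectl logs <full-app-pod-name> > app-pod.log       # save the api-app pod log
kubectl logs <full-workflow-pod-name> > workflow-pod.log   # save the workflow pod log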

I've migrated to TAO 5.0.
I will update if I see this with a 5.0 experiment as well!
