Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) AutoML
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) v4.0.2
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I ran the AutoML Bayesian experiments with Efficientnet_b1_relu overnight, and when I checked this morning it was still stuck at experiment 10. I tried checking the status of the GPU, but the following command is non-responsive:
kubectl exec nvidia-smi-mlrsh-riverplate -- nvidia-smi
Here are the logs:
Controller.json says the next experiment is running, but nothing is being written to log.txt, which has been stuck at the 11th epoch for multiple hours now. The experiment status has also been showing as running for a very long time.
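To put a number on how long log.txt has been idle, a minimal sketch like the one below could be used. It only looks at the file's last-modified time; the function names, the one-hour threshold, and the assumption that log.txt is readable from wherever the script runs are all illustrative and not part of TAO/TLT.

```python
import os
import time
from typing import Optional


def seconds_since_update(path: str, now: Optional[float] = None) -> float:
    """Return the number of seconds since `path` was last modified."""
    # `now` can be passed explicitly to make the check reproducible.
    current = time.time() if now is None else now
    return current - os.path.getmtime(path)


def is_stalled(path: str, threshold_s: float = 3600.0,
               now: Optional[float] = None) -> bool:
    """Treat the log as stalled if it has not changed for `threshold_s` seconds.

    The one-hour default is an arbitrary assumption; an epoch that
    legitimately takes longer than that would need a larger threshold.
    """
    return seconds_since_update(path, now) > threshold_s
```

For example, is_stalled("log.txt") returns True when the file has not been modified in the last hour. This only shows that the log stopped updating; it does not explain why the training job or nvidia-smi hangs.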