AutoML for v4.0.2 with Efficientnet_b1_relu

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Classification, efficientnet_b1_relu
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): v4.0.2
• Training spec file (if available, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I am working with the AutoML notebook notebooks/tao-getting-started_v4.0.2/notebooks/tao_api_starter_kit/api/automl/classification.ipynb

I want to select efficientnet_b1_relu for my experiments, and I specify that in the Assign PTM section.

But when I load the training specs, they still contain the ResNet config.

Even if I manually change the specs accordingly, when I run the training action the monitoring page does refresh, but it keeps saying it will update after the first epoch; it takes a long time and never updates. Is there any way I can access the logs to figure out what's going wrong?

You can check the logs via
$ kubectl get pods
$ kubectl logs -f workflow-pod-xxxxx-xxxxx

Then find the command at the end of that output, locate the log path it references, and check that log.
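For example, a rough sketch (the pod names and the log path are placeholders, and the exact wording at the end of the workflow log depends on your job):
$ kubectl get pods | grep workflow
# the end of the workflow pod output shows the command that was run; find the log path it references
$ kubectl logs workflow-pod-xxxxx-xxxxx | tail -n 50
# then follow that log, e.g. from inside the pod that mounts the shared workspace (pod name and path are placeholders)
$ kubectl exec -it tao-toolkit-api-app-pod-xxxxx -- tail -f <log path from above>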

Which version of TAO API did you install?

I have v4.0.2.
I couldn't migrate to v5.0, as mentioned in Login issue JWT on TAO-API with Jupyter - #14 by amogh.dabholkar

I believe you did not run into this issue previously, right?

Can you re-install TAO-API with helm?

Please run the following commands.

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.2.tgz -C tao-toolkit-api
# uninstall old tao-api
helm ls
helm delete tao-toolkit-api

# re-install tao-api
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
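After the re-install, you can confirm that the chart and its pods came up before retrying the notebook (the pod name below is a placeholder):
$ helm ls
$ kubectl get pods
# if a pod stays in Pending or CrashLoopBackOff, inspect it
$ kubectl describe pod <pod-name>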

I’ve always faced this issue. I’ll still try to follow this and report back

I mean the issue below: “nvidia driver modules are not yet loaded invoking run directly”. Previously you did not hit it and could get the log via “kubectl logs -f xxx”, right?

I hadn’t tried getting the logs before

OK, got it.

I reinstalled it and tried nvidia-smi

How about running:
$ kubectl get pods -A
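For instance, to narrow the output down to the GPU-related pods (a rough sketch; namespaces and pod names vary by setup):
$ kubectl get pods -A | grep -i nvidia
$ kubectl get pods -A | grep -i gpu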

From our discussion in NVIDIA Driver Installation skipped during bare-metal install - #17 by Morganh,
$ kubectl exec nvidia-smi-xxxx -- nvidia-smi is working.

Could you check what has changed? In that topic, you configured it as a single-node cluster.
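For example, you can check how many nodes the cluster sees now and whether the GPU is still advertised on the node (the node name is a placeholder):
$ kubectl get nodes -o wide
$ kubectl describe node <node-name> | grep -i gpu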

After that I tried to install TAO API v5.0, and it hasn't been working since. I still have it configured as a single-node cluster.

If you can guide me through migrating from this to v5.0, I am ready to give it a shot.

Could you please restore the previous environment, since you could run successfully under 4.0.2 before?
You can run the command below to uninstall.
$ bash setup.sh uninstall

Okay.
I have run $ bash setup.sh uninstall, and now there's no kubectl, which is expected.

So now I will try to install TAO API v5.0. Anything else I need to uninstall before starting that?

No. I suggest you install 4.0.2 to check whether it works.
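For reference, a rough sketch of re-running the 4.0.2 bare-metal setup; the path below is an assumption, so adjust it to wherever setup.sh lives in your 4.0.2 quickstart download:
$ cd tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal
$ bash setup.sh install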

Alright
I’ll try v4.0.2 again