AutoML for v4.0.2 with Efficientnet_b1_relu

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Classification, efficientnet_b1_relu
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): v4.0.2
• Training spec file (if available, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I am working with the AutoML notebook notebooks/tao-getting-started_v4.0.2/notebooks/tao_api_starter_kit/api/automl/classification.ipynb

I want to select efficientnet_b1_relu for my experiments, and I specify that in the Assign PTM section.

But when I load the training specs, they still contain the ResNet config.

Even if I manually change the specs accordingly, when I run the training action the monitoring page does refresh, but it keeps saying it will update after the first epoch; it takes a long time and never updates. Is there any way I can access the logs to figure out what's going wrong?

You can check the logs via
$ kubectl get pods
$ kubectl logs -f workflow-pod-xxxxx-xxxxx

Then find the command at the end of that output, locate the log path it references, and check that log.
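For example, a rough sketch (the pod names and the log path are placeholders, and the exact wording at the end of the workflow log depends on your job):
$ kubectl get pods | grep workflow
# the end of the workflow pod output shows the command that was run; find the log path it references
$ kubectl logs workflow-pod-xxxxx-xxxxx | tail -n 50
# then follow that log, e.g. from inside the pod that mounts the shared workspace (pod name and path are placeholders)
$ kubectl exec -it tao-toolkit-api-app-pod-xxxxx -- tail -f <log path from above>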

Which version of TAO API did you install?

I have v4.0.2.
I couldn't migrate to v5.0, as mentioned in Login issue JWT on TAO-API with Jupyter - #14 by amogh.dabholkar

I believe you did not run into this issue previously, right?

Can you re-install TAO-API with helm?

Please run the following commands.

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.2.tgz -C tao-toolkit-api
# uninstall old tao-api
helm ls
helm delete tao-toolkit-api

# re-install tao-api
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
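After the re-install, you can confirm that the chart and its pods came up before retrying the notebook (the pod name below is a placeholder):
$ helm ls
$ kubectl get pods
# if a pod stays in Pending or CrashLoopBackOff, inspect it
$ kubectl describe pod <pod-name>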

I’ve always faced this issue. I’ll still try to follow this and report back

I mean the issue below: “nvidia driver modules are not yet loaded invoking run directly”. Previously you did not hit it and could get the log via “kubectl logs -f xxx”, right?

I hadn’t tried getting the logs before

OK, got it.

I reinstalled it and tried nvidia-smi

How about running:
$ kubectl get pods -A
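For instance, to narrow the output down to the GPU-related pods (a rough sketch; namespaces and pod names vary by setup):
$ kubectl get pods -A | grep -i nvidia
$ kubectl get pods -A | grep -i gpu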

From our discussion in NVIDIA Driver Installation skipped during bare-metal install - #17 by Morganh,
$ kubectl exec nvidia-smi-xxxx -- nvidia-smi is working.

Could you check what has changed? In that topic, you configured it as a single-node cluster.
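For example, you can check how many nodes the cluster sees now and whether the GPU is still advertised on the node (the node name is a placeholder):
$ kubectl get nodes -o wide
$ kubectl describe node <node-name> | grep -i gpu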

After that I tried to install TAO API v5.0, and it hasn't been working since. I still have it configured as a single-node cluster.

If you can guide me through migrating from this to v5.0, I am ready to give it a shot.

Could you please restore the previous environment, since you could run successfully under 4.0.2 before?
You can run the command below to uninstall.
$ bash setup.sh uninstall

Okay.
I have run $ bash setup.sh uninstall, and now there's no kubectl, which is expected.

So now I will try to install TAO API v5.0. Anything else I need to uninstall before starting that?

No. I suggest you install 4.0.2 to check whether it works.
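For reference, a rough sketch of re-running the 4.0.2 bare-metal setup; the path below is an assumption, so adjust it to wherever setup.sh lives in your 4.0.2 quickstart download:
$ cd tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal
$ bash setup.sh install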

Alright
I’ll try v4.0.2 again