AutoML for v4.0.2 with Efficientnet_b1_relu

One odd thing I notice is that check-inventory doesn’t print any logs; it just asks for this path and exits. Initially it asked for everything else and went through the checks.

Alright
nvidia-smi works now, and the HTTPConnectionPool(host='127.0.1.1', port=31951): Max retries exceeded with url: /api/v1/user/4a268f0c-2b02-4fe3-9ca2-b161c0af7231/dataset error also seems to be resolved.

About the main issue here: how can I select a model other than resnet? No matter which model I choose in the pre-trained model mapping, the specs are always loaded with resnet18. I want to use efficientnet_b1_relu.

After running the AutoML pipeline via the notebook and checking the logs of multiple pods with kubectl logs -f, here is where I am:
Job ID: a3f71b72-9f37-422a-b94d-0514a10ec332

  1. I specified efficientnet_b1_relu, but it asks to run training with resnet_34.
  2. The logs go to tmp directories
  3. My screen is still stuck on the same output.

Can you tell me how to run the AutoML pipeline with whichever model I want?

Any update on this?

Sorry for the late reply. Did you ever check the training spec?

Yes, I have to explicitly edit it to set efficientnet_b1_relu, like so.
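(Roughly along these lines; this is an illustrative sketch rather than the exact notebook cell, and train_specs is a placeholder name for wherever the notebook keeps the loaded spec dict.)

```python
# Illustrative only: train_specs stands in for the training-spec dict the
# notebook loads from the API; it always comes back pre-filled for resnet18.
train_specs = {"model_config": {"arch": "resnet18"}}  # placeholder for the real loaded spec

# Override the backbone by hand before the spec is saved/posted back.
train_specs["model_config"]["arch"] = "efficientnet_b1_relu"  # this arch string gets corrected to "efficientnet_b1" later in the thread
```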

This seems like a brute-force way to do it; is there no way to load the specs for the ptm_id model directly?

Also, after this, when I run the AutoML experiment it stays stuck on the same screen, as I said, and no progress is shown in the monitoring block.

Can you check the log mentioned below?

I’m sorry, but I have no idea what that path is. It’s not on my system, and I’m not sure where it comes from in the AutoML pipeline.

Could you please try to find that log file or folder?

Maybe in /mnt/nfs_server/

Got it
So the arch parameter should be set to 'efficientnet_b1'.
I’ll try that.

Is there any way to load the spec file for efficientnet_b1 other than manually overriding the one loaded for resnet?

Good catch. It matches Image Classification (TF1) - NVIDIA Docs

For TAO-API, the backbone is set from specs["model_config"]["arch"].
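In the requests-based API notebook, the flow for that looks roughly as below. Treat it as a sketch: base_url and model_id are placeholders for values defined earlier in the notebook, and the endpoint paths (which follow the /api/v1/user/<user_id>/... pattern visible in the error message above) should be checked against the actual notebook cells.

```python
import requests

# Placeholders: these are defined earlier in the real notebook.
base_url = "http://<host>:<port>/api/v1/user/<user_id>"
model_id = "<model_id>"  # handle of the model created for this experiment

# 1. Fetch the default training spec for this model (it comes back set up for resnet18).
specs = requests.get(f"{base_url}/model/{model_id}/specs/train/schema").json()["default"]

# 2. Override the backbone; this is the field TAO-API reads to pick the architecture.
specs["model_config"]["arch"] = "efficientnet_b1"

# 3. Save the edited spec back so the (AutoML) train action picks it up.
requests.post(f"{base_url}/model/{model_id}/specs/train", json=specs)
```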

How do I stop a job that I have submitted? I tried the technique mentioned in the notebook, but logs.txt shows that it is still executing.

You can click the “Stop” button to stop the running cell.
If that does not work, you can also run $kubectl delete pod <pod-name> to delete the running pod (usually the latest pod shows at the top of $kubectl get pods).

Thank you
And if I want to try classification_tf2, I will have to install the TAO 5.0 API, right? With v4.0.2, only classification_tf1 is possible, I reckon?

Yes, to run classification_tf2 with TAO-API, you need to use the TAO 5.0 API.
