Exception: TAO4 AutoML with PeopleNet. Round 2

Please provide the following information when requesting support.

• Hardware: RTXA6000ADA
• Network Type : Detectnet_v2
• TLT Version: 4.0.2.api

Reopened topic: Exception: TAO4 AutoML with PeopleNet

I thought this issue was finally solved in release 4.0.2, but it has appeared again.

$ kubectl logs -n gpu-operator tao-toolkit-api-workflow-pod-78848b8764-zc9gx
NGC CLI 3.19.0
AutoML pipeline
Exception in thread Thread-2 (AutoMLPipeline):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/api/handlers/actions.py", line 810, in AutoMLPipeline
    complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
                                                              ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Same situation as before. I'm using the datasets and the train.json file that were generated to train with the API client without AutoML, which gave successful and very satisfying results.

Attaching metadata.json (the sample notebook is missing the parts needed to generate this file):

{
  "id": "526f8699-5fbb-47db-ad35-3632acf42152",
  "created_on": "2023-05-31T15:45:06.650363",
  "last_modified": "2023-05-31T15:45:06.650381",
  "name": "My Model",
  "description": "My TAO Model",
  "version": "1.0.0",
  "logo": "https://www.nvidia.com",
  "ngc_path": "",
  "encryption_key": "tlt_encode",
  "read_only": false,
  "public": false,
  "network_arch": "detectnet_v2",
  "dataset_type": "object_detection",
  "actions": [
    "train",
    "evaluate",
    "prune",
    "retrain",
    "export",
    "convert",
    "inference"
  ],
  "train_datasets": [
    "36410922-0967-4b36-be79-2f3aa859c6bc"
  ],
  "eval_dataset": "5c71ff48-f958-4fcb-a5c6-d6d5cd010990",
  "inference_dataset": null,
  "additional_id_info": null,
  "calibration_dataset": null,
  "ptm": "00e8bc75-c346-489d-ac31-e6f0e30389db",
  "automl_enabled": true,
  "automl_algorithm": "HyperBand",
  "metric": "map",
  "automl_add_hyperparameters": "[]",
  "automl_remove_hyperparameters": "[]",
  "automl_nu": 3,
  "automl_R": 27,
  "epoch_multiplier": 10
}

The spec files work properly in a normal API client training.

Digging a little deeper: when the API generates the files to start the training, it creates a folder for the step and places inside it a text file that merges all the JSON files into the true "spec file".
Well, I noticed that when using AutoML the parameter:

model_config {
  pretrained_model_file: "/shared/users/00000000-0000-0000-0000-000000000000/models/00e8bc75-c346-489d-ac31-e6f0e30389db/peoplenet_vtrainable_v2.6/resnet34_peoplenet.tlt"

is not inserted automatically!

The datasets are correctly attached, but the pretrained network is not. Maybe this gives you a clue to find the part of the code responsible for that.
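
If it helps to confirm, a quick scan of the generated spec text for the missing key could look like the sketch below (the path is illustrative; point it at the merged spec file inside the job folder):

# Rough sketch: check whether the merged spec the API generated
# actually contains a pretrained_model_file entry.
# The path below is illustrative; replace it with the spec file in your job folder.
spec_path = "/path/to/job_folder/merged_spec.txt"

with open(spec_path, "r") as f:
    spec_text = f.read()

if "pretrained_model_file" in spec_text:
    print("pretrained_model_file is present in the generated spec")
else:
    print("pretrained_model_file is MISSING from the generated spec")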

Thanks in advance.
I hope that one day I can use these tools to finish my work!

To double-check, or as a workaround, could you log in to the workflow pod, modify the code, and check if it works?

kubectl exec -it tao-toolkit-api-workflow-pod-xxxxxxxx-yyyy -- /bin/bash

Then
$ vim /opt/api/handlers/actions.py
complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file

Save.

Then check again.
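
For reference, a slightly more defensive variant of that line would guard against the empty-list case instead of indexing blindly (a sketch only, not the shipped code; ptm_id is a variable defined earlier in actions.py):

# Sketch only: fail with a clear message when no weight file was found,
# instead of raising IndexError on an empty list.
if pretrained_model_file:
    complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
else:
    raise RuntimeError(f"No pretrained weight file found for PTM id {ptm_id}")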

Same result :|

AutoML pipeline
Exception in thread Thread-5 (AutoMLPipeline):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/api/handlers/actions.py", line 810, in AutoMLPipeline
    complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file
                                                              ^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

Inside the job folder I can find these files (screenshot attached):
Inside recommendation_0.kitti there is a complete train spec, but pretrained_model_file is not set!

So, there is no issue when using a normal API client notebook,
but there is an issue when using the AutoML notebook, right?

Exactly; I showed you the results in the other topic.

EDIT and remark: this is aside from all the other problems found when the API tries to automatically generate the spec files, shown in previous forum topics,
such as: 'KeyError': TAO4 AutoML with PeopleNet

Could you share this spec file?


Attached in a PM

You already have the best model and also the corresponding spec file. In your case, automl_recommendation_0.kitti is the spec file. You can use it and the existing best model to run evaluation, etc.

Referring to https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/, you can plug the model and spec file into the end-to-end notebook (TAO API Starter Kit/Notebooks/client/end2end/detectnet_v2.ipynb).

I don't understand where you want to go with this.

Focus on the main problem: the training dies at this point… it generates the files and then the next step breaks the training.
I have attached the train spec of the first of the supposed AutoML trainings to show you the information that the "API" code is not inserting into it.

Search inside recommendation_0.kitti for the parameter pretrained_model_file… if you can.

So I have a long, hard, and tedious process ahead before I can reach and select my best model, and before I can run an evaluation on it!

Could you please share the latest .ipynb file?
And could you please try to set

specs["model_config"]["pretrained_model_file"] =  your_pretrained_model_file_path

when you run the cell below?
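
For example, something along these lines before submitting the train specs (a sketch; the path is the PTM path quoted earlier in this thread, so adjust it to your own user and model IDs):

# Sketch: set the pretrained weights explicitly in the train specs dict
# before posting them to the API. Path taken from the spec quoted above.
specs["model_config"]["pretrained_model_file"] = (
    "/shared/users/00000000-0000-0000-0000-000000000000/models/"
    "00e8bc75-c346-489d-ac31-e6f0e30389db/peoplenet_vtrainable_v2.6/"
    "resnet34_peoplenet.tlt"
)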

Sadly, we are at the same point.
I manually added specs["model_config"]["pretrained_model_file"] = your_pretrained_model_file_path to the train specs.

With this modification, this is the result from the pod:

AutoML pipeline
Exception in thread Thread-7 (AutoMLPipeline):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/api/handlers/actions.py", line 810, in AutoMLPipeline
    complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file
                                                              ^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

And without the modification:

AutoML pipeline
Exception in thread Thread-9 (AutoMLPipeline):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/api/handlers/actions.py", line 810, in AutoMLPipeline
    complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
                                                              ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Honestly, I'm exhausted from wasting time debugging unknown problems in the code.

I'm still reproducing this error on my side. I will update you once I have more information. Thanks.

I can reproduce the issue with nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api.

Solution:
Inside the docker container below,
$ docker run --runtime=nvidia -it nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api /bin/bash
root@538530c4dbc4:/opt/api# apt-get -y install vim

Modify the first piece of code:
root@538530c4dbc4:/opt/api# vim handlers/actions.py

Go to line 779 and change the code below
From

    if find_trained_weight == []:
        if not ptm_id == "":
            model_dir = f"/shared/users/00000000-0000-0000-0000-000000000000/models/{ptm_id}"
            if job_context.network == "lprnet":
                pretrained_model_file = glob.glob(model_dir+"/*/*.tlt")
            else:
                pretrained_model_file = glob.glob(model_dir+"/*/*.hdf5")
    else:
        find_trained_weight.sort(reverse=False)
        trained_weight = find_trained_weight[0]

to


    if find_trained_weight == []:
        if not ptm_id == "":
            model_dir = f"/shared/users/00000000-0000-0000-0000-000000000000/models/{ptm_id}"
            pretrained_model_file = []
            pretrained_model_file = glob.glob(model_dir+"/*/*.hdf5") + glob.glob(model_dir+"/*/*.tlt")
            if len(pretrained_model_file) > 1:
                pretrained_model_file = pretrained_model_file[0]
               
            assert pretrained_model_file != [], "error pretrained_model_file"
    else:
        find_trained_weight.sort(reverse=False)
        trained_weight = find_trained_weight[0]
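
In effect, the patched lookup accepts either .hdf5 or .tlt weights one level below the PTM folder, which covers the PeopleNet .tlt case. A minimal standalone illustration of that glob behaviour (using the ptm_id from this thread):

# Minimal illustration of the patched lookup: match .hdf5 or .tlt weights
# one directory below the PTM folder. The ptm_id is the one from this thread.
import glob

ptm_id = "00e8bc75-c346-489d-ac31-e6f0e30389db"
model_dir = f"/shared/users/00000000-0000-0000-0000-000000000000/models/{ptm_id}"
candidates = glob.glob(model_dir + "/*/*.hdf5") + glob.glob(model_dir + "/*/*.tlt")
print(candidates)
# Expected to contain e.g. .../peoplenet_vtrainable_v2.6/resnet34_peoplenet.tlt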

Modify the second piece of code:
root@538530c4dbc4:/opt/api# vim handlers/docker_images.py

#in line 23, replace the docker image name

From

"api": os.getenv('IMAGE_API', default='nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api')

To

"api": os.getenv('IMAGE_API', default='nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api-fix')

Then, use docker commit to generate a new docker image.
Please open a new terminal, then check the current container id.

$ docker ps
CONTAINER ID   IMAGE
538530c4dbc4   nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api

In my example, the container id is 538530c4dbc4.
Then, generate the new docker image.

$ docker commit 538530c4dbc4 nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api-fix

Helm install using the new docker image:

$ helm ls

#Delete tao-toolkit-api

$ helm delete tao-toolkit-api

#Get the tao-toolkit-api chart

$ helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz --username='$oauthtoken' --password=`<NGC key>`

$ mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.2.tgz -C tao-toolkit-api

$ cd tao-toolkit-api/

$ vi tao-toolkit-api/values.yaml
#in line 2
From

image: nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api

To

image: nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api-fix

#in line 4
From

imagePullPolicy: Always

To

imagePullPolicy: IfNotPresent

#Helm install tao-toolkit-api
$ helm install tao-toolkit-api tao-toolkit-api/ --namespace default

If you are using containerd as the k8s container runtime, you need to import the docker image you created, as below:
$ docker save -o tao-api.tar nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api-fix
$ sudo ctr -n=k8s.io image import tao-api.tar


Thank you, Morgan!
Sorry for the delay; I was having trouble with the Kubernetes deployment.

I'll ping you when I can test it!

Hi again. I don't know where my Kubernetes setup broke…

  Normal   Scheduled  6m39s                  default-scheduler  Successfully assigned gpu-operator/tao-toolkit-api-app-pod-5d548f545-bxq6d to azken
  Normal   Pulled     4m59s (x5 over 6m39s)  kubelet            Container image "nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api-fix" already present on machine
  Normal   Created    4m59s (x5 over 6m39s)  kubelet            Created container tao-toolkit-api-app
  Normal   Started    4m59s (x5 over 6m39s)  kubelet            Started container tao-toolkit-api-app
  Warning  BackOff    85s (x29 over 6m37s)   kubelet            Back-off restarting failed container tao-toolkit-api-app in pod tao-toolkit-api-app-pod-5d548f545-bxq6d_gpu-operator(e0841a14-71e3-4831-8dc6-282cdb09ae01)

After reinstalling everything, I get stuck here… I'm using the containerd method.
I'll try to run through the process a second time…

Do you have a log for this failure?
Also, could you run $ kubectl logs -f the_failed_pod?

No log is generated. The only information obtained was from the "describe" command.

I don't know where else to look.

I'm using containerd, as NVIDIA suggests, to deploy the Kubernetes environment and launch the GPU Operator.

So now I'm reviewing and starting again. I deleted the tao-api-fix containerd image
and tried to import it again with the command:

sudo ctr -n=k8s.io image import tao-api.tar

Question: does it need a different namespace to work? Or might the extraction or import method need to be different? I'm not an expert in this area.

This time, the size of the image was larger than before.

nvcr.io/nvidia/tao/tao-toolkit                                4.0.2-api-fix             71452cf26f3a5       756MB

But when I try to enter the bash, it gives me the following message:

$ sudo crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                              ATTEMPT             POD ID
a1e176f9b0f01       71452cf26f3a5       43 seconds ago      Exited              tao-toolkit-api-app               15                  1c3409456916b

$   sudo crictl exec -it a1e176f9b0f01 /bin/sh
FATA[0000] execing command in container: rpc error: code = Unknown desc = container is in CONTAINER_EXITED state 

$ sudo crictl logs -p a1e176f9b0f01
FATA[0000] failed to try resolving symlinks in path "/var/log/pods/gpu-operator_tao-toolkit-api-app-pod-5d548f545-vdkks_46c766ba-06a5-4425-92de-09deeed13fa4/tao-toolkit-api-app/47.log": lstat /var/log/pods/gpu-operator_tao-toolkit-api-app-pod-5d548f545-vdkks_46c766ba-06a5-4425-92de-09deeed13fa4/tao-toolkit-api-app/47.log: no such file or directory 

sudo crictl inspect a1e176f9b0f01
{
  "status": {
    "id": "b973ac898f602ad5e78cb25df8e2e39f80a9f200ccf2c4652ac63c5176b1b9a7",
    "metadata": {
      "attempt": 51,
      "name": "tao-toolkit-api-app"
    },
    "state": "CONTAINER_EXITED",
    "createdAt": "2023-06-13T12:54:48.474679225+02:00",
    "startedAt": "2023-06-13T12:54:48.573280963+02:00",
    "finishedAt": "2023-06-13T12:54:48.574673151+02:00",
    "exitCode": 0,
    "image": {
      "annotations": {},
      "image": "nvcr.io/nvidia/tao/tao-toolkit:4.0.2-api-fix"
    },
    "imageRef": "sha256:71452cf26f3a5088d32b96042f0af2e7efbe3579fdca855e130a00f5d1b5df7e",
    "reason": "Completed",
    "message": "",
    "labels": {
      "io.kubernetes.container.name": "tao-toolkit-api-app",
      "io.kubernetes.pod.name": "tao-toolkit-api-app-pod-5d548f545-vdkks",
      "io.kubernetes.pod.namespace": "gpu-operator",
      "io.kubernetes.pod.uid": "46c766ba-06a5-4425-92de-09deeed13fa4"
    },

Searching in the path /var/log/pods/gpu-operator_tao-toolkit-api-app-pod-5d548f545-vdkks_46c766ba-06a5-4425-92de-09deeed13fa4/tao-toolkit-api-app/, I can't find 47.log, but there is an empty 48.log.

I don't know where I can find more info, sorry. This number of issues is draining my energy and my working time…

I will double-check this workaround.
Also, a new TAO version will be released soon; then you can use the official release.

I hope so!
Do you know the date? I'm accumulating an excessive delay :(

Sorry, there is no exact date yet. Please stay tuned.