Exception: TAO4 AutoML with PeopleNet

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) RTX4090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) TAO4 Baremetal
• Training spec file(If have, please share here) /notebooks/tao_api_starter_kit/client/automl/object_detection.ipynb
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

According to the last TAO conference, the process can be executed using the PeopleNet network.

I load the custom dataset and select the pretrained network based on PeopleNet (detectnet_v2):
pretrained_map = {"detectnet_v2" : "peoplenet:trainable_v2.6"}
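
For context, this is roughly how that suffix gets matched to a pretrained-model entry before the AutoML train job is created (a minimal sketch with illustrative metadata, not the notebook's exact code):

# Illustrative only: match the suffix from pretrained_map against the
# pretrained models listed by the TAO API for this architecture.
network_arch = "detectnet_v2"
pretrained_map = {"detectnet_v2": "peoplenet:trainable_v2.6"}
listed_pretrained_models = [
    # example entry shaped like the metadata the API returns (values are illustrative)
    {"id": "73109fc5-deb3-4cf4-b0c1-13365b315884",
     "network_arch": "detectnet_v2",
     "ngc_path": "nvidia/tao/peoplenet:trainable_v2.6"},
]
ptm_id = None
for ptm in listed_pretrained_models:
    if ptm.get("network_arch") == network_arch and ptm.get("ngc_path", "").endswith(pretrained_map[network_arch]):
        ptm_id = ptm["id"]
print("Selected PTM id:", ptm_id)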

I launch the AutoML process and nothing happens.

No logs are generated and no compute activity is visible.

Error found in the container:
kubectl logs -n gpu-operator tao-toolkit-api-workflow-pod-859fd5f7cc-mmpjr

AutoML pipeline
Exception in thread Thread-4 (AutoMLPipeline):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/api/handlers/actions.py", line 809, in AutoMLPipeline
    complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
                                                              ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

The training process works as expected in normal mode with the notebook:
/notebooks/tao_launcher_starter_kit/detectnet_v2/detectnet_v2.ipynb

Best regards.
Alejandro.

Please check the specs.
Also, can you run successfully with the default notebook?

OK, but what should I check?

The ["pretrained_model_file"] parameter is set automatically.
I checked in the "models" folder that the referenced pretrained model exists:
shared/users/00000000-0000-0000-0000-000000000000/models/73109fc5-deb3-4cf4-b0c1-13365b315884/peoplenet_vtrainable_v2.6/resnet34_peoplenet.tlt
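
For reference, this is roughly how I verify what that folder actually contains on the shared volume (a quick sketch; the root path below assumes the PVC is mounted at "shared", adjust as needed):

from pathlib import Path

# Assumed mount point of the shared volume; replace with the real path on your node.
model_dir = Path("shared/users/00000000-0000-0000-0000-000000000000/models/73109fc5-deb3-4cf4-b0c1-13365b315884")
for f in sorted(model_dir.rglob("*")):
    if f.is_file():
        print(f)  # expected to include peoplenet_vtrainable_v2.6/resnet34_peoplenet.tlt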

I tried forcing this parameter manually, with the same result.

The default notebook can be executed correctly, but the results with resnet18 are not useful for our case.

So, do you mean the default notebook can be executed correctly, but it cannot be executed correctly when you change to your own dataset and a new pretrained model?

Yes, with resnet18 I don't have problems.

This is an extract of the iteration:
{
  "id": "7a6147ff-7f38-4e13-bfc1-52aa544639e2",
  "parent_id": null,
  "action": "train",
  "created_on": "2023-02-09T12:14:16.550859",
  "last_modified": "2023-02-10T08:51:11.604078",
  "status": "Done",
  "result": {}
}
{"date": "2/10/2023", "time": "6:43:58", "status": "SUCCESS", "verbosity": "INFO", "message": "DetectNet_v2 training job complete.", "categorical": {"average_precision": {"bag": 58.4226, "face": 75.9628, "person": 86.9157}}, "graphical": {"validation cost": 4.406e-05, "mean average precision": 60.6377}, "kpi": {"size": 42.96617889404297, "param_count": 11.210718}}

Can you try to use the new pretrained model but not change the default name?
That is, replace the file, but do not change its original name.

You mean, keep the original pretrained network name ("detectnet_v2" : "detectnet_v2:resnet18") and substitute the ".etlt" file with the PeopleNet pretrained network?

Just asking to confirm before I test it.
Thanks for the help.

Yes, please have a try.

It is not a .etlt file. Please use the .tlt file.

That means, replace the default .hdf5 file with the NGC PeopleNet .tlt file.
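
A possible way to do that substitution (a sketch only; the default PTM folder id is a placeholder, and the lookup assumes the resnet18 weights are the only .hdf5 in that folder):

import shutil
from pathlib import Path

# Placeholder id: the folder of the default detectnet_v2:resnet18 pretrained model.
default_ptm_dir = Path("shared/users/00000000-0000-0000-0000-000000000000/models/<default_ptm_id>")
default_file = next(default_ptm_dir.rglob("*.hdf5"))                      # the file the pipeline currently picks up
peoplenet_tlt = Path("peoplenet_vtrainable_v2.6/resnet34_peoplenet.tlt")  # the NGC PeopleNet weights
shutil.copyfile(peoplenet_tlt, default_file)                              # swap the contents, keep the original name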


Hi,
As a workaround, please change /opt/api/handlers/actions.py line 782 as below.

Change
if job_context.network == "lprnet":

to
if job_context.network in ["lprnet", "detectnet_v2"]:

Thank you,
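
For readers following along, the intent of the change appears to be that detectnet_v2 should take the same pretrained-file lookup path as lprnet, roughly along these lines (a hedged reconstruction inferred from the traceback and the .tlt/.hdf5 remarks above, not the actual contents of actions.py):

import glob

def find_pretrained_file(ptm_root, network):
    # Sketch: networks in this list are assumed to ship .tlt weights, the others .hdf5.
    if network in ["lprnet", "detectnet_v2"]:
        return glob.glob(f"{ptm_root}/**/*.tlt", recursive=True)
    return glob.glob(f"{ptm_root}/**/*.hdf5", recursive=True)

# Without "detectnet_v2" in the list, the .hdf5 glob over the PeopleNet folder
# returns an empty list, and indexing [0] raises the IndexError shown earlier.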

I will ping you as soon as possible once I have tested it.

With the modification I still see the same behaviour.

AutoML pipeline
Exception in thread Thread-3 (AutoMLPipeline):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/api/handlers/actions.py", line 809, in AutoMLPipeline
    complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
                                                              ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Also, to confirm: should the file be modified in the "tao-toolkit-api-workflow" pod or in the "tao-toolkit-api-app" pod? For now I have modified both.

Now I will try to substitute the pretrained network.

Hi again,

I tried this method and got the same result.

Exception in thread Thread-3 (AutoMLPipeline):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/api/handlers/actions.py", line 809, in AutoMLPipeline
    complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
                                                              ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

I don't know why the message has now changed:

AutoML pipeline
Exception in thread Thread-10 (AutoMLPipeline):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/api/handlers/actions.py", line 809, in AutoMLPipeline
    complete_specs["training_config"]["checkpoint_interval"] = int(epoch_multiplier*current_ri)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

We will completely fix the issue in next TAO release.


Thanks for the answer.
When will the next release be out?

It is about 2 months away.

Can you share a hotfix so we can continue working?

Syncing with the internal team. I will update you if I have something.

Any news?

A new container is needed. Thanks for your patience.

Thank you!

Please ping me when it can be tested!