Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) RTX4090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) TAO4 Baremetal
• Training spec file(If have, please share here) /notebooks/tao_api_starter_kit/client/automl/object_detection.ipynb
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
According to the last TAO conference, the AutoML process can be run with the PeopleNet network.
I loaded the custom dataset and selected the pretrained network based on PeopleNet (detectnet_v2):
pretrained_map = {"detectnet_v2" : "peoplenet:trainable_v2.6"}
I launched the AutoML process and nothing happened.
No logs were generated and there was no compute activity.
I found this error in the container:
kubectl logs -n gpu-operator tao-toolkit-api-workflow-pod-859fd5f7cc-mmpjr
AutoML pipeline
Exception in thread Thread-4 (AutoMLPipeline):
Traceback (most recent call last):
File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.11/threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "/opt/api/handlers/actions.py", line 809, in AutoMLPipeline
complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
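The IndexError above means that `pretrained_model_file` was an empty list when `[0]` was applied, so no pretrained file had been resolved at that point. I have not seen the rest of `actions.py`, so the following is only a minimal sketch of how such a lookup could fail with a clearer message. The helper name `first_pretrained_file` and its arguments are hypothetical and not part of the TAO API.

```python
def first_pretrained_file(candidates, model_name="<unknown>"):
    """Return the first resolved pretrained model path.

    `candidates` stands in for the list that the traceback indexes with [0].
    This helper is hypothetical; it only illustrates the failure mode.
    """
    if not candidates:
        # An empty list is what produces "IndexError: list index out of range".
        raise FileNotFoundError(
            f"No pretrained model file was resolved for {model_name!r}"
        )
    return candidates[0]


if __name__ == "__main__":
    print(first_pretrained_file(["resnet34_peoplenet.tlt"], "peoplenet"))
    try:
        first_pretrained_file([], "peoplenet")
    except FileNotFoundError as err:
        print(err)
```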
Training works as expected in normal (non-AutoML) mode with the notebook:
/notebooks/tao_launcher_starter_kit/detectnet_v2/detectnet_v2.ipynb
The [“pretrained_model_file”] parameter is set automatically.
I checked that the referenced pretrained model exists in the “model” folder:
shared/users/00000000-0000-0000-0000-000000000000/models/73109fc5-deb3-4cf4-b0c1-13365b315884/peoplenet_vtrainable_v2.6/resnet34_peoplenet.tlt
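A check like this can also be scripted. The sketch below walks a model directory and lists files by extension. The function name `find_pretrained_files` and the default extension tuple are my assumptions. I do not know which extensions the AutoML handler actually searches for, so adjust them as needed.

```python
from pathlib import Path


def find_pretrained_files(model_dir, extensions=(".tlt",)):
    """Recursively list files under model_dir whose suffix is in extensions.

    The default extensions are an assumption, not taken from the TAO API code.
    """
    root = Path(model_dir)
    if not root.is_dir():
        # A missing directory yields an empty list, like a failed lookup.
        return []
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix in extensions
    )


if __name__ == "__main__":
    import sys

    # Pass the model folder as the first argument, e.g. the path shown above.
    for found in find_pretrained_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(found)
```

An empty result for the model folder would be consistent with the empty list in the traceback, although it would not prove that the handler searches the same way.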
I also tried setting this parameter manually, with the same result.
The default notebook runs correctly, but the results with a resnet18 are totally useless.
So, do you mean the default notebook runs correctly, but fails when you switch to your own dataset and a new pretrained model?
Do you mean that I should keep the original pretrained network name (“detectnet_v2” : “detectnet_v2:resnet18”) and substitute the “.etlt” file with the PeopleNet pretrained network?
I just want to confirm before testing.
Thanks for the help.
With the modification, I still see the same behaviour.
AutoML pipeline
Exception in thread Thread-3 (AutoMLPipeline):
Traceback (most recent call last):
File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.11/threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "/opt/api/handlers/actions.py", line 809, in AutoMLPipeline
complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Also, to confirm: is the file to modify in “tao-toolkit-api-workflow” or in “tao-toolkit-api-app”? For now I have modified both.
Now I will try substituting the pretrained network.
Exception in thread Thread-3 (AutoMLPipeline):
Traceback (most recent call last):
File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.11/threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "/opt/api/handlers/actions.py", line 809, in AutoMLPipeline
complete_specs["model_config"]["pretrained_model_file"] = pretrained_model_file[0]
~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
I don’t know why, but the message has now changed:
AutoML pipeline
Exception in thread Thread-10 (AutoMLPipeline):
Traceback (most recent call last):
File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.11/threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "/opt/api/handlers/actions.py", line 809, in AutoMLPipeline
complete_specs["training_config"]["checkpoint_interval"] = int(epoch_multiplier*current_ri)
^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range
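One possible reading of this last traceback, which is not confirmed anywhere in this thread: the source line shown for line 809 does not index a list, yet the error is still an IndexError. Python prints traceback source lines by reading the file on disk at the time the error is formatted. If `actions.py` was edited after the running process had already imported it, the old code still runs but the traceback shows whatever is now at that line number. The sketch below reproduces that effect with a throwaway module. All names in it (`handler_demo`, `pick`, `stale_traceback_demo`) are made up for the demonstration.

```python
import importlib.util
import pathlib
import tempfile
import traceback


def stale_traceback_demo():
    """Import a module, edit its file on disk, then trigger the old code."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "handler_demo.py"
        # Original source: line 2 indexes a list, like the earlier tracebacks.
        path.write_text("def pick(files):\n    return files[0]\n")

        spec = importlib.util.spec_from_file_location("handler_demo", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # Edit the file after import; the loaded code object is unchanged.
        path.write_text("def pick(files):\n    interval = int(2 * 3)\n")

        try:
            module.pick([])  # still runs the old `files[0]`
        except IndexError:
            # Format while the file still exists so the line can be read.
            return traceback.format_exc()
    return ""


if __name__ == "__main__":
    print(stale_traceback_demo())
```

If this is what happened here, restarting the pod so that the edited file is actually loaded should make the traceback and the running code agree again.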