Tao-converter [ERROR] Failed to parse the model, please check the encoding key to make sure its correct

full_log.docx (39.4 KB)

I have never tried the public KITTI dataset mentioned in the notebook.
To recap:
My real-world use case is not an autonomous vehicle.

In my project I have successfully trained a model that tests against my hold-back test set with 93.6% mAP and, more importantly, it is 100% accurate at object detection of the single object in my use case.

I have also successfully exported it and the only problem has been in turning it into a .engine file via the tao-converter.
I really do not want to import the public KITTI dataset and spend time on that unless it is absolutely vital.

Please advise

According to the above training log, the training is not running successfully.

2023-07-03 17:23:37,264 [INFO] tensorflow: Graph was finalized.
2023-07-03 17:23:37,288 [INFO] root: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py>", line 3, in <module>
  File "<frozen iva.detectnet_v2.scripts.train>", line 1032, in <module>
  File "<frozen iva.detectnet_v2.scripts.train>", line 1011, in <module>
  File "<decorator-gen-117>", line 2, in main
  File "<frozen iva.detectnet_v2.utilities.timer>", line 46, in wrapped_fn
  File "<frozen iva.detectnet_v2.scripts.train>", line 994, in main
  File "<frozen iva.detectnet_v2.scripts.train>", line 853, in run_experiment
  File "<frozen iva.detectnet_v2.scripts.train>", line 728, in train_gridbox
  File "<frozen iva.detectnet_v2.scripts.train>", line 197, in run_training_loop
  File "<frozen iva.detectnet_v2.training.utilities>", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'IsVariableInitialized_1035:0' shape=() dtype=bool>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "<frozen iva.detectnet_v2.training.utilities>", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "<frozen moduluspy.modulus.hooks.hooks>", line 285, in begin
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/tf_should_use.py", line 198, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))

But you mentioned earlier that “I have successfully trained a model that tests against my hold-back test set with 93.6% mAP”.

Am I missing something?

Also, what CPU are you using?
And could you share the output of $ nvidia-smi as well?

I have been working on this real-world use case for some time, having gathered unique data of a very high quality and having created a database.
I initially trained a binary classifier on the datasets, but was disappointed with the results it achieved.
When I switched to detectnet_v2 the results improved dramatically.
You advised that an issue I was having at a later point might be resolved by changing to a single object detector (from 2 objects) and reducing the image size.
The single-object detector achieved good results, and I ran a further 20 experiments fine-tuning the hyper-parameters. It turned out that a larger image size (half the capture size) provided the best results. The highest mAP (95.2%) was achieved with a smaller image, but when I ran the 'Visualize Inferences' section it was clear that mAP is not a great metric for my use case, because the real-world accuracy was lower. The best result, 100% accuracy with around 15% false positives, came from a model with 93.6243% mAP.
I then exported the model as an .etlt and installed it on the Nano, where I used tao-converter, but it failed to convert the .etlt into a .engine for deployment.
You advised me to run a one epoch train experiment to then export and convert.
This proved problematic within the Jupyter Notebook, which threw a series of KEY-related errors, so you advised me to repeat the experiment outside the notebook from the command line (bash).
This is where the current failed training is being carried out.
The most recent part of this narrative, since the tao-converter failed, is chronicled in this thread.

The CPU is an AMD Ryzen 9, 3.30 GHz.


Please advise.

So, there are two remaining issues here.
One is in your training log,
“2023-07-03 17:23:37,288 [INFO] root: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.”.

Please make sure you log in to the docker correctly.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Also, please update the nvidia-driver to 525.
In the nvidia-smi result you shared, the driver version info is missing. Please share it again.
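As a quick sanity check, the host driver version can be queried directly before re-running training (this assumes nvidia-smi is on the PATH of the host machine):

```shell
# Query only the driver version on the host (no container needed).
# TAO Toolkit 4.0.1 images expect a 525-series driver; an older driver
# is one common cause of the "PTX was compiled with an unsupported
# toolchain" error, because the driver's JIT is older than the CUDA
# toolkit inside the container.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```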

The other is the export error. Can you double-check it again?

I have rebooted and am running docker using your arguments.
I will report back in due course.

OK. Thanks for the info.

Running a one-epoch training from the CLI (bash), I am getting this familiar error:


inside docker:

outside docker:

Do you know what the issue is here?

For your case, when you run the command inside the docker, you need to set the path to the spec file: -e /workspace/detectnet_v2/specs/detectnet_v2_train_resnet50_kitti.txt
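For reference, a full one-epoch training command inside the container might look like the sketch below. Only the spec-file path comes from this thread; the results directory, key variable, and model name are placeholder values in the style of the notebook defaults, so adjust them to your own mounts and encoding key:

```shell
# Inside the TAO container: the spec file is passed with -e, the results
# directory with -r, and the encoding key with -k. All values except the
# spec path are illustrative placeholders.
detectnet_v2 train \
  -e /workspace/detectnet_v2/specs/detectnet_v2_train_resnet50_kitti.txt \
  -r /workspace/detectnet_v2/experiment_dir_unpruned \
  -k $KEY \
  -n resnet50_detector
```

Outside the container (via the tao launcher), the paths would instead need to match the directory mappings in your ~/.tao_mounts.json.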

Aha! Thanks for that.
This current case is definitely outside my experience.

Training ran successfully:

on to exporting

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.