Tao-converter [ERROR] Failed to parse the model, please check the encoding key to make sure its correct

full_log.docx (39.4 KB)

I have never tried the public KITTI dataset mentioned in the notebook.
To recap:
My real-world use case is not an autonomous vehicle.

In my project I have successfully trained a model that tests against my hold-back test set with 93.6% mAP and, more importantly, it is 100% accurate at object detection of the single object in my use case.

I have also successfully exported it and the only problem has been in turning it into a .engine file via the tao-converter.
I really do not want to import the public KITTI dataset and spend time on that unless it is absolutely vital.

Please advise

According to the above training log, the training is not running successfully.

2023-07-03 17:23:37,264 [INFO] tensorflow: Graph was finalized.
2023-07-03 17:23:37,288 [INFO] root: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py>", line 3, in <module>
  File "<frozen iva.detectnet_v2.scripts.train>", line 1032, in <module>
  File "<frozen iva.detectnet_v2.scripts.train>", line 1011, in <module>
  File "<decorator-gen-117>", line 2, in main
  File "<frozen iva.detectnet_v2.utilities.timer>", line 46, in wrapped_fn
  File "<frozen iva.detectnet_v2.scripts.train>", line 994, in main
  File "<frozen iva.detectnet_v2.scripts.train>", line 853, in run_experiment
  File "<frozen iva.detectnet_v2.scripts.train>", line 728, in train_gridbox
  File "<frozen iva.detectnet_v2.scripts.train>", line 197, in run_training_loop
  File "<frozen iva.detectnet_v2.training.utilities>", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'IsVariableInitialized_1035:0' shape=() dtype=bool>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "<frozen iva.detectnet_v2.training.utilities>", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "<frozen moduluspy.modulus.hooks.hooks>", line 285, in begin
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/tf_should_use.py", line 198, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))

But you mentioned earlier that “I have successfully trained a model that tests against my hold-back test set with 93.6% mAP”.

Am I missing something?

Also, what CPU are you using?
And could you share the output of $ nvidia-smi as well?

I have been working on this real-world use case for some time, having gathered unique data of a very high quality and having created a database.
I initially trained a binary classifier on the datasets, but was disappointed with the results it achieved.
When I switched to detectnet_v2 the results improved dramatically.
You advised that an issue I was having at a later point might be resolved by changing to a single object detector (from 2 objects) and reducing the image size.
The single-object detector achieved good results, and I ran a further 20 experiments fine-tuning the hyper-parameters. It turned out that a larger image size (half the capture size) provided the best results. The highest mAP (95.2%) was achieved with a smaller image, but when I ran the 'Visualize Inferences' section it was clear that mAP is not a great metric for my use case, because the real-world accuracy was lower. The best result, 100% accuracy with around 15% false positives, came from a model with 93.6243% mAP.
I then exported the model as an .etlt and installed it on the Nano, where I used tao-converter, but it failed to convert the .etlt into a .engine for deployment.
You advised me to run a one epoch train experiment to then export and convert.
This proved problematic within the Jupyter Notebook, which threw a series of KEY-related errors, so you advised me to repeat the experiment outside the notebook from the command line (bash).
This is where the current failed training is being carried out.
The most recent part of this narrative, since the tao-converter failed, is chronicled in this thread.

The CPU is an AMD Ryzen 9, 3.30 GHz.


Please advise.

So, there are two remaining issues here.
One is in your training log,
“2023-07-03 17:23:37,288 [INFO] root: CUDA runtime implicit initialization on GPU:0 failed. Status: the provided PTX was compiled with an unsupported toolchain.”.

Please make sure you log in to the docker correctly.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Also, please update the nvidia-driver to 525.
In the nvidia-smi result you shared, the driver version info is missing. Please share it again.
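As a quick sanity check, the host driver version can be queried directly before re-running training (this assumes nvidia-smi is on the PATH of the host machine):

```shell
# Query only the driver version on the host (no container needed).
# TAO Toolkit 4.0.1 images expect a 525-series driver; an older driver
# is one common cause of the "PTX was compiled with an unsupported
# toolchain" error, because the driver's JIT is older than the CUDA
# toolkit inside the container.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```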

The other is the export error. Can you double-check it again?

I have rebooted and am running docker using your arguments.
I will report back in due course.

OK. Thanks for the info.

Running a one-epoch training from the CLI (bash), I am getting this familiar error:


inside docker:

outside docker:

Do you know what the issue is here?

For your case, when you run the command inside the docker, you need to set the path to the spec file: -e /workspace/detectnet_v2/specs/detectnet_v2_train_resnet50_kitti.txt
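For reference, a full one-epoch training command inside the container might look like the sketch below. Only the spec-file path comes from this thread; the results directory, key variable, and model name are placeholder values in the style of the notebook defaults, so adjust them to your own mounts and encoding key:

```shell
# Inside the TAO container: the spec file is passed with -e, the results
# directory with -r, and the encoding key with -k. All values except the
# spec path are illustrative placeholders.
detectnet_v2 train \
  -e /workspace/detectnet_v2/specs/detectnet_v2_train_resnet50_kitti.txt \
  -r /workspace/detectnet_v2/experiment_dir_unpruned \
  -k $KEY \
  -n resnet50_detector
```

Outside the container (via the tao launcher), the paths would instead need to match the directory mappings in your ~/.tao_mounts.json.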

Aha! Thanks for that.
This current case is definitely outside my experience.

Training ran successfully:

on to exporting

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.