LPRNet training and deployment

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): RTX 4090 / T4 (DeepStream 6.3) and Nano (DeepStream 6.0)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : LPRNet

While training, how can I save weights based on the best metric (like accuracy) rather than after every 5 epochs?

I have trained a model using the TAO Toolkit and the weight files are in .hdf5 format. How can I convert them to .etlt so as to deploy in the pipeline?

I generated the onnx file from the hdf5, but when I use that onnx in the DeepStream pipeline to generate the engine file, the following error pops up:

Using file: ./models/anpr_config.yml
0:00:00.583616331 221656 0x560ff81ece40 INFO                 nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<secondary-infer-engine2> NvDsInferContext[UID 3]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:2002> [UID = 3]: Trying to create engine from model files
[libprotobuf ERROR google/protobuf/text_format.cc:298] Error parsing text-format onnx2trt_onnx.ModelProto: 1:1: Invalid control characters encountered in text.
[libprotobuf ERROR google/protobuf/text_format.cc:298] Error parsing text-format onnx2trt_onnx.ModelProto: 2:11: Invalid control characters encountered in text.
[libprotobuf ERROR google/protobuf/text_format.cc:298] Error parsing text-format onnx2trt_onnx.ModelProto: 2:17: Already saw decimal point or exponent; can't have another one.
[libprotobuf ERROR google/protobuf/text_format.cc:298] Error parsing text-format onnx2trt_onnx.ModelProto: 2:13: Message type "onnx2trt_onnx.ModelProto" has no field named "keras2onnx".
ERROR: [TRT]: ModelImporter.cpp:688: Failed to parse ONNX model from file: /home/mainak/ms/C++/anpr_kp/models/lprnet_epoch-024.onnx
ERROR: ../nvdsinfer/nvdsinfer_model_builder.cpp:315 Failed to parse onnx file
ERROR: ../nvdsinfer/nvdsinfer_model_builder.cpp:971 failed to build network since parsing model errors.
ERROR: ../nvdsinfer/nvdsinfer_model_builder.cpp:804 failed to build network.
0:00:02.577464454 221656 0x560ff81ece40 ERROR                nvinfer gstnvinfer.cpp:676:gst_nvinfer_logger:<secondary-infer-engine2> NvDsInferContext[UID 3]: Error in NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:2022> [UID = 3]: build engine file failed
0:00:02.615730998 221656 0x560ff81ece40 ERROR                nvinfer gstnvinfer.cpp:676:gst_nvinfer_logger:<secondary-infer-engine2> NvDsInferContext[UID 3]: Error in NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2108> [UID = 3]: build backend context failed
0:00:02.615758235 221656 0x560ff81ece40 ERROR                nvinfer gstnvinfer.cpp:676:gst_nvinfer_logger:<secondary-infer-engine2> NvDsInferContext[UID 3]: Error in NvDsInferContextImpl::initialize() <nvdsinfer_context_impl.cpp:1282> [UID = 3]: generate backend failed, check config file settings
0:00:02.616255703 221656 0x560ff81ece40 WARN                 nvinfer gstnvinfer.cpp:898:gst_nvinfer_start:<secondary-infer-engine2> error: Failed to create NvDsInferContext instance
0:00:02.616263316 221656 0x560ff81ece40 WARN                 nvinfer gstnvinfer.cpp:898:gst_nvinfer_start:<secondary-infer-engine2> error: Config file path: /home/mainak/ms/C++/anpr_kp/models/lpr_config_sgie_us.yml, NvDsInfer Error: NVDSINFER_CONFIG_FAILED
Running...
ERROR from element secondary-infer-engine2: Failed to create NvDsInferContext instance
Error details: gstnvinfer.cpp(898): gst_nvinfer_start (): /GstPipeline:ANPR-pipeline/GstNvInfer:secondary-infer-engine2:
Config file path: /home/mainak/ms/C++/anpr_kp/models/lpr_config_sgie_us.yml, NvDsInfer Error: NVDSINFER_CONFIG_FAILED
Returned, stopping playback
Deleting pipeline
Disconnecting MQTT Client
Destroying MQTT Client

@Morganh Any suggestion on this is highly appreciated

Can you follow tao_tutorials/notebooks/tao_launcher_starter_kit/lprnet/lprnet.ipynb at main · NVIDIA/tao_tutorials · GitHub to generate the onnx file? See the “Deploy!” section.
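The export step there looks roughly like the sketch below (paths are illustrative, taken from this thread; verify the exact flags and output location against the “Deploy!” section of the notebook). Note that recent TAO releases export directly to .onnx; the .etlt flow belongs to older releases.

# Sketch: export the trained hdf5 checkpoint to onnx (flags assumed per the notebook)
tao model lprnet export -m /workspace/tao-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-024.hdf5 \
                        -k nvidia_tlt \
                        -e /workspace/tao-experiments/lprnet/specs/tutorial_spec.txt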

I followed it to get the onnx. However, when I deploy it, I get the error above. Also, I might retrain, so I wanted to know how to get a .tlt from the .hdf5 and an .etlt from the .onnx.

Can you use Netron to open this onnx file successfully?
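If Netron also fails to open it, a quick command-line check can tell whether the file is a valid ONNX protobuf at all. This is a sketch, assuming the onnx Python package is installed; the path is taken from your log:

# Load and validate the exported file; a truncated or mis-exported file will fail here too
python -c "import onnx; m = onnx.load('/home/mainak/ms/C++/anpr_kp/models/lprnet_epoch-024.onnx'); onnx.checker.check_model(m); print('producer:', m.producer_name)"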

Any help on this? @Morganh

I’m able to integrate it.

You can set it in tao_tutorials/notebooks/tao_launcher_starter_kit/lprnet/specs/tutorial_spec.txt at main · NVIDIA/tao_tutorials · GitHub.
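The relevant field looks roughly like this (illustrative excerpt; the placement inside training_config is assumed and the other fields in the spec are left unchanged):

training_config {
  num_epochs: 24
  checkpoint_interval: 5   # save a checkpoint (and run evaluation) every 5 epochs
}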

OK. But checkpoint_interval takes int values. How can I save based on the best metric (like accuracy)?

The result folder will save the model which has the best accuracy.

So if checkpoint_interval = 5, which checkpoint will be saved? The 5th one, or the best among the last 5 epochs?

This checkpoint_interval means evaluation runs every 5 epochs.
You can set it to 1 to run evaluation every epoch.
Then the best model can be found among all the epochs.

@Morganh
One final thing before I close the topic. I’m trying to retrain the model:

tao model lprnet train --gpus=1 --gpu_index=0 -e /workspace/tao-experiments/lprnet/specs/tutorial_spec.txt -k nvidia_tlt -r /workspace/tao-experiments/lprnet/experiment_dir_unpruned -m /workspace/tao-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-070.hdf5 --initial_epoch 70

the following error comes up:

INFO: Log file already exists at /workspace/tao-experiments/lprnet/experiment_dir_unpruned/status.json
INFO: Merging specification from /workspace/tao-experiments/lprnet/specs/tutorial_spec.txt
INFO: Loading pretrained weights. This may take a while...
INFO: Training was interrupted
INFO: Training was interrupted.
Execution status: PASS
2024-06-12 07:57:59,796 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

However, $KEY is set to nvidia_tlt. What key should I use to retrain the .hdf5 weights? Any suggestion on this? I’m training on an EC2 instance.

According to License Plate Recognition | NVIDIA NGC.
The key is: nvidia_tlt

@Morganh
Any suggestion on the above error?

That is not an error. The info usually means your key is wrong.
As mentioned above, please use nvidia_tlt.

I’ve set it:

tao model lprnet train --gpus=1 --gpu_index=0 -e /workspace/tao-experiments/lprnet/specs/tutorial_spec.txt -k nvidia_tlt -r /workspace/tao-experiments/lprnet/experiment_dir_unpruned -m /workspace/tao-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-070.hdf5 --initial_epoch 70

Still the error persists! I’m retraining against the .hdf5 file, not a .tlt. Will the key change in that case?

How did you train and get experiment_dir_unpruned/weights/lprnet_epoch-070.hdf5? Did you use a key when you ran the previous training?

Yes, for training I used the command below:

tao model lprnet train --gpus=1 --gpu_index=0 -e /workspace/tao-experiments/lprnet/specs/tutorial_spec.txt -k nvidia_tlt -r /workspace/tao-experiments/lprnet/experiment_dir_unpruned -m /workspace/tao-experiments/lprnet/pretrained_lprnet_baseline18/lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt

where -k nvidia_tlt

Please try not setting -k, since you are retraining with your new model.

Now, I’m using:

tao model lprnet train --gpus=1 --gpu_index=0 -e /workspace/tao-experiments/lprnet/specs/tutorial_spec.txt -r /workspace/tao-experiments/lprnet/experiment_dir_unpruned -m /workspace/tao-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-070.hdf5 --initial_epoch 70

Still the error persists!