LPRNet training error on the OpenALPR dataset

Hi,

Toolkit: 3.0
Driver: 460
GPU: RTX 2070

I’m trying to train LPRNet on the default OpenALPR dataset, but “tlt lprnet train” fails with the following error:

For multi-GPU, change --gpus based on your machine.
2021-06-09 21:56:40,102 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2021-06-09 16:27:38,405 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2021-06-09 16:27:38,406 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:56: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2021-06-09 16:27:39,011 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:56: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:59: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-06-09 16:27:39,013 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:59: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:60: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

2021-06-09 16:28:03,912 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:60: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

2021-06-09 16:28:03,913 [INFO] /usr/local/lib/python3.6/dist-packages/iva/lprnet/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/lprnet/specs/tutorial_spec.txt
2021-06-09 16:28:03,925 [INFO] __main__: Loading pretrained weights. This may take a while...
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 274, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 270, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 105, in run_experiment
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/model_io.py", line 78, in load_model_as_pretrain
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/model_io.py", line 41, in load_model
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/model_io.py", line 29, in load_model
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/save.py", line 146, in load_model
    loader_impl.parse_saved_model(filepath)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/saved_model/loader_impl.py", line 83, in parse_saved_model
    constants.SAVED_MODEL_FILENAME_PB))
OSError: SavedModel file does not exist at: /tmp/tmpzxyyngdo.hdf5/{saved_model.pbtxt|saved_model.pb}
Traceback (most recent call last):
  File "/usr/local/bin/lprnet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-06-09 21:58:28,931 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Refer to the thread “LPRnet example fails to run - TLT”.

Yes, I saw that thread and ran the command from it. Based on the output below, the folder does contain the pretrained model inside the docker:

2021-06-09 22:18:27,665 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
/workspace/tlt-experiments/lprnet/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt
2021-06-09 22:18:32,324 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

To narrow this down, could you try logging into the 3.0-dp docker and running training there?
$ docker run --runtime=nvidia -it nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3 /bin/bash
Then, run training via
# lprnet train xxx xxx

Hey, I’m having the exact same issue and can’t figure out what the problem is.
I also tried to train inside the docker as @Morganh suggested, and got the same error.

@m4x.mona Please check again according to LPRnet example fails to run - TLT - #3 by Morganh

This is what I get when running:

!tlt lprnet run ls $USER_EXPERIMENT_DIR/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt
2021-06-13 16:31:45,280 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
/workspace/tlt-experiments/lprnet/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt
2021-06-13 16:31:46,078 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

@leroren220
There is no error there, and that output is actually expected.
The log shows that /workspace/tlt-experiments/lprnet/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt is available.

Hey @Morganh, all the files exist in the docker right where they should be. I get the same output as @leroren220 and @priyanshthakore when running !tlt lprnet run ls …
I still get the OSError: SavedModel file does not exist at: …

@m4x.mona
OK, I will check further. Could you share the detailed steps to reproduce?
BTW,

  • which docker?
  • did you run it from a Jupyter notebook?

Hey

The steps to reproduce are:

  1. Following steps 1-4 here:
    Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation

  2. Following the instructions in the lprnet/lprnet.ipynb notebook

The docker is nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3.
I also reproduced the same error inside the docker without using the tlt launcher.

@m4x.mona
https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/quickstart/deepstream_integration.html
is not available, right?

Hmm, yeah. It was available some time ago.

OK, got it. Yes, the user guide has been updated for the latest release.

On my side, I trained both inside and outside the docker, but I cannot reproduce the issue; the training works well:

$ tlt lprnet train -e /workspace/demo_3.0/lprnet/specs/tutorial_spec.txt -r /workspace/demo_3.0/lprnet/experiment_dir_unpruned -k nvidia_tlt -m /workspace/demo_3.0/lprnet/pretrained_lprnet_baseline18/us_lprnet_baseline18_unpruned.tlt

Could you share the full log with me? If you were running from a Jupyter notebook, you can attach the .ipynb file here.

I found the root cause of your issue. @m4x.mona @leroren220 @priyanshthakore

See the LPRNet model card: https://ngc.nvidia.com/catalog/models/nvidia:tlt_lprnet
The model load key is: nvidia_tlt
So, if you are using the pretrained model from NGC, please set “-k” to nvidia_tlt instead of your own key.
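For anyone wondering why a wrong key surfaces as a SavedModel error: the launcher decrypts the .tlt archive into a temporary .hdf5 file (the tmpzxyyngdo.hdf5 in the traceback) and hands it to Keras. With the wrong key the decrypted bytes are not a valid HDF5 file, so Keras gives up on the HDF5 path and falls back to parsing it as a SavedModel directory, which fails. A minimal sketch of that signature check (the decryption step is simulated with garbage bytes here, this is not the real TLT code):

```python
import tempfile

# Every valid HDF5 file starts with this 8-byte magic signature.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"

def looks_like_hdf5(path):
    """Return True if the file begins with the HDF5 magic bytes."""
    with open(path, "rb") as f:
        return f.read(8) == HDF5_MAGIC

# Simulate what a wrong "-k" key produces: the decrypted temp file is
# garbage rather than a real HDF5 archive, so Keras' load_model() skips
# the HDF5 loader and tries (and fails) to parse it as a SavedModel.
with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=False) as tmp:
    tmp.write(b"not a valid hdf5 payload")
    bad_path = tmp.name

print(looks_like_hdf5(bad_path))  # False
```

So the misleading "SavedModel file does not exist" message is really a symptom of a failed decryption, not a missing file.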

Yes, just changing the key to “nvidia_tlt” instead of our own key worked. Note that “nvidia_tlt” needs to be used throughout the notebook.
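For future readers, a sketch of what the corrected invocation looks like. The paths are the ones pasted earlier in this thread (the -r results directory and the KEY variable name are illustrative assumptions, not taken from the notebook):

```shell
# NGC model load key, as listed on the LPRNet model card.
KEY=nvidia_tlt

# Corrected training command. Commented out here because it needs the
# TLT environment to actually run; only the key assignment executes.
# tlt lprnet train \
#   -e /workspace/tlt-experiments/lprnet/specs/tutorial_spec.txt \
#   -r /workspace/tlt-experiments/lprnet/experiment_dir_unpruned \
#   -k "$KEY" \
#   -m /workspace/tlt-experiments/lprnet/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt
echo "$KEY"
```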

It works, thank you.

Please try the latest 3.0-py3 docker. It will catch this kind of error and prompt the end user to check the NGC key.