Error training from scratch with character 'O' in LPRNet

Hi

Toolkit - 3.0
GPU - RTX 2070
Driver - 460

I have a dataset that contains the character 'O'. As I read on another thread, we cannot train a model with a custom character set from the pre-trained weights, so we have to train from scratch with a spec file. I am trying that, but I am getting the error below.

File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 274, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 270, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 195, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 603, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 221, in model_iteration
    batch_data = _get_next_batch(generator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 363, in _get_next_batch
    generator_output = next(generator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/data_utils.py", line 789, in get
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/data_utils.py", line 783, in get
    inputs = self.queue.get(block=True).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/data_utils.py", line 571, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/dataloader/data_sequence.py", line 109, in __getitem__
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/dataloader/data_sequence.py", line 109, in <listcomp>
KeyError: 'O'
Traceback (most recent call last):
  File "/usr/local/bin/lprnet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-06-21 18:19:44,784 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

NOTE: I have trained a different dataset both from the pre-trained model and from scratch, but that dataset didn't have 'O'. It worked perfectly.

Any ideas what the issue might be?

Hi,
Could you please share the character list file and your training spec file?

Also, did you use TLT 3.0 or TLT 3.0-dp?
Could you run tlt info --verbose and share the result?
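
You could also quickly confirm that 'O' is actually listed in the characters file your training spec points to. A minimal check, assuming the usual one-character-per-line layout and that the file is visible inside the container (replace the placeholder with the path from your characters_list_file setting):

!tlt lprnet run grep -nx O /path/to/characters_list_file.txt

If this prints nothing, the data loader cannot map 'O' to an index, which would produce exactly this kind of KeyError.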

custom_lp_characters.txt (70 Bytes)
tutorial_spec_scratch_custom.txt (1.1 KB)

As mentioned above, please run tlt info --verbose and check the docker_tag.
I am afraid you are running with 3.0-dp. Please update to 3.0 instead. The issue should be gone.
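
For reference, updating the launcher is roughly the following (a sketch, assuming you installed it via pip; nvidia-tlt is the launcher package for TLT 3.0):

pip3 install --upgrade nvidia-tlt
tlt info --verbose

After that, docker_tag should report v3.0-py3 instead of the -dp tag, and the 3.0 image is pulled the next time a task runs.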


OK, so I was using 3.0-dp. I upgraded the nvidia-tlt pip package and it downloaded the 3.0 docker image. Here is the tlt info --verbose output:

Configuration of the TLT Instance

dockers: 		
	nvidia/tlt-streamanalytics: 			
		docker_registry: nvcr.io
		docker_tag: v3.0-py3
		tasks: 
			1. augment
			2. bpnet
			3. classification
			4. detectnet_v2
			5. dssd
			6. emotionnet
			7. faster_rcnn
			8. fpenet
			9. gazenet
			10. gesturenet
			11. heartratenet
			12. lprnet
			13. mask_rcnn
			14. multitask_classification
			15. retinanet
			16. ssd
			17. unet
			18. yolo_v3
			19. yolo_v4
			20. tlt-converter
	nvidia/tlt-pytorch: 			
		docker_registry: nvcr.io
		docker_tag: v3.0-py3
		tasks: 
			1. speech_to_text
			2. speech_to_text_citrinet
			3. text_classification
			4. question_answering
			5. token_classification
			6. intent_slot_classification
			7. punctuation_and_capitalization
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021

But now I am having a weird issue where training is unable to find the character spec file, although the file seems to be present inside the docker:

!tlt lprnet run ls -l $SPECS_DIR

2021-06-22 22:42:28,122 [INFO] root: Registry: ['nvcr.io']
2021-06-22 22:42:28,257 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
total 20
-rw-rw-r-- 1 1000 1000   70 Jun 22 16:49 custom_lp_characters.txt
-rw-rw-r-- 1 1000 1000 1265 Jun 10 11:12 tutorial_spec.txt
-rw-rw-r-- 1 1000 1000 1266 Jun 10 11:12 tutorial_spec_scratch.txt
-rw-rw-r-- 1 1000 1000 1237 Jun 22 16:50 tutorial_spec_scratch_custom.txt
-rw-rw-r-- 1 1000 1000   70 Jun 10 11:12 us_lp_characters.txt
2021-06-22 22:42:30,442 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
!tlt lprnet train --gpus=1 --gpu_index=$GPU_INDEX \
                  -e $SPECS_DIR/tutorial_spec_scratch_custom.txt \
                  -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                  -k nvidia_tlt

Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 277, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 273, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 122, in run_experiment
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/models/model_builder.py", line 145, in build
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/examples/lprnet/specs/custom_lp_characters.txt'
2021-06-22 22:43:37,076 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you verify it is not a docker image issue? 3.0-dp didn't give this path error.

Can you run the following command to check if the file is available?
!tlt lprnet run ls -l $SPECS_DIR/tutorial_spec_scratch_custom.txt

If yes, please run the following command as well.
!tlt lprnet run cat $SPECS_DIR/tutorial_spec_scratch_custom.txt

Yes, the files are available and the data is also present. Below is the output:

!tlt lprnet run ls -l $SPECS_DIR/tutorial_spec_scratch_custom.txt


2021-06-23 14:38:15,422 [INFO] root: Registry: ['nvcr.io']
2021-06-23 14:38:15,462 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
-rw-rw-r-- 1 1000 1000 1129 Jun 22 15:28 /workspace/tlt-experiments/lprnet/specs/tutorial_spec_scratch_custom.txt
2021-06-23 14:38:19,313 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

!tlt lprnet run cat $SPECS_DIR/tutorial_spec_scratch_custom.txt


2021-06-23 14:38:31,595 [INFO] root: Registry: ['nvcr.io']
2021-06-23 14:38:31,638 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
random_seed: 42
lpr_config {
  hidden_units: 512
  max_label_length: 7
  arch: "baseline"
  nlayers: 18 #setting nlayers to be 10 to use baseline10 model
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 100
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-4
    soft_start: 0.001
    annealing: 0.5
  }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 1
}
augmentation_config {
    output_width: 96
    output_height: 48
    output_channel: 3
    keep_original_prob: 0.3
    transform_prob: 0.5
    rotate_degree: 5
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/custom/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/custom/train/image"
  }
  characters_list_file: "/workspace/examples/lprnet/specs/custom_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/custom/val/label"
    image_directory_path: "/workspace/tlt-experiments/data/custom/val/image"
  }
}
2021-06-23 14:38:34,578 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

So, can you check with the following command? This file should be available because it is mentioned in your training spec file.
$ tlt lprnet run ls -l /workspace/examples/lprnet/specs/custom_lp_characters.txt

Yes, ideally it should be, but running the above command shows it isn't available in the workspace, which is weird.
! tlt lprnet run ls -l /workspace/examples/lprnet/specs/custom_lp_characters.txt


2021-06-23 14:45:31,264 [INFO] root: Registry: ['nvcr.io']
2021-06-23 14:45:31,302 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
ls: cannot access '/workspace/examples/lprnet/specs/custom_lp_characters.txt': No such file or directory
2021-06-23 14:45:33,899 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

What is your ~/.tlt_mounts.json?

{
    "Mounts": [
        {
            "source": "/home/foss/TLT_custom_number_lprnet",
            "destination": "/workspace/tlt-experiments"
        },
        {
            "source": "/home/foss/tlt_cv_samples_v1.0.2/lprnet/specs",
            "destination": "/workspace/tlt-experiments/lprnet/specs"
        }
    ]
}

It seems that the 3.0-py3 docker does not contain the /workspace/examples folder.
You can download it by following the TLT Quick Start Guide — Transfer Learning Toolkit 3.0 documentation.
I think you have already downloaded the 1.0.2 version, so you can find tlt_cv_samples_v1.0.2/lprnet/specs/us_lp_characters.txt; that is the file you need.
For your case, you can modify the path in your training spec file.
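
For example, since your second mount maps the local specs folder to /workspace/tlt-experiments/lprnet/specs (where the earlier ls showed custom_lp_characters.txt), changing the dataset_config entry to the line below should be enough:

characters_list_file: "/workspace/tlt-experiments/lprnet/specs/custom_lp_characters.txt"

Alternatively, adding a mount in ~/.tlt_mounts.json whose destination is /workspace/examples/lprnet/specs (pointing at your local specs folder) should also make the existing path resolve.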

OK, so I wasn't able to solve the spec path issue itself; maybe a reinstall would fix it.
As a workaround, since the contents of the first mount path were visible in the workspace, I kept the character file there and changed the paths in the spec file accordingly, as sketched below.
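
Concretely, the workaround looked roughly like this (paths illustrative): copy custom_lp_characters.txt into the first mounted folder, /home/foss/TLT_custom_number_lprnet, and point the spec at the matching in-container path, e.g.

characters_list_file: "/workspace/tlt-experiments/custom_lp_characters.txt"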

As you mentioned, I was able to train with 3.0; the issue seems to be in 3.0-dp.

Thanks
