Error training from scratch with character 'O' in LPRNet

Hi

Toolkit - 3.0
GPU - RTX 2070
Driver - 460

I have a dataset that contains the character 'O'. As I read on another thread, we cannot train a model with a custom character set from the pre-trained weights, so we have to train from scratch with a spec file. I am trying that, but I am getting the error below.

File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 274, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 270, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 195, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 603, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 221, in model_iteration
    batch_data = _get_next_batch(generator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 363, in _get_next_batch
    generator_output = next(generator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/data_utils.py", line 789, in get
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/data_utils.py", line 783, in get
    inputs = self.queue.get(block=True).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/data_utils.py", line 571, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/dataloader/data_sequence.py", line 109, in __getitem__
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/dataloader/data_sequence.py", line 109, in <listcomp>
KeyError: 'O'
Traceback (most recent call last):
  File "/usr/local/bin/lprnet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-06-21 18:19:44,784 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

NOTE: I have trained a different dataset both from the pre-trained model and from scratch, but that dataset didn't have 'O'. It worked perfectly.

Any ideas what the issue might be?

Hi,
Could you please share the character list file and your training spec file?

Also, did you use TLT 3.0 or TLT 3.0-dp?
Could you run tlt info --verbose and share the result?
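
You could also quickly confirm that 'O' is actually listed in the characters file your training spec points to. A minimal check, assuming the usual one-character-per-line layout and that the file is visible inside the container (replace the placeholder with the path from your characters_list_file setting):

!tlt lprnet run grep -nx O /path/to/characters_list_file.txt

If this prints nothing, the data loader cannot map 'O' to an index, which would produce exactly this kind of KeyError.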

custom_lp_characters.txt (70 Bytes)
tutorial_spec_scratch_custom.txt (1.1 KB)

As mentioned above, please run tlt info --verbose and check the docker_tag.
I am afraid you are running with 3.0-dp. Please update to 3.0 instead. The issue should be gone.
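
For reference, updating the launcher is roughly the following (a sketch, assuming you installed it via pip; nvidia-tlt is the launcher package for TLT 3.0):

pip3 install --upgrade nvidia-tlt
tlt info --verbose

After that, docker_tag should report v3.0-py3 instead of the -dp tag, and the 3.0 image is pulled the next time a task runs.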


OK, so I was using 3.0-dp. I upgraded the nvidia-tlt pip package and it downloaded the 3.0 docker image. Here is the tlt info --verbose output:

Configuration of the TLT Instance

dockers: 		
	nvidia/tlt-streamanalytics: 			
		docker_registry: nvcr.io
		docker_tag: v3.0-py3
		tasks: 
			1. augment
			2. bpnet
			3. classification
			4. detectnet_v2
			5. dssd
			6. emotionnet
			7. faster_rcnn
			8. fpenet
			9. gazenet
			10. gesturenet
			11. heartratenet
			12. lprnet
			13. mask_rcnn
			14. multitask_classification
			15. retinanet
			16. ssd
			17. unet
			18. yolo_v3
			19. yolo_v4
			20. tlt-converter
	nvidia/tlt-pytorch: 			
		docker_registry: nvcr.io
		docker_tag: v3.0-py3
		tasks: 
			1. speech_to_text
			2. speech_to_text_citrinet
			3. text_classification
			4. question_answering
			5. token_classification
			6. intent_slot_classification
			7. punctuation_and_capitalization
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021

But now I am having a weird issue where training is unable to find the character spec file, although the file seems to be present inside the docker:

!tlt lprnet run ls -l $SPECS_DIR

2021-06-22 22:42:28,122 [INFO] root: Registry: ['nvcr.io']
2021-06-22 22:42:28,257 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
total 20
-rw-rw-r-- 1 1000 1000   70 Jun 22 16:49 custom_lp_characters.txt
-rw-rw-r-- 1 1000 1000 1265 Jun 10 11:12 tutorial_spec.txt
-rw-rw-r-- 1 1000 1000 1266 Jun 10 11:12 tutorial_spec_scratch.txt
-rw-rw-r-- 1 1000 1000 1237 Jun 22 16:50 tutorial_spec_scratch_custom.txt
-rw-rw-r-- 1 1000 1000   70 Jun 10 11:12 us_lp_characters.txt
2021-06-22 22:42:30,442 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
!tlt lprnet train --gpus=1 --gpu_index=$GPU_INDEX \
                  -e $SPECS_DIR/tutorial_spec_scratch_custom.txt \
                  -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                  -k nvidia_tlt

Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 277, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 273, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 122, in run_experiment
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/models/model_builder.py", line 145, in build
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/examples/lprnet/specs/custom_lp_characters.txt'
2021-06-22 22:43:37,076 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you verify it is not a docker image issue? 3.0-dp didn't give this path error.

Can you run the following command to check if the file is available?
!tlt lprnet run ls -l $SPECS_DIR/tutorial_spec_scratch_custom.txt

If yes, please run the following command as well.
!tlt lprnet run cat $SPECS_DIR/tutorial_spec_scratch_custom.txt

Yes, the files are available and the data is also present. Below is the output:

!tlt lprnet run ls -l $SPECS_DIR/tutorial_spec_scratch_custom.txt


2021-06-23 14:38:15,422 [INFO] root: Registry: ['nvcr.io']
2021-06-23 14:38:15,462 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
-rw-rw-r-- 1 1000 1000 1129 Jun 22 15:28 /workspace/tlt-experiments/lprnet/specs/tutorial_spec_scratch_custom.txt
2021-06-23 14:38:19,313 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

!tlt lprnet run cat $SPECS_DIR/tutorial_spec_scratch_custom.txt


2021-06-23 14:38:31,595 [INFO] root: Registry: ['nvcr.io']
2021-06-23 14:38:31,638 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
random_seed: 42
lpr_config {
  hidden_units: 512
  max_label_length: 7
  arch: "baseline"
  nlayers: 18 #setting nlayers to be 10 to use baseline10 model
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 100
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-4
    soft_start: 0.001
    annealing: 0.5
  }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 1
}
augmentation_config {
    output_width: 96
    output_height: 48
    output_channel: 3
    keep_original_prob: 0.3
    transform_prob: 0.5
    rotate_degree: 5
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/custom/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/custom/train/image"
  }
  characters_list_file: "/workspace/examples/lprnet/specs/custom_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/custom/val/label"
    image_directory_path: "/workspace/tlt-experiments/data/custom/val/image"
  }
}
2021-06-23 14:38:34,578 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

So, can you check with the following command? This file should be available because it is mentioned in your training spec file.
$ tlt lprnet run ls -l /workspace/examples/lprnet/specs/custom_lp_characters.txt

Yes, ideally it should be, but running the above command shows it isn't available in the workspace, which is weird.
! tlt lprnet run ls -l /workspace/examples/lprnet/specs/custom_lp_characters.txt


2021-06-23 14:45:31,264 [INFO] root: Registry: ['nvcr.io']
2021-06-23 14:45:31,302 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
ls: cannot access '/workspace/examples/lprnet/specs/custom_lp_characters.txt': No such file or directory
2021-06-23 14:45:33,899 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

What is your ~/.tlt_mounts.json?

{
    "Mounts": [
        {
            "source": "/home/foss/TLT_custom_number_lprnet",
            "destination": "/workspace/tlt-experiments"
        },
        {
            "source": "/home/foss/tlt_cv_samples_v1.0.2/lprnet/specs",
            "destination": "/workspace/tlt-experiments/lprnet/specs"
        }
    ]
}

It seems that the 3.0-py3 docker does not contain the /workspace/examples folder.
You can download it by following the TLT Quick Start Guide — Transfer Learning Toolkit 3.0 documentation.
I think you have already downloaded the 1.0.2 version, so you can find tlt_cv_samples_v1.0.2/lprnet/specs/us_lp_characters.txt; that is the file you need.
For your case, you can modify the path in your training spec file.
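
For example, since your second mount maps the local specs folder to /workspace/tlt-experiments/lprnet/specs (where the earlier ls showed custom_lp_characters.txt), changing the dataset_config entry to the line below should be enough:

characters_list_file: "/workspace/tlt-experiments/lprnet/specs/custom_lp_characters.txt"

Alternatively, adding a mount in ~/.tlt_mounts.json whose destination is /workspace/examples/lprnet/specs (pointing at your local specs folder) should also make the existing path resolve.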

OK, so I wasn't able to solve the spec path issue itself; maybe a reinstall would fix it.
As a workaround, since the contents of the first mount path were visible in the workspace, I kept the character file there and changed the paths in the spec file accordingly, as sketched below.
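
Concretely, the workaround looked roughly like this (paths illustrative): copy custom_lp_characters.txt into the first mounted folder, /home/foss/TLT_custom_number_lprnet, and point the spec at the matching in-container path, e.g.

characters_list_file: "/workspace/tlt-experiments/custom_lp_characters.txt"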

As you mentioned, I was able to train with 3.0; the issue seems to be in 3.0-dp.

Thanks
