Error when training LPRNet

faddoni1 · May 25, 2021, 1:08pm

Please provide complete information as applicable to your setup.

• Hardware Platform: GPU
• DeepStream Version: 5.1.0
• TensorRT Version: 7.2.2.3
• NVIDIA GPU Driver Version: 460.73.01

Hello. I’m having trouble training the LPRNet model with my custom dataset. My dataset follows the schema
/dataset
  ./images
    ./img000.jpg
    …
  ./labels
    ./img000.txt
    …
  ./characters_list.txt

but training fails, possibly because of an indexing problem.
It gives the following error:

… (PRUNED OUTPUT - NETWORK ARCHITECTURE) …

==================================================================================================
Total params: 14,432,480
Trainable params: 14,424,872
Non-trainable params: 7,608

2021-05-21 14:01:49,794 [INFO] main : Number of images in the training dataset: 2341
2021-05-21 14:01:49,794 [INFO] main : Number of images in the validation dataset: 2341
Epoch 1/24
1/74 […] - ETA: 7:56:42 - loss: 24.5908WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.727762). Check your callbacks.
2021-05-21 14:08:23,314 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (1.727762). Check your callbacks.
73/74 [============================>.] - ETA: 8s - loss: 14.0392 306c05f17d3d:45:63 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
306c05f17d3d:45:63 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
306c05f17d3d:45:63 [0] NCCL INFO NET/IB : No device found.
306c05f17d3d:45:63 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
306c05f17d3d:45:63 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
306c05f17d3d:45:63 [0] NCCL INFO Channel 00/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 01/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 02/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 03/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 04/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 05/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 06/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 07/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 08/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 09/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 10/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 11/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 12/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 13/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 14/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 15/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 16/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 17/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 18/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 19/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 20/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 21/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 22/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 23/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 24/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 25/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 26/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 27/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 28/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 29/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 30/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 31/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1/-
306c05f17d3d:45:63 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
306c05f17d3d:45:63 [0] NCCL INFO comm 0x7f018b000ca0 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE

Epoch 00001: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-01.tlt
74/74 [==============================] - 618s 8s/step - loss: 13.9336
Epoch 2/24
73/74 [============================>.] - ETA: 0s - loss: 3.9204
Epoch 00002: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-02.tlt
74/74 [==============================] - 30s 401ms/step - loss: 3.8944
Epoch 3/24
73/74 [============================>.] - ETA: 0s - loss: 1.7050
Epoch 00003: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-03.tlt
74/74 [==============================] - 29s 391ms/step - loss: 1.6941
Epoch 4/24
73/74 [============================>.] - ETA: 0s - loss: 1.0009
Epoch 00004: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-04.tlt
74/74 [==============================] - 29s 390ms/step - loss: 1.0014
Epoch 5/24
73/74 [============================>.] - ETA: 0s - loss: 0.6964
Epoch 00005: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-05.tlt
Traceback (most recent call last):
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py”, line 274, in <module>
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py”, line 270, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py”, line 195, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py”, line 727, in fit
use_multiprocessing=use_multiprocessing)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py”, line 603, in fit
steps_name=‘steps_per_epoch’)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py”, line 332, in model_iteration
callbacks.on_epoch_end(epoch, epoch_logs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py”, line 299, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/callbacks/ac_callback.py”, line 65, in on_epoch_end
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/callbacks/ac_callback.py”, line 42, in _get_accuracy
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/ctc_decoder.py”, line 33, in decode_ctc_conf
IndexError: list index out of range
Traceback (most recent call last):
File “/usr/local/bin/lprnet”, line 8, in <module>
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py”, line 12, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py”, line 296, in launch_job
AssertionError: Process run failed.
2021-05-21 16:15:09,119 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

It stops after the 5th epoch. Could it be the dataset? I made sure to label it as accurately as possible…
Thanks in advance.

faddoni1 · May 25, 2021, 1:10pm

Also, my spec file is:

random_seed: 42
lpr_config {
  hidden_units: 512
  max_label_length: 8
  arch: "baseline"
  nlayers: 18 #setting nlayers to be 10 to use baseline10 model
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 24
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-5
    soft_start: 0.001
    annealing: 0.5
  }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 1
}
augmentation_config {
    output_width: 96
    output_height: 48
    output_channel: 3
    keep_original_prob: 0.3
    transform_prob: 0.5
    rotate_degree: 5
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/myset/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/myset/train/image"
  }
  characters_list_file: "/workspace/tlt-experiments/lprnet/specs/it_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/myset/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/myset/train/image"
  }
}
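
One way to rule out the dataset itself is to check every label against the characters list and against max_label_length before training. The following is a minimal sketch, not part of TLT: the function names are hypothetical, and it assumes one label string per .txt file (named after its image) and one character per line in the characters file.

```python
from pathlib import Path


def load_charset(characters_file):
    """Read the characters file, assuming one character per line."""
    lines = Path(characters_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def label_problems(label, charset, max_label_length):
    """Return a list of problems found in a single label string."""
    problems = []
    if not label:
        problems.append("empty label")
    if len(label) > max_label_length:
        problems.append(
            f"length {len(label)} exceeds max_label_length {max_label_length}"
        )
    unknown = sorted(set(label) - set(charset))
    if unknown:
        problems.append(f"characters not in list: {unknown}")
    return problems


def check_dataset(image_dir, label_dir, characters_file, max_label_length):
    """Map each image file name to its problems; clean images are omitted."""
    charset = load_charset(characters_file)
    report = {}
    for image_path in sorted(Path(image_dir).iterdir()):
        # Assumption: the label file shares the image's base name.
        label_path = Path(label_dir) / f"{image_path.stem}.txt"
        if not label_path.is_file():
            report[image_path.name] = ["missing label file"]
            continue
        label = label_path.read_text(encoding="utf-8").strip()
        problems = label_problems(label, charset, max_label_length)
        if problems:
            report[image_path.name] = problems
    return report
```

Calling check_dataset with the image and label directories, the characters_list_file and the max_label_length from the spec returns an empty dict when nothing is flagged. A clean report would suggest the problem lies elsewhere than in the labels.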

Morganh · May 25, 2021, 1:50pm

Can you set a larger max_label_length and retry?
Also, how many classes are in the characters list? Can you share it_lp_characters.txt?

faddoni1 · May 28, 2021, 8:15am

Tried with max_label_length=10 but no luck.
it_lp_characters.txt contains the 32 Italian plate characters:

1.  0
2.  1
3.  2
4.  3
5.  4
6.  5
7.  6
8.  7
9.  8
10. 9
11. A
12. B
13. C
14. D
15. E
16. F
17. G
18. H
19. J
20. K
21. L
22. M
23. N
24. P
25. R
26. S
27. T
28. V
29. W
30. X
31. Y
32. Z

Morganh · May 29, 2021, 3:05pm

In the current 3.0_dp version, if you train with the pretrained model from NGC, the characters list must have 35 entries. See deepstream_lpr_app/dict_us.txt at master · NVIDIA-AI-IOT/deepstream_lpr_app · GitHub.
For your case, could you try to train without the pretrained model?
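
The traceback above ends in decode_ctc_conf, called from the accuracy callback at epoch end, and the spec sets validation_period_during_training: 5, which would fit the crash appearing right after the fifth epoch. The sketch below illustrates how a character-count mismatch can raise this kind of IndexError. It is a generic greedy CTC decoder, not the TLT implementation (which is not shown in this thread), and it assumes the blank class is the last index of an output layer sized for 35 characters.

```python
def greedy_ctc_decode(class_ids, characters, blank_id):
    """Collapse repeated class ids, drop blanks, and map ids to characters."""
    decoded = []
    previous = blank_id
    for class_id in class_ids:
        if class_id != blank_id and class_id != previous:
            # Raises IndexError if class_id >= len(characters).
            decoded.append(characters[class_id])
        previous = class_id
    return "".join(decoded)


# The 32 characters posted above, in the same order.
italian_chars = list("0123456789ABCDEFGHJKLMNPRSTVWXYZ")

# Assumption: the output layer has 35 character classes plus a blank at index 35.
BLANK_ID = 35

# Predictions that stay below index 32 decode without trouble.
print(greedy_ctc_decode([10, 10, BLANK_ID, 10, BLANK_ID, 0], italian_chars, BLANK_ID))

# A prediction of class 33 has no entry in the 32-character list.
try:
    greedy_ctc_decode([33], italian_chars, BLANK_ID)
except IndexError as error:
    print("IndexError:", error)
```

Under these assumptions the decoder only fails when the network predicts one of the classes missing from the shorter list, so training can run for several epochs before validation triggers the error.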

phenixbuzzer · July 3, 2021, 4:15pm

Yes, I got the same error too. I had to add a few extra letters to the Russian character list, and that got past it. But what should I do for another language? Do I have to build my own model?

Morganh · July 3, 2021, 4:18pm

In the latest TLT 3.0-py3 docker, a subset of the characters can be trained too.

phenixbuzzer · July 5, 2021, 3:41pm

Morganh, I searched for information on how to do it, but my knowledge is limited. Could you please explain in more detail how to do it?

Morganh · July 5, 2021, 3:47pm

@phenixbuzzer
Please create a new forum topic and describe the detailed error. The spec files and characters list are also appreciated.

phenixbuzzer · July 6, 2021, 8:45am

Created: Error when training LPRNet 2 (characters number < 35)
