Error when training LPRNet

Please provide complete information as applicable to your setup.

• Hardware Platform: GPU
• DeepStream Version: 5.1.0
• TensorRT Version: 7.2.2.3
• NVIDIA GPU Driver Version: 460.73.01

Hello. I’m having trouble training the LPRNet model with my custom dataset. My dataset follows this layout:
/dataset
  ./images
    ./img000.jpg
  ./labels
    ./img000.txt
  ./characters_list.txt

but there seems to be a problem, perhaps with indexing.
It gives the following error:

… (PRUNED OUTPUT - NETWORK ARCHITECTURE) …

==================================================================================================
Total params: 14,432,480
Trainable params: 14,424,872
Non-trainable params: 7,608

2021-05-21 14:01:49,794 [INFO] main : Number of images in the training dataset: 2341
2021-05-21 14:01:49,794 [INFO] main : Number of images in the validation dataset: 2341
Epoch 1/24
1/74 […] - ETA: 7:56:42 - loss: 24.5908WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.727762). Check your callbacks.
2021-05-21 14:08:23,314 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (1.727762). Check your callbacks.
73/74 [============================>.] - ETA: 8s - loss: 14.0392 306c05f17d3d:45:63 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
306c05f17d3d:45:63 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
306c05f17d3d:45:63 [0] NCCL INFO NET/IB : No device found.
306c05f17d3d:45:63 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
306c05f17d3d:45:63 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
306c05f17d3d:45:63 [0] NCCL INFO Channel 00/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 01/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 02/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 03/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 04/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 05/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 06/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 07/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 08/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 09/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 10/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 11/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 12/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 13/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 14/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 15/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 16/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 17/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 18/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 19/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 20/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 21/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 22/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 23/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 24/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 25/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 26/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 27/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 28/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 29/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 30/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 31/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1/-
306c05f17d3d:45:63 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
306c05f17d3d:45:63 [0] NCCL INFO comm 0x7f018b000ca0 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE

Epoch 00001: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-01.tlt
74/74 [==============================] - 618s 8s/step - loss: 13.9336
Epoch 2/24
73/74 [============================>.] - ETA: 0s - loss: 3.9204
Epoch 00002: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-02.tlt
74/74 [==============================] - 30s 401ms/step - loss: 3.8944
Epoch 3/24
73/74 [============================>.] - ETA: 0s - loss: 1.7050
Epoch 00003: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-03.tlt
74/74 [==============================] - 29s 391ms/step - loss: 1.6941
Epoch 4/24
73/74 [============================>.] - ETA: 0s - loss: 1.0009
Epoch 00004: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-04.tlt
74/74 [==============================] - 29s 390ms/step - loss: 1.0014
Epoch 5/24
73/74 [============================>.] - ETA: 0s - loss: 0.6964
Epoch 00005: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-05.tlt
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 274, in
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 270, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 195, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 603, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 332, in model_iteration
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 299, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/callbacks/ac_callback.py", line 65, in on_epoch_end
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/callbacks/ac_callback.py", line 42, in _get_accuracy
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/ctc_decoder.py", line 33, in decode_ctc_conf
IndexError: list index out of range
Traceback (most recent call last):
  File "/usr/local/bin/lprnet", line 8, in
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-05-21 16:15:09,119 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

It stops after the 5th epoch. Could it be the dataset? I made sure to label it as accurately as possible.
Thanks in advance.

Also, my spec file is:

random_seed: 42
lpr_config {
  hidden_units: 512
  max_label_length: 8
  arch: "baseline"
  nlayers: 18 # 18 selects the baseline18 model; set nlayers to 10 to use baseline10
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 24
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 1e-6
      max_learning_rate: 1e-5
      soft_start: 0.001
      annealing: 0.5
    }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 1
}
augmentation_config {
    output_width: 96
    output_height: 48
    output_channel: 3
    keep_original_prob: 0.3
    transform_prob: 0.5
    rotate_degree: 5
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/myset/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/myset/train/image"
  }
  characters_list_file: "/workspace/tlt-experiments/lprnet/specs/it_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/myset/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/myset/train/image"
  }
}
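
A note on the timing: the traceback passes through ac_callback.py's on_epoch_end, and the spec sets validation_period_during_training: 5, so the crash right after epoch 5 is saved is consistent with the first validation pass failing rather than training itself. The sketch below only illustrates that schedule. It assumes validation fires on every epoch that is a multiple of the period; the helper name validation_epochs is hypothetical and not part of TLT.

```python
from typing import List


def validation_epochs(num_epochs: int, period: int) -> List[int]:
    """Return the 1-based epochs at which validation would run.

    Assumption: validation fires whenever the epoch number is a
    multiple of `period` (not confirmed against the TLT source).
    """
    return [epoch for epoch in range(1, num_epochs + 1) if epoch % period == 0]


# Values taken from the spec file above: num_epochs: 24, period: 5.
print(validation_epochs(num_epochs=24, period=5))
```

If that assumption holds, setting validation_period_during_training to 1 should reproduce the error after the first epoch, which makes retries much faster.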

Can you set a larger max_label_length and retry?
Also, how many classes are in the character list? Can you share it_lp_characters.txt?

I tried with max_label_length=10, but no luck.
it_lp_characters.txt contains the Italian plate characters, 32 in total:

0
1
2
3
4
5
6
7
8
9
A
B
C
D
E
F
G
H
J
K
L
M
N
P
R
S
T
V
W
X
Y
Z
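
To rule out the dataset itself, the labels can be checked against this character list and against max_label_length. The following is a minimal sketch, assuming each label file holds the plate text on a single line and the characters file holds one character per line. The function names (load_characters, check_label, check_dataset) are made up for illustration and are not part of TLT.

```python
from pathlib import Path
from typing import Dict, List, Set


def load_characters(path: Path) -> Set[str]:
    """Read one character per line, ignoring blank lines."""
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def check_label(text: str, characters: Set[str], max_label_length: int) -> List[str]:
    """Return the problems found in one label string (empty list if none)."""
    problems = []
    if not text:
        problems.append("empty label")
    if len(text) > max_label_length:
        problems.append(f"length {len(text)} exceeds max_label_length {max_label_length}")
    unknown = sorted(set(text) - characters)
    if unknown:
        problems.append(f"characters not in list: {unknown}")
    return problems


def check_dataset(label_dir: Path, characters_file: Path,
                  max_label_length: int) -> Dict[str, List[str]]:
    """Map each offending label file name to its list of problems."""
    characters = load_characters(characters_file)
    report = {}
    for label_path in sorted(label_dir.glob("*.txt")):
        text = label_path.read_text().strip()
        problems = check_label(text, characters, max_label_length)
        if problems:
            report[label_path.name] = problems
    return report


# Example call with the paths from the spec file (adjust to your setup):
# check_dataset(Path("/workspace/tlt-experiments/data/myset/train/label"),
#               Path("/workspace/tlt-experiments/lprnet/specs/it_lp_characters.txt"),
#               max_label_length=8)
```

An empty report would suggest the labels are consistent with the list, so the problem is more likely on the model side. Note that the list above has no I, O, Q, or U, so a label containing one of those letters would be flagged.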

In the current 3.0_dp version, if you train with the pretrained model from NGC, the character list must contain 35 characters. See deepstream_lpr_app/dict_us.txt on the master branch of the NVIDIA-AI-IOT/deepstream_lpr_app GitHub repository.
In your case, could you try training without the pretrained model?
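
To illustrate why a character-count mismatch could surface as "IndexError: list index out of range" in the CTC decoder, here is a hypothetical greedy decoder. It is not the code in ctc_decoder.py, which is not shown in this thread. It assumes the decoder looks up each predicted class index in the character list and that the blank class sits right after the last character.

```python
from typing import List, Sequence


def greedy_ctc_decode(class_ids: Sequence[int], characters: List[str]) -> str:
    """Collapse repeated indices, drop blanks, map the rest to characters.

    Assumption: the blank class is the index right after the last
    character, i.e. blank_id == len(characters).
    """
    blank_id = len(characters)
    decoded = []
    previous = None
    for idx in class_ids:
        if idx != previous and idx != blank_id:
            # Raises IndexError when idx > len(characters), e.g. when the
            # network head was built for a longer character list.
            decoded.append(characters[idx])
        previous = idx
    return "".join(decoded)


italian = list("0123456789ABCDEFGHJKLMNPRSTVWXYZ")  # the 32 characters above

# Indices within the 32-character list decode normally.
print(greedy_ctc_decode([10, 10, 32, 11, 1, 1], italian))

# A head sized for 35 characters could emit an index such as 34,
# which the 32-entry list cannot resolve.
try:
    greedy_ctc_decode([34], italian)
except IndexError as err:
    print("IndexError:", err)
```

Under these assumptions, a model whose output layer was sized for 35 characters can predict indices that a 32-entry list cannot resolve, which would match the error seen at the first validation pass.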