Error when training LPRNet with the TLT 3.0 launcher

Hi all,
I want to train LPRNet on my own dataset; everything is set up and configured correctly.
My system config:
RAM: 16 GB
GPU: 1x GTX 1080 (8 GB)
CPU: Intel Core i3

I set batch size = 4

When I run the following command:

tlt lprnet train --gpus=1 --gpu_index=0 \
  -e /workspace/tlt-experiments/specs/tutorial_spec.txt \
  -r /workspace/tlt-experiments/results \
  -k nvidia_tlt \
  -m /workspace/tlt-experiments/results/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt

I got this error:

Total params: 14,432,480
Trainable params: 14,424,872
Non-trainable params: 7,608


2021-04-11 20:20:10,585 [INFO] __main__: Number of images in the training dataset: 152547
2021-04-11 20:20:10,585 [INFO] __main__: Number of images in the validation dataset: 24610
Epoch 1/24
1/38137 […] - ETA: 91:39:31 - loss: 58.6405WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.772865). Check your callbacks.
2021-04-11 20:20:20,512 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (1.772865). Check your callbacks.
6/38137 […] - ETA: 16:00:16 - loss: 50.6952Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 274, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 270, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 195, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 603, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 265, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 1017, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 36 labels: 5,4,35,4,3,4,1,1 labels seen so far: 5,4
[[{{node loss_2/softmax_loss/CTCLoss}}]]
[[loss_2/softmax_loss/CTCLoss/_6743]]
(1) Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 36 labels: 5,4,35,4,3,4,1,1 labels seen so far: 5,4
[[{{node loss_2/softmax_loss/CTCLoss}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/bin/lprnet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-04-12 00:50:22,428 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I tested with only 24 images and labels for both training and validation, and that run trained fine. Why does it fail on the large dataset? Does it load the entire dataset during training?

Can you paste your characters_list_file here?

@Morganh

  1. tutorial_spec.txt (1.1 KB)
  2. lp_characters.txt (72 Bytes)

Please train without the pretrained model. Currently, using a subset of the default characters_list_file is not supported.

Refer to Lprnet training error (non-null label, index >= num_classes - 1)
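
To locate which samples trigger the CTC error above, a minimal check along these lines may help. It assumes the LPRNet layout of one plain-text label file per image; the paths below are placeholders to adjust to your dataset.

import os

CHARACTERS_FILE = "lp_characters.txt"  # placeholder: your characters_list_file
LABEL_DIR = "train/label"              # placeholder: your label directory

# Load the allowed character set from the characters_list_file.
with open(CHARACTERS_FILE) as f:
    allowed = {line.strip() for line in f if line.strip()}
print(f"{len(allowed)} characters in the list")

# Report every label that contains a character missing from the list.
for name in sorted(os.listdir(LABEL_DIR)):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(LABEL_DIR, name)) as f:
        plate = f.read().strip()
    unknown = sorted({c for c in plate if c not in allowed})
    if unknown:
        print(f"{name}: '{plate}' has characters not in the list: {unknown}")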

LPRNet has two main problems:

1. The characters list must have exactly 35 entries (a quick check is sketched after this post).
2. Changing the model's input dimensions when training from scratch is not supported.

If possible, please fix these bugs in a future version of TLT.
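
For item 1, a quick sanity check of the characters file might look like this ("lp_characters.txt" is a placeholder for your characters_list_file):

# Hypothetical check of the 35-entry constraint mentioned in item 1 above.
with open("lp_characters.txt") as f:
    chars = [line.strip() for line in f if line.strip()]

duplicates = sorted({c for c in chars if chars.count(c) > 1})
print(f"{len(chars)} characters found (expected 35)")
if duplicates:
    print(f"duplicate entries: {duplicates}")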

Item 1 will be fixed in the next release.
For item 2, could you describe it in more detail and also paste the steps/log here? Thanks.