Error when training LPRNet with the TLT 3.0 launcher

Hi all,
I want to train LPRNet on my own dataset; everything is set up and configured correctly.
My system config:
RAM: 16 GB
GPU: 1x GTX 1080 (8 GB)
CPU: Intel Core i3

I set batch size = 4

When I run the following command:

tlt lprnet train --gpus=1 --gpu_index=0 \
  -e /workspace/tlt-experiments/specs/tutorial_spec.txt \
  -r /workspace/tlt-experiments/results \
  -k nvidia_tlt \
  -m /workspace/tlt-experiments/results/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt

I got this error:

Total params: 14,432,480
Trainable params: 14,424,872
Non-trainable params: 7,608


2021-04-11 20:20:10,585 [INFO] __main__: Number of images in the training dataset: 152547
2021-04-11 20:20:10,585 [INFO] __main__: Number of images in the validation dataset: 24610
Epoch 1/24
1/38137 […] - ETA: 91:39:31 - loss: 58.6405WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.772865). Check your callbacks.
2021-04-11 20:20:20,512 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (1.772865). Check your callbacks.
6/38137 […] - ETA: 16:00:16 - loss: 50.6952Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 274, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 270, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 195, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 603, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 265, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 1017, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 36 labels: 5,4,35,4,3,4,1,1 labels seen so far: 5,4
[[{{node loss_2/softmax_loss/CTCLoss}}]]
[[loss_2/softmax_loss/CTCLoss/_6743]]
(1) Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 36 labels: 5,4,35,4,3,4,1,1 labels seen so far: 5,4
[[{{node loss_2/softmax_loss/CTCLoss}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/bin/lprnet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-04-12 00:50:22,428 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I tested with only 24 images and labels for both training and validation, and that run trained fine. Why does it fail on the large dataset? Does it load the entire dataset during training?

Can you paste your characters_list_file here?

@Morganh

  1. tutorial_spec.txt (1.1 KB)
  2. lp_characters.txt (72 Bytes)

Please train without the pretrained model. Currently, using a subset of the default characters_list_file is not supported.

Refer to Lprnet training error (non-null label, index >= num_classes - 1)
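
To locate which samples trigger the CTC error above, a minimal check along these lines may help. It assumes the LPRNet layout of one plain-text label file per image; the paths below are placeholders to adjust to your dataset.

import os

CHARACTERS_FILE = "lp_characters.txt"  # placeholder: your characters_list_file
LABEL_DIR = "train/label"              # placeholder: your label directory

# Load the allowed character set from the characters_list_file.
with open(CHARACTERS_FILE) as f:
    allowed = {line.strip() for line in f if line.strip()}
print(f"{len(allowed)} characters in the list")

# Report every label that contains a character missing from the list.
for name in sorted(os.listdir(LABEL_DIR)):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(LABEL_DIR, name)) as f:
        plate = f.read().strip()
    unknown = sorted({c for c in plate if c not in allowed})
    if unknown:
        print(f"{name}: '{plate}' has characters not in the list: {unknown}")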

LPRNet has two main problems:

1. The characters list must have exactly 35 entries (a quick check is sketched after this post).
2. Changing the model's input dimensions when training from scratch is not supported.

If possible, please fix these bugs in a future version of TLT.
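
For item 1, a quick sanity check of the characters file might look like this ("lp_characters.txt" is a placeholder for your characters_list_file):

# Hypothetical check of the 35-entry constraint mentioned in item 1 above.
with open("lp_characters.txt") as f:
    chars = [line.strip() for line in f if line.strip()]

duplicates = sorted({c for c in chars if chars.count(c) > 1})
print(f"{len(chars)} characters found (expected 35)")
if duplicates:
    print(f"duplicate entries: {duplicates}")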

Item 1 will be fixed in the next release.
For item 2, could you describe it in more detail and also paste the steps/log here? Thanks.