Hi all,
I want to train LPRNet on my own dataset. As far as I can tell, everything is set up and configured correctly.
My system config:
RAM : 16GB
GPU : 1xGTX-1080-8GB
CPU: Intel Core i3
I set batch size = 4
When I run the command below:
tlt lprnet train --gpus=1 --gpu_index=0 \
  -e /workspace/tlt-experiments/specs/tutorial_spec.txt \
  -r /workspace/tlt-experiments/results \
  -k nvidia_tlt \
  -m /workspace/tlt-experiments/results/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt
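For reference, the dataset section of my tutorial_spec.txt follows the shape from the TLT LPRNet docs, roughly like this (the paths here are placeholders rather than my exact ones, and I'm quoting the field names from the docs, so treat it as a sketch of my config):

dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/train/image"
  }
  characters_list_file: "/workspace/tlt-experiments/specs/us_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/val/label"
    image_directory_path: "/workspace/tlt-experiments/data/val/image"
  }
}

The characters list file is a plain text file with one allowed plate character per line.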
I got this error:
Total params: 14,432,480
Trainable params: 14,424,872
Non-trainable params: 7,608
2021-04-11 20:20:10,585 [INFO] __main__: Number of images in the training dataset: 152547
2021-04-11 20:20:10,585 [INFO] __main__: Number of images in the validation dataset: 24610
Epoch 1/24
1/38137 […] - ETA: 91:39:31 - loss: 58.6405WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.772865). Check your callbacks.
2021-04-11 20:20:20,512 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (1.772865). Check your callbacks.
6/38137 […] - ETA: 16:00:16 - loss: 50.6952
Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 274, in <module>
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 270, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 195, in run_experiment
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
use_multiprocessing=use_multiprocessing)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 603, in fit
steps_name='steps_per_epoch')
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 265, in model_iteration
batch_outs = batch_function(*batch_data)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 1017, in train_on_batch
outputs = self.train_function(ins) # pylint: disable=not-callable
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
run_metadata=self.run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 36 labels: 5,4,35,4,3,4,1,1 labels seen so far: 5,4
[[{{node loss_2/softmax_loss/CTCLoss}}]]
[[loss_2/softmax_loss/CTCLoss/_6743]]
(1) Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 36 labels: 5,4,35,4,3,4,1,1 labels seen so far: 5,4
[[{{node loss_2/softmax_loss/CTCLoss}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File "/usr/local/bin/lprnet", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-04-12 00:50:22,428 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
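Looking at the failing batch (num_classes: 36, labels: 5,4,35,4,3,4,1,1), label index 35 equals num_classes - 1, which as far as I understand is the index CTC reserves for the blank, so no real character should ever map to it. My guess is that one of my plate labels contains a character that is missing from my characters list file and therefore gets mapped to that invalid index. To check, I'm scanning the labels with a quick script like the one below (paths are placeholders for my own directories, and it assumes one character per line in the characters file):

import os

# Hypothetical paths -- replace with the values from your spec file.
CHARS_FILE = "/workspace/tlt-experiments/specs/us_lp_characters.txt"
LABEL_DIR = "/workspace/tlt-experiments/data/train/label"

# Build the set of allowed characters (one per line in the list file).
with open(CHARS_FILE) as f:
    allowed = {line.strip() for line in f if line.strip()}

# Collect label files whose plate text contains characters
# that are not in the allowed set.
bad = []
for name in sorted(os.listdir(LABEL_DIR)):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(LABEL_DIR, name)) as f:
        plate = f.read().strip()
    unknown = sorted({c for c in plate if c not in allowed})
    if unknown:
        bad.append((name, plate, unknown))

print(f"{len(bad)} label file(s) contain characters outside the list")
for name, plate, unknown in bad[:20]:
    print(f"{name}: '{plate}' -> unexpected {unknown}")

Does that sound like the right cause, or can something else trigger this CTCLoss error?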