Please provide complete information as applicable to your setup.
• Hardware Platform: GPU
• DeepStream Version: 5.1.0
• TensorRT Version: 7.2.2.3
• NVIDIA GPU Driver Version: 460.73.01
Hello. I’m having trouble training the LPRNet model with my custom dataset. My dataset follows this schema:
/dataset
  images/
    img000.jpg
    …
  labels/
    img000.txt
    …
  characters_list.txt
but training seems to run into what looks like an indexing problem.
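In case it helps, here is a quick sanity check I wrote for the layout above (just my own sketch, not part of the TLT tooling; the /dataset path is from my example) to confirm that every image has a matching label file with the same basename:

```python
import os

DATASET = "/dataset"  # root of my dataset, matching the layout above

# Each image should have a label file with the same basename (img000.jpg -> img000.txt).
images = {os.path.splitext(f)[0] for f in os.listdir(os.path.join(DATASET, "images"))}
labels = {os.path.splitext(f)[0] for f in os.listdir(os.path.join(DATASET, "labels"))}

print("images without labels:", sorted(images - labels))
print("labels without images:", sorted(labels - images))
```

That check comes back clean for me, so the pairing itself looks fine.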
It gives the following error:
… (PRUNED OUTPUT - NETWORK ARCHITECTURE) …
==================================================================================================
Total params: 14,432,480
Trainable params: 14,424,872
Non-trainable params: 7,608
2021-05-21 14:01:49,794 [INFO] main: Number of images in the training dataset: 2341
2021-05-21 14:01:49,794 [INFO] main: Number of images in the validation dataset: 2341
Epoch 1/24
1/74 […] - ETA: 7:56:42 - loss: 24.5908WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.727762). Check your callbacks.
2021-05-21 14:08:23,314 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (1.727762). Check your callbacks.
73/74 [============================>.] - ETA: 8s - loss: 14.0392 306c05f17d3d:45:63 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
306c05f17d3d:45:63 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
306c05f17d3d:45:63 [0] NCCL INFO NET/IB : No device found.
306c05f17d3d:45:63 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
306c05f17d3d:45:63 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
306c05f17d3d:45:63 [0] NCCL INFO Channel 00/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 01/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 02/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 03/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 04/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 05/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 06/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 07/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 08/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 09/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 10/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 11/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 12/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 13/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 14/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 15/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 16/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 17/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 18/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 19/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 20/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 21/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 22/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 23/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 24/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 25/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 26/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 27/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 28/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 29/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 30/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Channel 31/32 : 0
306c05f17d3d:45:63 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1/-
306c05f17d3d:45:63 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
306c05f17d3d:45:63 [0] NCCL INFO comm 0x7f018b000ca0 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE
Epoch 00001: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-01.tlt
74/74 [==============================] - 618s 8s/step - loss: 13.9336
Epoch 2/24
73/74 [============================>.] - ETA: 0s - loss: 3.9204
Epoch 00002: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-02.tlt
74/74 [==============================] - 30s 401ms/step - loss: 3.8944
Epoch 3/24
73/74 [============================>.] - ETA: 0s - loss: 1.7050
Epoch 00003: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-03.tlt
74/74 [==============================] - 29s 391ms/step - loss: 1.6941
Epoch 4/24
73/74 [============================>.] - ETA: 0s - loss: 1.0009
Epoch 00004: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-04.tlt
74/74 [==============================] - 29s 390ms/step - loss: 1.0014
Epoch 5/24
73/74 [============================>.] - ETA: 0s - loss: 0.6964
Epoch 00005: saving model to /workspace/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-05.tlt
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 274, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 270, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 195, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 603, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 332, in model_iteration
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 299, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/callbacks/ac_callback.py", line 65, in on_epoch_end
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/callbacks/ac_callback.py", line 42, in _get_accuracy
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/ctc_decoder.py", line 33, in decode_ctc_conf
IndexError: list index out of range
Traceback (most recent call last):
  File "/usr/local/bin/lprnet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-05-21 16:15:09,119 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
…
Training stops after the 5th epoch. Could it be the dataset? I labeled it as accurately as possible…
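Since the traceback ends in ctc_decoder.py with an IndexError, I wonder whether one of my labels uses a character that is missing from characters_list.txt. This is only a rough check I plan to run against my own files (paths follow my layout above; nothing here comes from the TLT scripts):

```python
import glob
import os

DATASET = "/dataset"  # root of my dataset, matching the layout above

# characters_list.txt is supposed to list every character the model can predict, one per line.
with open(os.path.join(DATASET, "characters_list.txt")) as f:
    allowed = {line.strip() for line in f if line.strip()}

# Flag any label whose plate text uses a character not present in characters_list.txt.
for path in sorted(glob.glob(os.path.join(DATASET, "labels", "*.txt"))):
    with open(path) as f:
        plate = f.read().strip()
    unknown = set(plate) - allowed
    if unknown:
        print(f"{os.path.basename(path)}: '{plate}' contains characters not in the list: {sorted(unknown)}")
```

If that turns up nothing, I’m not sure where else the index error could come from.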
Thanks in advance