• Hardware (RTX 3090)
• Network Type (LprNet)
• TLT Version: dockers ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm'], format_version: 2.0, toolkit_version: 3.21.11, published_date: 11/08/2021
• Training spec file: tutorial_spec_scratch.txt (1.3 KB)
• How to reproduce the issue?
I am trying to train an LPRNet model and it is crashing with the following error:
Non-trainable params: 7,608
__________________________________________________________________________________________________
2021-12-21 13:59:08,637 [INFO] __main__: Number of images in the training dataset: 9038
2021-12-21 13:59:08,638 [INFO] __main__: Number of images in the validation dataset: 1537
Epoch 1/100
1/565 [..............................] - ETA: 1:03:53 - loss: 23.7524WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.521408). Check your callbacks.
2021-12-21 13:59:15,803 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (0.521408). Check your callbacks.
17/565 [..............................] - ETA: 4:06 - loss: 21.9864Batch 17: Invalid loss, terminating training
18/565 [..............................] - ETA: 3:54 - loss: inf c3adbbcc58e0:64:98 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
c3adbbcc58e0:64:98 [0] NCCL INFO NET/Plugin : Plugin load returned 12 : libnccl-net.so: cannot open shared object file: No such file or directory.
c3adbbcc58e0:64:98 [0] NCCL INFO NET/IB : No device found.
c3adbbcc58e0:64:98 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>
c3adbbcc58e0:64:98 [0] NCCL INFO Using network Socket
NCCL version 2.9.9+cuda11.3
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 00/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 01/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 02/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 03/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 04/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 05/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 06/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 07/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 08/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 09/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 10/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 11/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 12/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 13/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 14/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 15/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 16/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 17/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 18/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 19/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 20/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 21/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 22/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 23/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 24/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 25/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 26/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 27/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 28/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 29/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 30/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 31/32 : 0
c3adbbcc58e0:64:98 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
c3adbbcc58e0:64:98 [0] NCCL INFO Connected all rings
c3adbbcc58e0:64:98 [0] NCCL INFO Connected all trees
c3adbbcc58e0:64:98 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
c3adbbcc58e0:64:98 [0] NCCL INFO comm 0x7f134dd56cc0 rank 0 nranks 1 cudaDev 0 busId 9000 - Init COMPLETE
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/generic_utils.py:372: RuntimeWarning: invalid value encountered in multiply
self._values[k][0] += v * (current - self._seen_so_far)
18/565 [..............................] - ETA: 4:21 - loss: nan
*******************************************
Accuracy: 373 / 1537 0.24268054651919324
*******************************************
2021-12-21 19:29:40,169 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
FYI: I am using the official us_lp_characters.txt with 35 characters; however, my original dataset looks something like this: the maximum label length is 20 and the character frequency dict is {'3': 5327, '0': 8335, '6': 5239, '9': 5075, '2': 6157, '7': 5011, '1': 5846, '8': 6077, '4': 6537, '5': 5167, 'U': 113, 'P': 113, 'C': 113, 'E': 1}
i.e., my dataset only contains [0,1,2,3,4,5,6,7,8,9,U,P,C,E]. However, I didn't change the official us_lp_characters.txt.
I did change max_len to 20, as that is the maximum label length in my dataset.
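For reference, this is roughly how I computed those dataset stats (a minimal sketch, assuming the usual image/ + label/ layout with one plate string per label .txt file; the path below is illustrative, not my exact setup):

from collections import Counter
from pathlib import Path

# Hypothetical location of the LPRNet label files: one .txt file per image,
# each containing the plate string on a single line.
label_dir = Path("data/train/label")

char_counts = Counter()
max_len = 0
for label_file in label_dir.glob("*.txt"):
    plate = label_file.read_text().strip()
    char_counts.update(plate)          # count each character in the plate
    max_len = max(max_len, len(plate)) # track the longest plate

print("max length:", max_len)
print("char dict:", dict(char_counts))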
I am not sure what's causing this. I tried a learning rate as low as the following with batch size 16:
min_learning_rate: 1e-6
max_learning_rate: 1e-5
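For context, this is roughly how those values sit in the training_config of my spec (a sketch assuming the soft_start_annealing_schedule layout from the tutorial spec; the soft_start and annealing values shown are placeholders, not necessarily what I used):

training_config {
  batch_size_per_gpu: 16
  num_epochs: 100
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 1e-6
      max_learning_rate: 1e-5
      soft_start: 0.001
      annealing: 0.5
    }
  }
}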
Thank you!