LPRNet: Invalid loss, terminating training

• Hardware (RTX 3090)
• Network Type (LPRNet)
• TLT Version (dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm'], format_version: 2.0, toolkit_version: 3.21.11, published_date: 11/08/2021)
• Training spec file: tutorial_spec_scratch.txt (1.3 KB)
• How to reproduce the issue?

I am trying to train an LPRNet model and it is crashing with the following error:

Non-trainable params: 7,608                                                                                                                                             
__________________________________________________________________________________________________                                                                      
2021-12-21 13:59:08,637 [INFO] __main__: Number of images in the training dataset:        9038                                                                          
2021-12-21 13:59:08,638 [INFO] __main__: Number of images in the validation dataset:      1537                                                                          
Epoch 1/100                                                                                                                                                             
  1/565 [..............................] - ETA: 1:03:53 - loss: 23.7524WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.521408). Check your callbacks.
2021-12-21 13:59:15,803 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (0.521408). Check your callbacks.                        
 17/565 [..............................] - ETA: 4:06 - loss: 21.9864Batch 17: Invalid loss, terminating training                                                        
 18/565 [..............................] - ETA: 3:54 - loss: inf    c3adbbcc58e0:64:98 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>                                  
c3adbbcc58e0:64:98 [0] NCCL INFO NET/Plugin : Plugin load returned 12 : libnccl-net.so: cannot open shared object file: No such file or directory.                      
c3adbbcc58e0:64:98 [0] NCCL INFO NET/IB : No device found.                                                                                                              
c3adbbcc58e0:64:98 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>                                                                            
c3adbbcc58e0:64:98 [0] NCCL INFO Using network Socket                                                                                                                   
NCCL version 2.9.9+cuda11.3                                                                                                                                             
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 00/32 :    0
[... identical "Channel NN/32 :    0" lines for channels 01-31 elided ...]
c3adbbcc58e0:64:98 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 ... [31] -1/-1/-1->0->-1
c3adbbcc58e0:64:98 [0] NCCL INFO Connected all rings                                                                                                                    
c3adbbcc58e0:64:98 [0] NCCL INFO Connected all trees                                                                                                                    
c3adbbcc58e0:64:98 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer                                                                            
c3adbbcc58e0:64:98 [0] NCCL INFO comm 0x7f134dd56cc0 rank 0 nranks 1 cudaDev 0 busId 9000 - Init COMPLETE
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/generic_utils.py:372: RuntimeWarning: invalid value encountered in multiply
  self._values[k][0] += v * (current - self._seen_so_far)                                                                                                               
 18/565 [..............................] - ETA: 4:21 - loss: nan                                                                                                        

*******************************************                                         
Accuracy: 373 / 1537  0.24268054651919324                                           
*******************************************                                         


2021-12-21 19:29:40,169 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

FYI: I am using the official us_lp_characters.txt with 35 characters; however, my dataset's max label length is 20 and its character histogram is {'3': 5327, '0': 8335, '6': 5239, '9': 5075, '2': 6157, '7': 5011, '1': 5846, '8': 6077, '4': 6537, '5': 5167, 'U': 113, 'P': 113, 'C': 113, 'E': 1}, i.e., I only have [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, U, P, C, E] in my dataset. I didn't change the official us_lp_characters.txt, however.

I changed max_len to 20 as that was the max length in my dataset.
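
For reference, this is roughly how I computed the character histogram and max label length above (a minimal sketch; the labels/ directory and one-plate-string-per-.txt-file layout are assumptions about my dataset, not TAO code):

from collections import Counter
from pathlib import Path

# Count every character across all label files and track the longest label.
counts = Counter()
max_len = 0
for label_file in Path("labels").glob("*.txt"):  # hypothetical label directory
    label = label_file.read_text().strip()
    counts.update(label)
    max_len = max(max_len, len(label))

print("char dict:", dict(counts))
print("max length:", max_len)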

I am not sure what's causing this. I tried learning rates as low as the following, with batch size 16:
min_learning_rate: 1e-6
max_learning_rate: 1e-5

Thank you!

Could you try to modify us_lp_characters.txt and train again? Thanks.

Did that, same thing.

I tried updating us_lp_characters.txt.
I tried decreasing soft_lr from 0.001 to 0.0001, but got the same error.
I tried removing that 'E': 1 entry from my dataset, as it seemed to be an outlier, but still the same.

The train command I am using:

tao lprnet train --gpus=1 --gpu_index=0 --use_amp \
  -e /workspace/tao-experiments/lprnet/specs/tutorial_spec_scratch.txt \
  -r /workspace/tao-experiments/lprnet/experiments/experiment_dir_unpruned \
  -k nvidia_tlt \
  -m /workspace/tao-experiments/lprnet/pretrained_lprnet_baseline18/lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt
I even trained without the pretrained model (-m), but still got the same error.

Could you please trigger the experiments below to narrow this down?

  1. Retry without "--use_amp".
  2. If that still fails, please try splitting the dataset in half and retrying.

I tried without AMP but got the same error. Also, one update:
I have a {'\n': 6629} entry in my dataset's character histogram, i.e., each im.txt label file has a trailing \n at the end.
I hope the labels are stripped before training inside TAO, so this shouldn't be a problem, right?
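
In case TAO does not strip them, here is a minimal sketch of how I would rewrite the label files without the trailing newline (again assuming one plate string per .txt file under a labels/ directory):

from pathlib import Path

# Strip surrounding whitespace (including the trailing '\n') from every label file.
for label_file in Path("labels").glob("*.txt"):  # hypothetical label directory
    label_file.write_text(label_file.read_text().strip())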

Could you share an example of your label file?

Never mind, I removed the trailing '\n' at the end and still got the error.

I also tried with half the dataset, randomly split, but still got the same error.
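
For the split I copied a random half of the image/label pairs into a new dataset directory, roughly like this (a rough sketch; the image/ and label/ layout and .jpg extension are assumptions from my setup):

import random
import shutil
from pathlib import Path

src_img, src_lbl = Path("train/image"), Path("train/label")  # assumed layout
dst_img, dst_lbl = Path("half/image"), Path("half/label")
dst_img.mkdir(parents=True, exist_ok=True)
dst_lbl.mkdir(parents=True, exist_ok=True)

images = sorted(src_img.glob("*.jpg"))
random.seed(0)          # reproducible split
random.shuffle(images)
for img in images[: len(images) // 2]:
    # Each image's label shares its stem, e.g. foo.jpg <-> foo.txt.
    shutil.copy(img, dst_img / img.name)
    shutil.copy(src_lbl / f"{img.stem}.txt", dst_lbl / f"{img.stem}.txt")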

Logs with the half dataset:

__________________________________________________________________________________________________                                                                      
lstm (LSTM)                     (None, 24, 512)      8423424     flatten_feature[0][0]                                                                                  
__________________________________________________________________________________________________                                                                      
td_dense (TimeDistributed)      (None, 24, 36)       18468       lstm[0][0]                                                                                             
__________________________________________________________________________________________________                                                                      
softmax (Softmax)               (None, 24, 36)       0           td_dense[0][0]                                                                                         
==================================================================================================                                                                      
Total params: 14,432,480                                                                                                                                                
Trainable params: 14,424,872                                                                                                                                            
Non-trainable params: 7,608                                                                                                                                             
__________________________________________________________________________________________________                                                                      
2021-12-21 15:32:44,374 [INFO] __main__: Number of images in the training dataset:        3646                                                                          
2021-12-21 15:32:44,374 [INFO] __main__: Number of images in the validation dataset:      1605                                                                          
Epoch 1/100                                                                                                                                                             
  1/114 [..............................] - ETA: 15:07 - loss: 13.7407WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.550720). Check your callbacks.
2021-12-21 15:32:52,963 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (0.550720). Check your callbacks.                        
  6/114 [>.............................] - ETA: 2:39 - loss: 19.4261Batch 6: Invalid loss, terminating training                                                         
  7/114 [>.............................] - ETA: 2:17 - loss: inf    73800b3ca595:64:98 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>                                  
73800b3ca595:64:98 [0] NCCL INFO NET/Plugin : Plugin load returned 12 : libnccl-net.so: cannot open shared object file: No such file or directory.                      
73800b3ca595:64:98 [0] NCCL INFO NET/IB : No device found.                                                                                                              
73800b3ca595:64:98 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>                                                                            
73800b3ca595:64:98 [0] NCCL INFO Using network Socket                                                                                                                   
NCCL version 2.9.9+cuda11.3                                                                                                                                             
73800b3ca595:64:98 [0] NCCL INFO Channel 00/32 :    0
[... identical "Channel NN/32 :    0" lines for channels 01-31 elided ...]
73800b3ca595:64:98 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 ... [31] -1/-1/-1->0->-1
73800b3ca595:64:98 [0] NCCL INFO Connected all rings
73800b3ca595:64:98 [0] NCCL INFO Connected all trees
73800b3ca595:64:98 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
73800b3ca595:64:98 [0] NCCL INFO comm 0x7f74c9d55b00 rank 0 nranks 1 cudaDev 0 busId 9000 - Init COMPLETE
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/generic_utils.py:372: RuntimeWarning: invalid value encountered in multiply
  self._values[k][0] += v * (current - self._seen_so_far)
  7/114 [>.............................] - ETA: 2:33 - loss: nan

*******************************************
Accuracy: 370 / 1605  0.23052959501557632
*******************************************


2021-12-21 21:03:29,967 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

How about training with the other half?

So, I reduced the train set from 9k to only 1k but still got that same error. Not sure what is wrong.

May I know if you can run successfully with the default settings mentioned in the LPRNet Jupyter notebook, which trains the OpenALPR dataset?

To narrow down, could you please try with just a small subset of images/labels? For example, 30 or 50.

I'll try with a small 200-image train dataset, and if it fails I'll share it with you so you can reproduce the issue at your end.

Yes, please go ahead. Thanks a lot.

I trained on a 200-image subset and it worked. What could cause this issue on the full dataset? Any idea what I can do to fix it?

Just try to find the culprit images/labels.

Is there something I can change in the training code inside the TAO container to print the batch of images being trained to the logs, so I can at least shortlist the batches that cause issues?

Also, is it normal to get such logs in the case of a bad dataset or bad training?

Sorry, there is no parameter to print which images are being trained.

I suggest using a bisection debugging method.
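
The idea, as an illustrative sketch (not TAO code; fails() stands for a hypothetical short training run on the given subset that returns True when it hits the invalid loss):

def find_culprit(samples, fails):
    # Keep whichever half of the remaining samples still reproduces
    # the invalid-loss failure, until one sample is left.
    while len(samples) > 1:
        half = samples[: len(samples) // 2]
        samples = half if fails(half) else samples[len(half):]
    return samples[0]  # the culprit image/label pair

This assumes a single bad sample; if several are bad, remove each culprit found and repeat on the remainder.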

This is not related.

Important update on this:
I debugged all my images, and from 20k I narrowed it down to 1 image on which training fails with the above error.

The image itself is fine; it is the label that is the issue. The label has length 20, e.g. 03100284403828848048.

If I include this sample in the train set, training fails. I tried all possible learning rates, big and small; it always fails.

But if I truncate the label to length 18 by removing the last 2 characters, making it 031002844038288480, it trains perfectly, regardless of what learning rate I use.

I find this very strange. Do you have any idea why this could happen?
Also, my max_label_length is 25.
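
My only guess so far (unconfirmed): the model summary above shows 24 output time steps, and CTC loss becomes infinite whenever a label cannot be aligned to the output sequence, which needs at least len(label) steps plus one extra step per adjacent pair of identical characters. A quick check on my label (min_ctc_steps is my own helper, not part of TAO):

def min_ctc_steps(label: str) -> int:
    # CTC must emit a blank between adjacent identical characters,
    # so aligning `label` needs len(label) + (number of such pairs) steps.
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats

print(min_ctc_steps("03100284403828848048"))  # 23 -- right at the 24-step limit
print(min_ctc_steps("031002844038288480"))    # 21 -- comfortable margin

23 still fits within 24, so this may not be the whole story, but labels this close to the model's output length seem worth checking.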

How about the other labels that are also length 20? Are they fine? In other words, did only this label fail?