LPRNet: Invalid loss, terminating training

• Hardware (RTX 3090)
• Network Type (LPRNet)
• TLT Version (dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm'], format_version: 2.0, toolkit_version: 3.21.11, published_date: 11/08/2021)
• Training spec file: tutorial_spec_scratch.txt (1.3 KB)
• How to reproduce the issue?

I am trying to train an LPRNet model and it is crashing with the following error:

Non-trainable params: 7,608                                                                                                                                             
__________________________________________________________________________________________________                                                                      
2021-12-21 13:59:08,637 [INFO] __main__: Number of images in the training dataset:        9038                                                                          
2021-12-21 13:59:08,638 [INFO] __main__: Number of images in the validation dataset:      1537                                                                          
Epoch 1/100                                                                                                                                                             
  1/565 [..............................] - ETA: 1:03:53 - loss: 23.7524WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.521408). Check your callbacks.
2021-12-21 13:59:15,803 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (0.521408). Check your callbacks.                        
 17/565 [..............................] - ETA: 4:06 - loss: 21.9864Batch 17: Invalid loss, terminating training                                                        
 18/565 [..............................] - ETA: 3:54 - loss: inf    c3adbbcc58e0:64:98 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>                                  
c3adbbcc58e0:64:98 [0] NCCL INFO NET/Plugin : Plugin load returned 12 : libnccl-net.so: cannot open shared object file: No such file or directory.                      
c3adbbcc58e0:64:98 [0] NCCL INFO NET/IB : No device found.                                                                                                              
c3adbbcc58e0:64:98 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>                                                                            
c3adbbcc58e0:64:98 [0] NCCL INFO Using network Socket                                                                                                                   
NCCL version 2.9.9+cuda11.3                                                                                                                                             
c3adbbcc58e0:64:98 [0] NCCL INFO Channel 00/32 :    0
[... identical "Channel NN/32 :    0" lines for channels 01-31 elided ...]
c3adbbcc58e0:64:98 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 ... [31] -1/-1/-1->0->-1
c3adbbcc58e0:64:98 [0] NCCL INFO Connected all rings                                                                                                                    
c3adbbcc58e0:64:98 [0] NCCL INFO Connected all trees                                                                                                                    
c3adbbcc58e0:64:98 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer                                                                            
c3adbbcc58e0:64:98 [0] NCCL INFO comm 0x7f134dd56cc0 rank 0 nranks 1 cudaDev 0 busId 9000 - Init COMPLETE
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/generic_utils.py:372: RuntimeWarning: invalid value encountered in multiply
  self._values[k][0] += v * (current - self._seen_so_far)                                                                                                               
 18/565 [..............................] - ETA: 4:21 - loss: nan                                                                                                        

*******************************************                                         
Accuracy: 373 / 1537  0.24268054651919324                                           
*******************************************                                         


2021-12-21 19:29:40,169 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

FYI: I am using the official us_lp_characters.txt with 35 characters; however, my dataset's max label length is 20 and its character histogram is {'3': 5327, '0': 8335, '6': 5239, '9': 5075, '2': 6157, '7': 5011, '1': 5846, '8': 6077, '4': 6537, '5': 5167, 'U': 113, 'P': 113, 'C': 113, 'E': 1}, i.e., I only have [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, U, P, C, E] in my dataset. I didn't change the official us_lp_characters.txt, however.

I changed max_len to 20 as that was the max length in my dataset.
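
For reference, this is roughly how I computed the character histogram and max label length above (a minimal sketch; the labels/ directory and one-plate-string-per-.txt-file layout are assumptions about my dataset, not TAO code):

from collections import Counter
from pathlib import Path

# Count every character across all label files and track the longest label.
counts = Counter()
max_len = 0
for label_file in Path("labels").glob("*.txt"):  # hypothetical label directory
    label = label_file.read_text().strip()
    counts.update(label)
    max_len = max(max_len, len(label))

print("char dict:", dict(counts))
print("max length:", max_len)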

I am not sure what's causing this. I tried learning rates as low as the following, with batch size 16:
min_learning_rate: 1e-6
max_learning_rate: 1e-5

Thank you!

Could you try to modify us_lp_characters.txt and train again? Thanks.

Did that, same thing.

I tried updating us_lp_characters.txt.
I tried decreasing soft_lr from 0.001 to 0.0001, but got the same error.
I tried removing that 'E': 1 entry from my dataset, as it seemed to be an outlier, but still the same.

The train command I am using:

tao lprnet train --gpus=1 --gpu_index=0 --use_amp \
  -e /workspace/tao-experiments/lprnet/specs/tutorial_spec_scratch.txt \
  -r /workspace/tao-experiments/lprnet/experiments/experiment_dir_unpruned \
  -k nvidia_tlt \
  -m /workspace/tao-experiments/lprnet/pretrained_lprnet_baseline18/lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt
I even trained without the pretrained model (-m), but still got the same error.

Could you please trigger the experiments below to narrow this down?

  1. Retry without "--use_amp".
  2. If that still fails, please try splitting the dataset in half and retrying.

I tried without AMP but got the same error. Also, one update:
I have a {'\n': 6629} entry in my dataset's character histogram, i.e., each im.txt label file has a trailing \n at the end.
I hope the labels are stripped before training inside TAO, so this shouldn't be a problem, right?
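
In case TAO does not strip them, here is a minimal sketch of how I would rewrite the label files without the trailing newline (again assuming one plate string per .txt file under a labels/ directory):

from pathlib import Path

# Strip surrounding whitespace (including the trailing '\n') from every label file.
for label_file in Path("labels").glob("*.txt"):  # hypothetical label directory
    label_file.write_text(label_file.read_text().strip())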

Could you share an example of your label file?

Never mind, I removed the trailing '\n' at the end and still got the error.

I also tried with half the dataset, randomly split, but still got the same error.
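
For the split I copied a random half of the image/label pairs into a new dataset directory, roughly like this (a rough sketch; the image/ and label/ layout and .jpg extension are assumptions from my setup):

import random
import shutil
from pathlib import Path

src_img, src_lbl = Path("train/image"), Path("train/label")  # assumed layout
dst_img, dst_lbl = Path("half/image"), Path("half/label")
dst_img.mkdir(parents=True, exist_ok=True)
dst_lbl.mkdir(parents=True, exist_ok=True)

images = sorted(src_img.glob("*.jpg"))
random.seed(0)          # reproducible split
random.shuffle(images)
for img in images[: len(images) // 2]:
    # Each image's label shares its stem, e.g. foo.jpg <-> foo.txt.
    shutil.copy(img, dst_img / img.name)
    shutil.copy(src_lbl / f"{img.stem}.txt", dst_lbl / f"{img.stem}.txt")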

Logs with the half dataset:

__________________________________________________________________________________________________                                                                      
lstm (LSTM)                     (None, 24, 512)      8423424     flatten_feature[0][0]                                                                                  
__________________________________________________________________________________________________                                                                      
td_dense (TimeDistributed)      (None, 24, 36)       18468       lstm[0][0]                                                                                             
__________________________________________________________________________________________________                                                                      
softmax (Softmax)               (None, 24, 36)       0           td_dense[0][0]                                                                                         
==================================================================================================                                                                      
Total params: 14,432,480                                                                                                                                                
Trainable params: 14,424,872                                                                                                                                            
Non-trainable params: 7,608                                                                                                                                             
__________________________________________________________________________________________________                                                                      
2021-12-21 15:32:44,374 [INFO] __main__: Number of images in the training dataset:        3646                                                                          
2021-12-21 15:32:44,374 [INFO] __main__: Number of images in the validation dataset:      1605                                                                          
Epoch 1/100                                                                                                                                                             
  1/114 [..............................] - ETA: 15:07 - loss: 13.7407WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.550720). Check your callbacks.
2021-12-21 15:32:52,963 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (0.550720). Check your callbacks.                        
  6/114 [>.............................] - ETA: 2:39 - loss: 19.4261Batch 6: Invalid loss, terminating training                                                         
  7/114 [>.............................] - ETA: 2:17 - loss: inf    73800b3ca595:64:98 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>                                  
73800b3ca595:64:98 [0] NCCL INFO NET/Plugin : Plugin load returned 12 : libnccl-net.so: cannot open shared object file: No such file or directory.                      
73800b3ca595:64:98 [0] NCCL INFO NET/IB : No device found.                                                                                                              
73800b3ca595:64:98 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>                                                                            
73800b3ca595:64:98 [0] NCCL INFO Using network Socket                                                                                                                   
NCCL version 2.9.9+cuda11.3                                                                                                                                             
73800b3ca595:64:98 [0] NCCL INFO Channel 00/32 :    0
[... identical "Channel NN/32 :    0" lines for channels 01-31 elided ...]
73800b3ca595:64:98 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 ... [31] -1/-1/-1->0->-1
73800b3ca595:64:98 [0] NCCL INFO Connected all rings
73800b3ca595:64:98 [0] NCCL INFO Connected all trees
73800b3ca595:64:98 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
73800b3ca595:64:98 [0] NCCL INFO comm 0x7f74c9d55b00 rank 0 nranks 1 cudaDev 0 busId 9000 - Init COMPLETE
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/generic_utils.py:372: RuntimeWarning: invalid value encountered in multiply
  self._values[k][0] += v * (current - self._seen_so_far)
  7/114 [>.............................] - ETA: 2:33 - loss: nan

*******************************************
Accuracy: 370 / 1605  0.23052959501557632
*******************************************


2021-12-21 21:03:29,967 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

How about training with the other half?

So, I reduced the train set from 9k to only 1k but still got that same error. Not sure what is wrong.

May I know if you can run successfully with the default settings mentioned in the LPRNet Jupyter notebook, which trains the OpenALPR dataset?

To narrow down, could you please try with just a small subset of images/labels? For example, 30 or 50.

I'll try with a small 200-image train dataset, and if it fails I'll share it with you so you can reproduce the issue at your end.

Yes, please go ahead. Thanks a lot.

I trained on a 200-image subset and it worked. What could cause this issue on the full dataset? Any idea what I can do to fix it?

Just try to find the culprit images/labels.

Is there something I can change in the training code inside the TAO container to print the batch of images being trained to the logs, so I can at least shortlist the batches that cause issues?

Also, is it normal to get such logs in the case of a bad dataset or bad training?

Sorry, there is no parameter to print which images are being trained.

I suggest using a bisection debugging method.
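
The idea, as an illustrative sketch (not TAO code; fails() stands for a hypothetical short training run on the given subset that returns True when it hits the invalid loss):

def find_culprit(samples, fails):
    # Keep whichever half of the remaining samples still reproduces
    # the invalid-loss failure, until one sample is left.
    while len(samples) > 1:
        half = samples[: len(samples) // 2]
        samples = half if fails(half) else samples[len(half):]
    return samples[0]  # the culprit image/label pair

This assumes a single bad sample; if several are bad, remove each culprit found and repeat on the remainder.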

This is not related.

Important update on this:
I debugged all my images, and from 20k I narrowed it down to 1 image on which training fails with the above error.

The image itself is fine; it is the label that is the issue. The label has length 20, e.g. 03100284403828848048.

If I include this sample in the train set, training fails. I tried all possible learning rates, big and small; it always fails.

But if I truncate the label to length 18 by removing the last 2 characters, making it 031002844038288480, it trains perfectly, regardless of what learning rate I use.

I find this very strange. Do you have any idea why this could happen?
Also, my max_label_length is 25.
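
My only guess so far (unconfirmed): the model summary above shows 24 output time steps, and CTC loss becomes infinite whenever a label cannot be aligned to the output sequence, which needs at least len(label) steps plus one extra step per adjacent pair of identical characters. A quick check on my label (min_ctc_steps is my own helper, not part of TAO):

def min_ctc_steps(label: str) -> int:
    # CTC must emit a blank between adjacent identical characters,
    # so aligning `label` needs len(label) + (number of such pairs) steps.
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats

print(min_ctc_steps("03100284403828848048"))  # 23 -- right at the 24-step limit
print(min_ctc_steps("031002844038288480"))    # 21 -- comfortable margin

23 still fits within 24, so this may not be the whole story, but labels this close to the model's output length seem worth checking.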

How about the other labels that are also length 20? Are they fine? In other words, did only this label fail?