NaN values appear while training YOLOv4 using a ResNet-18 pretrained model

user86169 · December 8, 2021, 10:09am

TLT version: docker_tag: v3.21.08-py3
Config file: spec.txt (2.3 KB)
Training log file: yolov4_training_log_resnet18.csv (1.2 KB)
Terminal output saved as txt: terminal file.txt (45.7 KB)

My annotated dataset has only three classes, listed below and also in the config file:
car, Truck, pedestrian

Hi,

I am a beginner with TAO Toolkit. I am trying to train YOLOv4 using a ResNet-18 pretrained model on a dataset with 3 classes and 165 images. A few things I want to know:

While training was running, I checked the log file and saw NaN values. I also tried reducing the learning rate, but the log file still shows NaN values. How can I fix this?
In the config file, which parameters are responsible for the NaN values, and what should they be set to?
In the 1st epoch it prints: UserWarning: Method on_batch_end() is slow compared to the batch update (1.964731). Check your callbacks. % delta_t_median). Is that an error? If it is, will it cause any issues, and how can I fix it?

Epoch 1/80
2/8 [======>…] - ETA: 2:00 - loss: 15.2953/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.964731). Check your callbacks.
% delta_t_median)

Looking forward to hearing from you. Thanks!

Morganh · December 8, 2021, 12:26pm

Epoch 1/80
2/8 [======>.......................] - ETA: 2:00 - loss: 15.2953/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.964731). Check your callbacks.
  % delta_t_median)
7/8 [=========================>....] - ETA: 6s - loss: 15.1834

No, it is not an error.
Also, can you continue the training? The training log you attached above does not show any NaN errors.
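Why this is only a warning: below is a simplified sketch of the timing check, modeled on my reading of Keras 2.x's `CallbackList.on_batch_end`. The 0.95 factor and the 0.1 s floor are assumptions about that Keras version, and `callback_is_slow` is a hypothetical name. Keras times the callbacks after each batch and warns when their median time rivals the batch time; the training step itself is not changed.

```python
import statistics
import warnings


def callback_is_slow(batch_time, callback_times, factor=0.95, min_seconds=0.1):
    """Return True (and emit a warning) if callbacks look slow.

    batch_time: seconds spent on the last batch update.
    callback_times: seconds spent in on_batch_end() callbacks so far.
    Simplified version of the Keras 2.x heuristic (thresholds assumed).
    """
    delta_t_median = statistics.median(callback_times)
    slow = (
        batch_time > 0.0
        and delta_t_median > factor * batch_time
        and delta_t_median > min_seconds
    )
    if slow:
        # Only a warning: nothing about the training step is altered.
        warnings.warn(
            "Method on_batch_end() is slow compared to the batch update (%f). "
            "Check your callbacks." % delta_t_median
        )
    return slow
```

With a very small dataset (8 iterations per epoch here), the first batches include one-off startup work, so this warning early in epoch 1 is expected and can be ignored.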

user86169 · December 8, 2021, 6:00pm

spec.txt (2.3 KB)

Hi,

I continued the training and the epochs ran fine, but as the screenshot of the log file shows, NaN values still occur. How can I fix this?

At every 10th epoch, when the model is saved, the log shows numeric values, but why does it show NaN for all the other epochs? I have also attached the config file for your reference.

Morganh · December 9, 2021, 3:10am

Since you set “checkpoint_interval: 10”, validation will not run at the 2nd, 3rd, 4th epoch, etc.
If you set checkpoint_interval: 1, there will be no NaN results in the sheet.
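A minimal sketch of the relationship described above, assuming validation (and therefore the AP/mAP/val_loss columns) runs only on epochs that are multiples of checkpoint_interval, counting epochs from 1. `validation_epochs` is a hypothetical helper for illustration, not part of TAO.

```python
def validation_epochs(num_epochs, checkpoint_interval):
    """Return the 1-indexed epochs at which validation metrics get logged.

    Assumption: validation runs when epoch % checkpoint_interval == 0;
    every other epoch has no metrics, which shows up as NaN in the CSV.
    """
    if checkpoint_interval < 1:
        raise ValueError("checkpoint_interval must be >= 1")
    return [e for e in range(1, num_epochs + 1) if e % checkpoint_interval == 0]
```

For the 80-epoch run in this thread, `validation_epochs(80, 10)` gives [10, 20, ..., 80], so the other 72 rows would have NaN in the metric columns, while `validation_epochs(80, 1)` covers every epoch.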

user86169 · December 9, 2021, 6:39am

Hi,

I changed it from “checkpoint_interval: 10” to “checkpoint_interval: 1”, and it runs fine.
But now it validates and saves weights at every epoch, so the number of weight files keeps growing. I want to validate and save a weight file every 10 epochs without any NaN values in the log file. How can I do that? And how are the NaN values connected with this checkpoint interval?

Morganh · December 9, 2021, 8:01am

Actually, the NaN values in the log file just mean “not available”: AP/mAP/val_loss are not available when validation is not triggered.
You can delete them.
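If you keep checkpoint_interval: 10 and just want a log without the “not available” rows, here is a minimal sketch using Python's standard csv module. The metric column name "mAP" is an assumption; check the header of your own training log CSV and pass the actual column name. `drop_unvalidated_rows` is a hypothetical helper.

```python
import csv
import io
import math


def drop_unvalidated_rows(csv_text, metric_column="mAP"):
    """Keep only log rows whose metric_column holds a real number.

    Rows for epochs without validation carry 'nan' (or an empty cell),
    which only means the metric was not available for that epoch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        value = (row.get(metric_column) or "").strip()
        try:
            if not math.isnan(float(value)):
                kept.append(row)
        except ValueError:
            # Empty or non-numeric cell: treat it as "not available".
            continue
    return kept
```

For example, `with open("yolov4_training_log_resnet18.csv") as f: rows = drop_unvalidated_rows(f.read())`. Filtering this way only changes what you read from the file; it has no effect on training.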

user86169 · December 9, 2021, 5:01pm

So if I keep “checkpoint_interval: 10” instead of setting checkpoint_interval: 1, the NaN values appearing in the log file don't affect training, right?

Morganh · December 10, 2021, 1:55am

The NaN values in the sheet do not affect training. They just mean “not available” values for AP/mAP/val_loss.

user86169 · December 13, 2021, 4:26am

OK, got it. After inference I got an mAP of 96%, but some of the images have multiple bounding boxes for a single object. What should I do to avoid this?

Morganh · December 13, 2021, 6:54am

The mAP is 96%. Do you mean the other 4% have multiple bounding boxes for a single object?

user86169 · December 13, 2021, 7:40am

No, I am just saying that after inference, some of the images have multiple bounding boxes with class probabilities, and some images are predicted wrongly. What should I do at this point? I also applied a threshold value of 0.5, but the issue persists. Can you please help me fix this?

Morganh · December 13, 2021, 8:06am

For “multiple bounding boxes”, can you share an example image?

user86169 · December 13, 2021, 8:18am

Multiple bounding boxes: 187
Correct: 131
Config file: Uploading: spec.txt…

I am doing custom object detection on an iPhone products dataset and have also attached the config file. All image annotations are correct, and I am using 586 images in total. But after inference, multiple bounding boxes appear for a single object, and some images are predicted wrongly.

Morganh · December 13, 2021, 10:18am

I cannot open the latest spec.txt file. Is it the same as the spec file at the top of this topic?

user86169 · December 13, 2021, 4:41pm

No, it is not the same; it is different from the one at the top. I have reattached the file for your reference.

Configuration file: config file.txt (5.2 KB)

Morganh · December 15, 2021, 5:29pm

Can you set a larger “-t” and retry?
-t, --draw_conf_thres : Threshold for drawing a bbox
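To show why a larger “-t” helps: raising the drawing threshold hides low-confidence boxes, while duplicate boxes that overlap one object are normally removed by non-maximum suppression (NMS). Below is a generic, illustrative sketch of both steps, not TAO's implementation. `filter_detections`, the (class, score, box) tuple layout and `iou_thres` are hypothetical; only the name draw_conf_thres mirrors the -t flag. If your spec has an NMS section (check the TAO YOLOv4 documentation for its exact fields), a lower IoU threshold there suppresses more overlapping boxes.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, draw_conf_thres=0.5, iou_thres=0.5):
    """Drop low-confidence boxes, then apply greedy per-class NMS.

    detections: list of (class_name, score, (x1, y1, x2, y2)) tuples.
    A box is suppressed if a higher-scoring kept box of the same class
    overlaps it with IoU >= iou_thres.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= draw_conf_thres),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for det in candidates:
        cls, _, box = det
        if all(k[0] != cls or iou(k[2], box) < iou_thres for k in kept):
            kept.append(det)
    return kept
```

Raising draw_conf_thres only changes what gets drawn; boxes that are wrong with high confidence point to the model or data rather than the threshold.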

system · December 29, 2021, 5:29pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
