NaN values appear while training YOLOv4 with a ResNet-18 pretrained model

TLT Version : docker_tag: v3.21.08-py3
config file : spec.txt (2.3 KB)
Training log file : yolov4_training_log_resnet18.csv (1.2 KB)
Terminal output saved in txt: terminal file.txt (45.7 KB)

My annotated dataset has only three classes, listed below and also specified in the config file:
car,Truck,pedestrian

Hi,

I am a beginner with the TAO Toolkit. I am trying to train YOLOv4 with a ResNet-18 pretrained model on a dataset that has 3 classes and 165 images. I have a few questions:

  1. When training started, I checked the log file and saw nan values. I also tried reducing the learning rate, but the log file still shows the same nan values. How can I resolve this nan issue?

  2. Which parameters in the config file are responsible for the nan values, and what exactly should they be set to?

  3. During the 1st epoch it says: UserWarning: Method on_batch_end() is slow compared to the batch update (1.964731). Check your callbacks. % delta_t_median). Is that an error? If it is, will it cause any problems, and how can I fix it?

Looking forward to hearing from you. Thanks!

Epoch 1/80
2/8 [======>.......................] - ETA: 2:00 - loss: 15.2953/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.964731). Check your callbacks.
  % delta_t_median)
7/8 [=========================>....] - ETA: 6s - loss: 15.1834

No, it is not an error.
Also, can you continue the training? The training log you attached above does not show any NaN error.

spec.txt (2.3 KB)

Hi,

I continued the training and the epochs run fine, but as the screenshot of the log file shows, nan values are appearing. How can I resolve this?

When the model is saved after every 10th epoch, the log shows numeric values, but why does it show nan for the remaining epochs in the training log file? I have also attached the config file for your reference.

Since you set “checkpoint_interval: 10”, validation does not run at the 2nd, 3rd, 4th epoch, etc.
If you set checkpoint_interval: 1, there will be no nan results in the sheet.
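
As a rough illustration of why most rows in the sheet are empty, the sketch below lists the epochs at which validation would run if it is triggered only every checkpoint_interval epochs, as described above. The helper name validation_epochs is hypothetical and this is not TAO code, only the arithmetic behind the behaviour.

```python
def validation_epochs(num_epochs, checkpoint_interval):
    """Return the 1-based epochs at which validation is assumed to run.

    Assumption: validation is triggered only on epochs that are a
    multiple of checkpoint_interval.
    """
    return [e for e in range(1, num_epochs + 1) if e % checkpoint_interval == 0]


# With 80 epochs and an interval of 10, only 8 epochs get AP/mAP/val_loss values.
print(validation_epochs(80, 10))
```

Every epoch not in that list would have nothing to report for the validation columns.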

Hi,

I changed “checkpoint_interval: 10” to “checkpoint_interval: 1”, and it runs fine.
But now it validates and saves weights at every epoch, so the number of weight files grows. I want to validate and save the weight file every 10 epochs without any nan values in the log file. How can we do that? And how are the nan values connected to this checkpoint interval?

Actually the nan values in the log file just mean “not available” because the AP/mAP/val_loss is not available if validation is not triggered.
You can delete them.

So if I set “checkpoint_interval: 10” instead of checkpoint_interval: 1, the nan values appearing in the log file don’t affect training, right?

The nan values in the sheet do not affect training. They just mean the AP/mAP/val_loss values are “not available”.
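
If the nan rows are distracting, one option is to filter the log after training instead of changing checkpoint_interval. The following is a minimal sketch using only the Python standard library. The column names (epoch, loss, mAP) and the sample rows are made up for illustration; check the header of your own CSV log and adjust the metric argument accordingly.

```python
import csv
import io
import math


def rows_with_validation(csv_text, metric="mAP"):
    """Keep only rows whose `metric` column holds a real number (not nan or empty)."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(metric) or "").strip()
        try:
            value = float(raw)  # float("nan") parses successfully
        except ValueError:
            continue  # empty or non-numeric cell
        if not math.isnan(value):
            kept.append(row)
    return kept


# Made-up sample in the spirit of a training log; not real results.
sample = "epoch,loss,mAP\n8,9.1,nan\n9,8.7,nan\n10,8.2,0.41\n"
for row in rows_with_validation(sample):
    print(row["epoch"], row["mAP"])
```

To use it on a real file, read the file contents into a string first and pass that string as csv_text.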

OK, got it. Now after inference I got mAP: 96%, but some of the images have multiple bounding boxes for a single object. What should I do to avoid this?

The mAP is 96%. Do you mean the other 4% have multiple bounding boxes for a single object?

No, I am just saying that after inference I saw that some images got multiple bounding boxes with class probabilities, and some images were predicted wrongly. What should I do at this point? I also applied a threshold value of 0.5, but the problem persists. Can you please help me resolve this?

For “multiple bounding boxes”, can you share an example image?

Multiple bounding boxes: 16
Multiple bounding boxes: 187
Correct: 131
Config file: Uploading: spec.txt…

I am doing custom object detection on an iPhone products dataset, and I have also attached the config file. All image annotations are correct, and I am using 586 images in total. But after inference, multiple bounding boxes appear for a single object and some images are predicted wrongly.

I cannot open the latest spec.txt file. Is it the same as the spec file at the top of this topic?

No, it is not the same; it is different from the one at the top. I have reattached the file for your reference.

Configuration file: config file.txt (5.2 KB)

Can you set a larger “-t” and retry?
-t, --draw_conf_thres : Threshold for drawing a bbox
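
To see what a larger drawing threshold does, here is a minimal sketch of confidence filtering in plain Python. It is not the TAO implementation; the function name filter_by_confidence, the tuple layout, and the detections are all made up for illustration.

```python
def filter_by_confidence(detections, threshold):
    """Keep detections whose confidence is at least `threshold`.

    Each detection is assumed to be (class_name, confidence, box).
    """
    return [d for d in detections if d[1] >= threshold]


# Made-up detections: (class_name, confidence, (x1, y1, x2, y2))
detections = [
    ("car", 0.92, (10, 20, 110, 90)),
    ("car", 0.55, (14, 24, 118, 95)),   # second, weaker box on the same object
    ("Truck", 0.30, (200, 40, 320, 150)),
]

print(len(filter_by_confidence(detections, 0.5)))  # both car boxes survive
print(len(filter_by_confidence(detections, 0.8)))  # only the strongest box is left
```

Note that a higher threshold only hides low-confidence boxes. It does not merge overlapping boxes that all have high confidence.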
