Nan error while training custom dataset

cespedesk · June 18, 2021, 1:01pm

Hi, I am getting a NaN error when training on a custom dataset of 500 images. I investigated, and the problem could be the learning rate. Could someone help me adjust it? I am using the Hello AI World method.

dusty_nv · June 18, 2021, 4:21pm

You can change the learning rate using the --learning-rate argument to train_ssd.py

However it may be more likely that you have some malformed/corrupt annotation(s) in your dataset. What I recommend is to uncomment this line of code:

https://github.com/dusty-nv/pytorch-ssd/blob/8ed842a408f8c4a8812f430cf8063e0b93a56803/vision/datasets/voc_dataset.py#L76

And then run train_ssd.py with --batch-size 1 --debug-steps 1

Then when you first see the NaN appear, check the most recent image to be loaded. Then go inspect that image’s XML annotations to confirm that they are valid.
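As a complement to inspecting the XML by hand, here is a minimal sketch of a checker for Pascal VOC style annotations. It assumes the usual VOC tags (size, object, name, bndbox with xmin/ymin/xmax/ymax); the function names check_annotation and scan_annotations are made up for this example and are not part of pytorch-ssd. The checks are things worth a look (zero-area boxes in particular can plausibly break the box encoding), not a confirmed list of what causes NaN in train_ssd.py.

```python
import math
import xml.etree.ElementTree as ET
from pathlib import Path


def _number(parent, tag):
    """Return the tag's text as a finite float, or None if missing or invalid."""
    if parent is None:
        return None
    try:
        value = float(parent.findtext(tag))
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None


def check_annotation(root):
    """Return a list of problems found in one parsed VOC <annotation> element."""
    problems = []

    size = root.find("size")
    width, height = _number(size, "width"), _number(size, "height")
    if not width or not height or width <= 0 or height <= 0:
        problems.append("image size is missing or not positive")
        width = height = None  # skip the bounds check below

    objects = root.findall("object")
    if not objects:
        problems.append("no <object> entries")

    for index, obj in enumerate(objects):
        label = (obj.findtext("name") or "").strip()
        tag = f"object {index} ({label or 'unnamed'})"
        if not label:
            problems.append(f"{tag}: empty class name")

        box = obj.find("bndbox")
        coords = [_number(box, t) for t in ("xmin", "ymin", "xmax", "ymax")]
        if None in coords:
            problems.append(f"{tag}: missing or non-numeric box coordinate")
            continue

        xmin, ymin, xmax, ymax = coords
        if xmax <= xmin or ymax <= ymin:
            problems.append(f"{tag}: zero or negative box area")
        if xmin < 0 or ymin < 0:
            problems.append(f"{tag}: negative box coordinate")
        if width is not None and (xmax > width or ymax > height):
            problems.append(f"{tag}: box outside image bounds")

    return problems


def scan_annotations(annotation_dir):
    """Map each problematic XML file name in a directory to its problems."""
    results = {}
    for xml_path in sorted(Path(annotation_dir).glob("*.xml")):
        try:
            root = ET.parse(xml_path).getroot()
        except ET.ParseError as err:
            results[xml_path.name] = [f"XML parse error: {err}"]
            continue
        problems = check_annotation(root)
        if problems:
            results[xml_path.name] = problems
    return results
```

Calling scan_annotations on the dataset's annotation folder (for example "Annotations", if that is where the XML files live) returns a dict of file names and the problems found in each, so suspicious files can be reviewed before retraining.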

cespedesk · June 18, 2021, 4:32pm

You are the best, dusty. I will try that, thanks.

cespedesk · June 18, 2021, 8:10pm

So, I am getting NaN on random images. It starts at image 300/500, or 420/500, or 500/500; it gives me NaN on a different image every time I run it. In addition, when I run it as normal, the NaNs appear in epoch 27 and then closer and closer to epoch 0 with every try.

cespedesk · June 21, 2021, 1:46pm

Since the problem persisted, I collected another dataset. This new dataset works without NaNs. From now on I will train the network every time I add more images to the dataset; that way I can figure out which batch of annotations is corrupted.👍

dusty_nv · June 21, 2021, 2:39pm

This is probably because the training set is loaded in random order by PyTorch (which is done to improve the training results). However, you can change this by setting shuffle = False on this line:

https://github.com/dusty-nv/pytorch-ssd/blob/8ed842a408f8c4a8812f430cf8063e0b93a56803/train_ssd.py#L235

The data will then always be loaded in the same order.

OK great, glad you got it working! If you get NaN in the future, uncomment that line of code I linked to above that prints out the image IDs as they are loaded. Then set shuffle = False and run with --batch-size=1 --debug-steps 1, and you will be able to see which image(s) cause the NaN to occur.
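With shuffling off and a batch size of 1, only the first bad loss is informative: once the weights contain NaN, every later step reports NaN too. A minimal sketch of picking out that first step, assuming you have collected (image ID, loss) pairs yourself in load order, for example by copying them from the console output. The pair format and the function name first_bad_step are assumptions for this example, not the log format of train_ssd.py.

```python
import math


def first_bad_step(steps):
    """Return (index, image_id) of the first NaN or infinite loss, or None.

    steps: iterable of (image_id, loss) pairs in the order the images
    were loaded (hypothetical format, collected by hand).
    """
    for index, (image_id, loss) in enumerate(steps):
        if math.isnan(loss) or math.isinf(loss):
            return index, image_id
    return None


# Example with made-up image IDs and loss values:
history = [("img_001", 5.2), ("img_002", 4.9), ("img_003", float("nan"))]
print(first_bad_step(history))
```

If the image found this way changes from run to run even with a fixed load order, the annotations for that image may not be the cause, and lowering --learning-rate is worth trying.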

cespedesk · June 21, 2021, 4:15pm

You know what, I am going to follow those instructions and fix the previous dataset too. It has 500 images and 2000 annotations that took time to collect, and I don’t want it to go to waste. I am going to try shuffle = False, and if it works I am going to merge the files with the new dataset. I will let you know how it goes.

cespedesk · June 27, 2021, 1:43am

So, it does not work; the NaNs appear randomly. No big deal, I already collected another functional dataset. Thanks.

system · August 26, 2021, 1:44am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
