NaN error while training custom dataset

Hi, I am getting a NaN error when training on a custom dataset of 500 images. I investigated, and the problem could be the learning rate. Could someone help me adjust the learning rate? I am using the Hello AI World method.

You can change the learning rate using the --learning-rate argument to train_ssd.py
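To show why the learning rate is a plausible suspect, here is a toy sketch in plain Python. It is not the SSD loss or anything from train_ssd.py; it only runs gradient descent on f(x) = x^2, and the function name first_nan_step is made up for this illustration. With a step size that is too large the value grows until it overflows and the loss turns into NaN, while a small step size converges.

```python
import math


def first_nan_step(learning_rate, steps=2000, x=1.0):
    """Gradient descent on f(x) = x * x.

    Returns the first step at which the loss is NaN, or None if that
    never happens within `steps` iterations. Toy example only.
    """
    for step in range(steps):
        loss = x * x                     # overflows to inf, never raises
        if math.isnan(loss):
            return step
        grad = 2.0 * x                   # derivative of x * x
        x = x - learning_rate * grad     # inf - inf eventually yields NaN
    return None


if __name__ == "__main__":
    print("lr=0.1 ->", first_nan_step(0.1))   # small step: converges
    print("lr=1.5 ->", first_nan_step(1.5))   # large step: diverges
```

If lowering --learning-rate makes the NaN go away, divergence of this kind is the likely cause. If it does not, the annotations are the next thing to check.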

However, it may be more likely that you have some malformed or corrupt annotations in your dataset. What I recommend is to uncomment this line of code:

https://github.com/dusty-nv/pytorch-ssd/blob/8ed842a408f8c4a8812f430cf8063e0b93a56803/vision/datasets/voc_dataset.py#L76

And then run train_ssd.py with --batch-size 1 --debug-steps 1

Then when you first see the NaN appear, check the most recent image to be loaded. Then go inspect that image’s XML annotations to confirm that they are valid.
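Inspecting the XML by hand gets tedious with many images, so a small checker can help. The sketch below is not part of pytorch-ssd. It assumes the standard Pascal VOC layout (a size element with width and height, and object elements each holding a name and a bndbox with xmin, ymin, xmax, ymax). The function name check_voc_annotation and the list of checks are my own choices for what "valid" might mean here.

```python
import math
import xml.etree.ElementTree as ET


def check_voc_annotation(xml_text):
    """Return a list of problems found in one Pascal VOC XML annotation."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"XML does not parse: {err}"]

    problems = []
    width = height = None
    size = root.find("size")
    if size is not None:
        try:
            width = float(size.findtext("width"))
            height = float(size.findtext("height"))
        except (TypeError, ValueError):
            problems.append("image size is missing or not numeric")

    for i, obj in enumerate(root.findall("object")):
        if not (obj.findtext("name") or "").strip():
            problems.append(f"object {i}: empty class name")
        box = obj.find("bndbox")
        if box is None:
            problems.append(f"object {i}: missing bndbox")
            continue
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag))
                for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            problems.append(f"object {i}: missing or non-numeric coordinate")
            continue
        if not all(math.isfinite(v) for v in (xmin, ymin, xmax, ymax)):
            problems.append(f"object {i}: coordinate is not finite")
            continue
        if xmax <= xmin or ymax <= ymin:
            problems.append(f"object {i}: zero or negative box size")
        if xmin < 0 or ymin < 0:
            problems.append(f"object {i}: negative coordinate")
        if width and height and (xmax > width or ymax > height):
            problems.append(f"object {i}: box extends past the image")
    return problems


if __name__ == "__main__":
    sample = (
        "<annotation><size><width>100</width><height>80</height></size>"
        "<object><name>cat</name><bndbox><xmin>10</xmin><ymin>10</ymin>"
        "<xmax>10</xmax><ymax>40</ymax></bndbox></object></annotation>"
    )
    for problem in check_voc_annotation(sample):
        print(problem)
```

To scan a whole dataset, read each .xml file in the annotations folder and pass its text to the function. Any file that returns a non-empty list is worth opening by hand.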

You are the best, dusty. I will try that, thanks.

So, I am getting NaN on random images. It starts at image 300/500, or 420/500, or 500/500; it gives me NaN on a different image every time I run it. In addition, when I run it as normal, the NaNs appear in epoch 27 and then closer and closer to epoch 0 with every try.

Since the problem persisted, I collected another dataset. This new dataset works without NaNs. From now on, I will train the network every time I add more images to the dataset; that way I can figure out which batch of annotations is corrupted. 👍

This is probably because the training set is loaded in random order by PyTorch (which is done to improve the training results). However, you can change this by setting shuffle = False on this line:

https://github.com/dusty-nv/pytorch-ssd/blob/8ed842a408f8c4a8812f430cf8063e0b93a56803/train_ssd.py#L235

The data will then always be loaded in the same order.

OK great, glad you got it working! If you get NaN in the future, uncomment that line of code I linked to above that prints out the image IDs as they are loaded. Then set shuffle = False and run with --batch-size=1 --debug-steps 1, and you will be able to see which image(s) cause the NaN to occur.
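The idea of matching the first NaN to an image can be sketched in a few lines. This is not code from train_ssd.py. It assumes you have collected (image ID, loss) pairs in loading order, one per step, which is what a batch size of 1 gives you. The name find_first_nan and the sample values are made up for illustration.

```python
import math


def find_first_nan(steps):
    """Return (step_index, image_id) for the first non-finite loss.

    `steps` is an iterable of (image_id, loss) pairs in loading order.
    Returns None if every loss is a finite number.
    """
    for index, (image_id, loss) in enumerate(steps):
        if not math.isfinite(loss):      # catches both NaN and inf
            return index, image_id
    return None


if __name__ == "__main__":
    # Hypothetical values, not real training output.
    log = [("img_001", 2.5), ("img_002", float("nan")), ("img_003", 1.0)]
    print(find_first_nan(log))
```

With shuffling disabled, the same image ID should come back on every run if a single annotation is at fault. If the ID changes from run to run, a single bad file is less likely to be the cause.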

You know what, I am going to follow those instructions and fix the previous dataset too. It has 500 images and 2000 annotations, which took time to create, and I don’t want it to go to waste. I am going to try shuffle = False, and if it works I am going to merge the files with the new dataset. I will let you know how it goes.

So, it does not work. The NaNs still appear randomly. No big deal, I already collected another functional dataset. Thanks.
