train_ssd.py only works with one dataset; the other one returns Loss: nan

moritz3 · April 13, 2021, 12:57am

Hey,
I ran into another problem trying to train a new model with my own data. I’m trying to create a model to track a climbing helmet. I’ve successfully trained the model with about 1000 images, which I extracted from a video file.

Now I wanted to improve the detection rate, so I created another set of images that show the helmet from all possible angles. When I try to train on this new set, I get:

2021-04-12 20:45:14 - Epoch: 0, Validation Loss: nan, Validation Regression Loss nan, Validation Classification Loss: nan

To make sure my dataset is in the correct format, I created two simple test folders. The first contains an image plus annotations from the first training session, and it works. The second contains an image plus annotations from my new set, and it returns nan.

I’ve attached a zip-file with the datasets. ssd_training_test.zip (1.6 MB)

I’m running them with the following command line:

Test1 works fine:
python3 train_ssd.py --dataset-type=voc --data=data/test1 --model-dir=models/test1 --num-epochs=100 --num-workers=10 --batch-size=1

Test2 returns nan:
python3 train_ssd.py --dataset-type=voc --data=data/test2 --model-dir=models/test2 --num-epochs=100 --num-workers=10 --batch-size=1

I have no clue where my error is. Does anybody have any ideas?

dusty_nv · April 13, 2021, 1:14am

Hi @moritz3, typically when NaNs occur, one or more image annotations in the training dataset are malformed or corrupted. Perhaps the bounding box values are very large (overflow) or negative.

I would uncomment this line of code: https://github.com/dusty-nv/pytorch-ssd/blob/e7b5af50a157c50d3bab8f55089ce57c2c812f37/vision/datasets/voc_dataset.py#L76

And then run train_ssd.py with --batch-size=1 --num-workers=1 --debug-steps=1

This will then print out which image is loaded, and you should look for when the loss becomes NaN. The most recent image to be loaded is the one that caused the NaN, so check its annotations.
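As a complement to watching the debug output, the annotations can also be scanned offline for the kinds of problems mentioned above (negative values, boxes larger than the image, or inverted min/max). The following is only a rough sketch, assuming the usual Pascal VOC XML layout (`size/width`, `size/height`, and `object/bndbox/xmin` etc.). The function name `check_voc_annotation` is made up for illustration and is not part of pytorch-ssd.

```python
import xml.etree.ElementTree as ET


def check_voc_annotation(xml_text):
    """Return a list of problems found in one Pascal VOC annotation (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    problems = []

    # Image size is optional here; the bounds checks are skipped without it.
    width = root.findtext("size/width")
    height = root.findtext("size/height")
    width = float(width) if width else None
    height = float(height) if height else None

    for i, obj in enumerate(root.iter("object")):
        name = obj.findtext("name", default="?")
        box = obj.find("bndbox")
        if box is None:
            problems.append(f"object {i} ({name}): missing bndbox")
            continue
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            # float(None) raises TypeError, float("abc") raises ValueError.
            problems.append(f"object {i} ({name}): missing or non-numeric coordinate")
            continue

        if min(xmin, ymin, xmax, ymax) < 0:
            problems.append(f"object {i} ({name}): negative coordinate")
        if xmin >= xmax:
            problems.append(f"object {i} ({name}): xmin >= xmax")
        if ymin >= ymax:
            problems.append(f"object {i} ({name}): ymin >= ymax")
        if width is not None and xmax > width:
            problems.append(f"object {i} ({name}): xmax exceeds image width")
        if height is not None and ymax > height:
            problems.append(f"object {i} ({name}): ymax exceeds image height")

    return problems
```

To check a whole dataset, read each XML file in the annotations folder, pass its contents to this function, and print the file name whenever the returned list is not empty.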

moritz3 · April 13, 2021, 2:36pm

Thanks for your answer! I was able to figure out where my annotation data was causing issues. Sometimes xmin/ymin and xmax/ymax are reversed, so the min value is actually larger than the max value.

I’ll have to make sure my code produces the correct annotation data. That fixed my issues.
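For anyone hitting the same problem, one possible way to guard against reversed coordinates is to sort each axis before writing the annotation. This is just a minimal sketch, not the code actually used here, and `normalize_box` is a hypothetical helper name.

```python
def normalize_box(xmin, ymin, xmax, ymax):
    """Return the box with min/max in the correct order on each axis."""
    left, right = sorted((xmin, xmax))
    top, bottom = sorted((ymin, ymax))
    return left, top, right, bottom


# Example: x values were written in the wrong order.
print(normalize_box(120, 40, 80, 90))
```

Note that this only repairs the ordering. A box with zero width or height (min equal to max) would still pass through unchanged and should be dropped or fixed separately.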

Thanks again!

dusty_nv · April 13, 2021, 6:48pm

OK cool, glad that you were able to find the problematic annotations and get it training!
