train_ssd.py only works with one dataset; the other one returns Loss: nan

Hey,
I ran into another problem trying to train a new model with my own data. I’m trying to create a model to track a climbing helmet. I’ve successfully trained the model with about 1000 images which I extracted from a video file.

Now I wanted to improve the detection rate, so I created another set of images which show the helmet from all possible angles. When I try to train on these files I get:

2021-04-12 20:45:14 - Epoch: 0, Validation Loss: nan, Validation Regression Loss nan, Validation Classification Loss: nan

To make sure my dataset is in the correct format, I created two simple test-folders. One with an image + annotations from the first training session which works and a second one from my new set which returns nan.

I’ve attached a zip-file with the datasets. ssd_training_test.zip (1.6 MB)

I’m running them with the following command line:

Test1 works fine:
python3 train_ssd.py --dataset-type=voc --data=data/test1 --model-dir=models/test1 --num-epochs=100 --num-workers=10 --batch-size=1

Test2 returns nan:
python3 train_ssd.py --dataset-type=voc --data=data/test2 --model-dir=models/test2 --num-epochs=100 --num-workers=10 --batch-size=1

I have no clue where my error is. Does anybody have any ideas?

Hi @moritz3, typically when NaNs occur, there is something malformed/corrupted about one or more image annotations in the training dataset. Perhaps the bounding box values are very large (overflow) or negative.

I would uncomment this line of code: https://github.com/dusty-nv/pytorch-ssd/blob/e7b5af50a157c50d3bab8f55089ce57c2c812f37/vision/datasets/voc_dataset.py#L76

And then run train_ssd.py with --batch-size=1 --num-workers=1 --debug-steps=1

This will then print out which image is loaded, and you should look for when the loss becomes NaN. The most recent image to be loaded is the one that caused the NaN, so check its annotations.
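
If stepping through the training log is slow, the annotations can also be checked offline. Below is a minimal sketch, assuming the dataset uses the standard Pascal VOC XML tags (size, object, bndbox, xmin, ymin, xmax, ymax). The function names find_bad_boxes and scan_annotations are hypothetical helpers and are not part of pytorch-ssd.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

COORD_TAGS = ("xmin", "ymin", "xmax", "ymax")


def find_bad_boxes(xml_text):
    """Return a list of (object_name, problem) tuples for one VOC annotation."""
    root = ET.fromstring(xml_text)
    # Image size is optional here; bounds checks are skipped if it is missing.
    width = root.findtext("size/width")
    height = root.findtext("size/height")
    problems = []
    for obj in root.iter("object"):
        name = obj.findtext("name", default="?")
        try:
            xmin, ymin, xmax, ymax = (
                float(obj.findtext("bndbox/" + tag)) for tag in COORD_TAGS
            )
        except (TypeError, ValueError):
            # A tag is missing (None) or its text is not a number.
            problems.append((name, "missing or non-numeric coordinate"))
            continue
        if min(xmin, ymin, xmax, ymax) < 0:
            problems.append((name, "negative coordinate"))
        if xmin >= xmax:
            problems.append((name, "xmin >= xmax"))
        if ymin >= ymax:
            problems.append((name, "ymin >= ymax"))
        if width and xmax > float(width):
            problems.append((name, "xmax exceeds image width"))
        if height and ymax > float(height):
            problems.append((name, "ymax exceeds image height"))
    return problems


def scan_annotations(annotations_dir):
    """Map each annotation file name to its problems; clean files are omitted."""
    report = {}
    for xml_path in sorted(Path(annotations_dir).glob("*.xml")):
        problems = find_bad_boxes(xml_path.read_text())
        if problems:
            report[xml_path.name] = problems
    return report
```

Calling scan_annotations("data/test2/Annotations") would then list the suspicious files; the Annotations subfolder name is an assumption based on the usual VOC layout.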

Thanks for your answer! I was able to figure out where my annotation data was causing issues. Sometimes xmin/ymin and xmax/ymax are reversed, so the min value is actually larger than the max value.

Correcting those annotations fixed my issue. I’ll have to make sure my code produces correct annotation data from now on.
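
For anyone hitting the same thing, one way to guard against reversed corners is to normalize each box before writing it to the annotation file. This is only a sketch: normalize_box is a hypothetical helper, and the code that generated the annotations in this thread is not shown.

```python
def normalize_box(xmin, ymin, xmax, ymax):
    """Return (xmin, ymin, xmax, ymax) with each min/max pair in ascending order."""
    # Sorting each pair swaps the corners only when they were reversed.
    x_lo, x_hi = sorted((xmin, xmax))
    y_lo, y_hi = sorted((ymin, ymax))
    if x_lo == x_hi or y_lo == y_hi:
        # A box with zero width or height cannot be repaired by swapping.
        raise ValueError("degenerate box with zero width or height")
    return x_lo, y_lo, x_hi, y_hi
```

For example, normalize_box(120, 80, 40, 200) returns (40, 80, 120, 200), while a box that is already ordered passes through unchanged.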

Thanks again!

OK cool, glad that you were able to find the problematic annotations and get it training!