I noticed something weird happening: the avg loss, classification loss and regression loss all start printing as “nan” halfway through the epoch. Not sure why this is happening.
Hi @user122459, normally onnx_export.py will select your best model (in the case of SSD models, the one with the lowest loss in the filename), however since the loss is NaN it is unable to do this. So you can run it manually like so:
However, this issue with the inf/nan losses will mean that your model is unlikely to detect your objects correctly. Typically you want to debug which item(s) in your training dataset are causing the inf/nan loss. To do this, I recommend uncommenting this line of code:
Then run train_ssd.py with the options --batch-size=1 --workers=1 --debug-steps=1
With a batch size of 1, the image ID that gets printed out directly before the inf/nan loss is the one that is causing the issue.
You can then drill down and inspect that image’s XML file to see if anything is awry (or remove it from the dataset).
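If you'd rather not eyeball the XML by hand, here is a rough sketch of an automated check. It assumes the annotations are Pascal VOC-style XML (a `size` element with `width`/`height`, and `object` elements with a `bndbox` holding `xmin`/`ymin`/`xmax`/`ymax`). The helper name `find_bad_boxes` and the sample annotation are made up for illustration and are not part of train_ssd.py. It only flags a few plausible problems (missing or non-numeric coordinates, zero-size or inverted boxes, boxes outside the image), so a clean result doesn't guarantee the annotation is fine.

```python
import math
import xml.etree.ElementTree as ET


def _num(parent, tag):
    """Read a child element as a finite float, or return None if that fails."""
    text = parent.findtext(tag) if parent is not None else None
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None


def find_bad_boxes(xml_text):
    """Return (object name, reason) pairs for suspicious entries in one annotation."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width, height = _num(size, "width"), _num(size, "height")

    problems = []
    if not width or not height or width < 0 or height < 0:
        problems.append(("<image>", "missing or zero image size"))
        width = height = None  # skip the bounds check below

    for obj in root.iter("object"):
        name = obj.findtext("name", "<unnamed>")
        box = obj.find("bndbox")
        coords = [_num(box, tag) for tag in ("xmin", "ymin", "xmax", "ymax")]
        if any(c is None for c in coords):
            problems.append((name, "missing or non-numeric coordinate"))
            continue
        xmin, ymin, xmax, ymax = coords
        if xmax <= xmin or ymax <= ymin:
            problems.append((name, "zero or negative size"))
        elif xmin < 0 or ymin < 0 or (
            width is not None and (xmax > width or ymax > height)
        ):
            problems.append((name, "outside image bounds"))
    return problems


# Made-up annotation: one good box, one zero-width box, one box past the right edge.
SAMPLE = """
<annotation>
  <size><width>100</width><height>80</height></size>
  <object><name>ok</name>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>50</xmax><ymax>40</ymax></bndbox></object>
  <object><name>flat</name>
    <bndbox><xmin>20</xmin><ymin>30</ymin><xmax>20</xmax><ymax>60</ymax></bndbox></object>
  <object><name>wide</name>
    <bndbox><xmin>5</xmin><ymin>5</ymin><xmax>120</xmax><ymax>40</ymax></bndbox></object>
</annotation>
"""

if __name__ == "__main__":
    for name, reason in find_bad_boxes(SAMPLE):
        print(f"{name}: {reason}")
```

To use it on the image you found, read that image's XML file into a string and pass it to `find_bad_boxes`; you could also loop it over every annotation file in the dataset to catch other bad entries before retraining.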