Error training with jetson-inference

Hello,
I am trying to train a model with the jetson-inference example and export it, but I get this error:

I noticed something weird: the avg loss, classification loss, and regression loss all start printing as “nan” halfway through the epoch. I'm not sure why this is happening.

Please help

Hi @user122459, normally onnx_export.py will select your best model (in the case of SSD models, the one with the lowest loss in the filename). However, since the loss is NaN, it is unable to do this, so you can run it manually like so:

$ python3 onnx_export.py --input=models/detections/mb1-ssd-Epoch-0-Loss-nan.pth --labels=models/detections/labels.txt --output=models/detections/ssd-mobilenet.onnx
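
To illustrate why a NaN loss breaks the automatic selection, here is a minimal sketch of picking the lowest-loss checkpoint from filenames. This is not the actual code in onnx_export.py; the filename pattern is assumed from the checkpoint name above, and parse_loss and best_checkpoint are hypothetical helper names.

```python
import math
import re

# Assumed filename pattern, based on the checkpoint name in this thread,
# e.g. mb1-ssd-Epoch-0-Loss-nan.pth
CHECKPOINT_RE = re.compile(r"-Epoch-(\d+)-Loss-([^/]+)\.pth$")


def parse_loss(filename):
    """Return the loss encoded in a checkpoint filename, or None if absent."""
    match = CHECKPOINT_RE.search(filename)
    if match is None:
        return None
    try:
        # float() accepts "nan" and "inf", so those parse without error.
        return float(match.group(2))
    except ValueError:
        return None


def best_checkpoint(filenames):
    """Return the filename with the lowest finite loss, or None if there is none."""
    best_name = None
    best_loss = math.inf
    for name in filenames:
        loss = parse_loss(name)
        # NaN compares false against everything, so skip it explicitly.
        if loss is None or not math.isfinite(loss):
            continue
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name
```

If every checkpoint has a NaN loss, as in your case, a selection like this has nothing to return, which is why the path has to be passed by hand.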

However, this issue with the inf/nan losses will mean that your model is unlikely to detect your objects correctly. Typically you want to debug which item(s) in your training dataset are causing the inf/nan loss. To do this, I recommend uncommenting this line of code:

https://github.com/dusty-nv/pytorch-ssd/blob/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/datasets/voc_dataset.py#L76

Then run train_ssd.py with the options --batch-size=1 --workers=1 --debug-steps=1
The image ID that gets printed out directly before the inf/nan loss is the one causing the issue.
You can then drill down and inspect that image’s XML file to see if anything is awry (or remove that image from the dataset).
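
As a complement to inspecting the XML by hand, here is a rough sketch that scans Pascal VOC style annotation files for a few things that are worth checking: missing objects, unparsable coordinates, zero or negative box sizes, and boxes outside the image. These checks are my assumption about likely culprits, not a confirmed diagnosis for your dataset, and find_bad_boxes, scan_annotations and the data/Annotations path are hypothetical names.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_bad_boxes(xml_text):
    """Return a list of problem descriptions for one Pascal VOC annotation."""
    root = ET.fromstring(xml_text)
    problems = []

    # Image size is optional here; bounds checks are skipped if it is missing.
    width = height = None
    size = root.find("size")
    if size is not None:
        try:
            width = float(size.findtext("width"))
            height = float(size.findtext("height"))
        except (TypeError, ValueError):
            problems.append("image size is missing or not numeric")
            width = height = None

    objects = root.findall("object")
    if not objects:
        problems.append("no <object> entries")

    for obj in objects:
        name = obj.findtext("name", "?")
        box = obj.find("bndbox")
        if box is None:
            problems.append(f"{name}: no <bndbox>")
            continue
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            problems.append(f"{name}: box coordinates missing or not numeric")
            continue
        if xmax <= xmin or ymax <= ymin:
            problems.append(f"{name}: zero or negative box size")
        if width is not None and (xmin < 0 or ymin < 0 or xmax > width or ymax > height):
            problems.append(f"{name}: box extends outside the image")
    return problems


def scan_annotations(annotations_dir):
    """Map each annotation filename that has problems to its problem list."""
    report = {}
    for path in sorted(Path(annotations_dir).glob("*.xml")):
        problems = find_bad_boxes(path.read_text())
        if problems:
            report[path.name] = problems
    return report


if __name__ == "__main__":
    # Hypothetical location; point this at your dataset's Annotations folder.
    for filename, problems in scan_annotations("data/Annotations").items():
        print(filename, problems)
```

Any file this flags is a good candidate to compare against the image ID printed before the inf/nan loss.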
