How does resume work when a model fails to train (Following "Re-training SSD-Mobilenet" tutorial)

jeheesom · March 3, 2021, 6:32pm

I am following the tutorial at jetson-inference/pytorch-ssd.md (master branch of dusty-nv/jetson-inference on GitHub) to retrain an SSD-Mobilenet model on fruit classes.

When a model fails to complete training, what is the correct way to use --resume to resume from a checkpoint?

For example, my training failed at epoch 8 out of 30 with the error “core dumped”. This was due to low memory.

I used the following command to resume:
python3 train_ssd.py --data=data/fruit1000 --model-dir=models/fruit1000_4 --resume=models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth --batch-size=2 --workers=0 --epochs=30

Here, “models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth” is the path to the checkpoint saved for the last epoch before the failure.

Training restarts from epoch 0 and saves checkpoint files numbered from epoch 0, with loss values similar to those of the first attempt.

How do I set it to resume from epoch 8?

dusty_nv · March 3, 2021, 6:42pm

Hi @jeheesom, the train_ssd.py script doesn’t know what the previous epoch was, so it restarts the epoch count at 0. I recommend specifying a new --model-dir when you resume training so that the checkpoint files from the two runs remain separate. Also, if you have problems with the loss after resuming, you can try the --pretrained-ssd=models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth flag instead of --resume.
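
Since the script does not track the epoch itself, the bookkeeping can be done outside it. The following is only a rough sketch of that idea, not part of train_ssd.py: it reads the epoch number out of a checkpoint filename (assuming the `...-Epoch-N-Loss-X.pth` naming seen in this thread), picks the newest checkpoint, and assembles a command for a follow-up run. The helper names, the new directory `models/fruit1000_5`, and the choice to reduce `--epochs` by the number of finished epochs are all assumptions made for illustration; only `--data`, `--model-dir`, `--pretrained-ssd`, `--epochs`, `--batch-size` and `--workers` come from the posts above.

```python
import re

# Assumed naming scheme, based on the checkpoint file in this thread:
#   mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth
CHECKPOINT_RE = re.compile(r"Epoch-(\d+)-Loss-.+\.pth$")


def checkpoint_epoch(path):
    """Return the epoch number encoded in a checkpoint filename, or None."""
    match = CHECKPOINT_RE.search(str(path))
    return int(match.group(1)) if match else None


def latest_checkpoint(paths):
    """Return the path with the highest epoch number, or None if none match."""
    candidates = [p for p in paths if checkpoint_epoch(p) is not None]
    return max(candidates, key=checkpoint_epoch, default=None)


def resume_command(data, checkpoint, new_model_dir, total_epochs, extra_args=()):
    """Build an argument list for a follow-up train_ssd.py run (hypothetical helper)."""
    epoch = checkpoint_epoch(checkpoint)
    if epoch is None:
        raise ValueError(f"cannot read an epoch number from {checkpoint!r}")
    # Assumption: epochs are numbered from 0, so Epoch-8 means 9 epochs are done.
    remaining = max(total_epochs - (epoch + 1), 1)
    return [
        "python3", "train_ssd.py",
        f"--data={data}",
        f"--model-dir={new_model_dir}",      # separate directory for the new run
        f"--pretrained-ssd={checkpoint}",    # start from the saved weights
        f"--epochs={remaining}",
        *extra_args,
    ]


if __name__ == "__main__":
    cmd = resume_command(
        data="data/fruit1000",
        checkpoint="models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth",
        new_model_dir="models/fruit1000_5",  # hypothetical new directory
        total_epochs=30,
        extra_args=["--batch-size=2", "--workers=0"],
    )
    print(" ".join(cmd))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run(cmd, check=True)`. To pick the checkpoint automatically, something like `latest_checkpoint(pathlib.Path("models/fruit1000_4").glob("*.pth"))` should work. Keep in mind that the checkpoint files of the new run will again be numbered from 0.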

jeheesom · March 3, 2021, 7:15pm

Great, thanks for the help
