How does resume work when a model fails to train? (Following the "Re-training SSD-Mobilenet" tutorial)

I am following the tutorial at jetson-inference/pytorch-ssd.md at master · dusty-nv/jetson-inference · GitHub to retrain a model based on SSD-Mobilenet using the fruit classes.

When a model fails to complete training, what is the correct way to use --resume to resume from a checkpoint?

For example, my training failed at epoch 8 out of 30 with the error "core dumped". This was due to low memory.

I used the following command to resume:
python3 train_ssd.py --data=data/fruit1000 --model-dir=models/fruit1000_4 --resume=models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth --batch-size=2 --workers=0 --epochs=30

Where “models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth” is the path to the last checkpoint saved before the failure.

The model resumes training from epoch 0, saving files starting from epoch 0, with loss values similar to those of the first attempt.

How do I set it to resume from epoch 8?


Hi @jeheesom, the train_ssd.py script doesn’t know what the previous epoch was, so it restarts the epoch count at 0. I recommend specifying a new --model-dir when you resume training so the checkpoint files from each run stay separate. Also, if you are having problems with the loss after resuming, you can try using the --pretrained-ssd=models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth flag instead of --resume.
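
Since the script does not track the previous epoch itself, one workaround is to read the epoch number from the checkpoint filename and build the resume command around it. Below is a minimal sketch, not part of train_ssd.py. The helper names (`checkpoint_epoch`, `latest_checkpoint`, `build_resume_command`) and the `models/fruit1000_5` directory are hypothetical. The filename pattern is taken from the checkpoint name in this thread. The sketch also assumes epoch numbers are zero-based, so a checkpoint named Epoch-8 means 9 epochs are finished, and that you want to pass only the remaining epochs to keep the total at 30. Whether a shortened run suits the training schedule is not something it addresses.

```python
import re
import shlex
from pathlib import Path

# Pattern based on the checkpoint name in this thread, e.g.
# mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth
CHECKPOINT_RE = re.compile(r"Epoch-(\d+)-Loss-([0-9.]+)\.pth$")


def checkpoint_epoch(path):
    """Return the epoch number encoded in a checkpoint filename, or None."""
    match = CHECKPOINT_RE.search(Path(path).name)
    return int(match.group(1)) if match else None


def latest_checkpoint(paths):
    """Return the path with the highest epoch number, or None if none match."""
    candidates = [(checkpoint_epoch(p), str(p)) for p in paths]
    candidates = [c for c in candidates if c[0] is not None]
    return max(candidates)[1] if candidates else None


def build_resume_command(checkpoint, data_dir, new_model_dir, total_epochs,
                         use_pretrained_ssd=False):
    """Build a train_ssd.py command line that continues from `checkpoint`.

    Assumption: epochs are zero-based, so Epoch-N means N + 1 epochs are done.
    """
    last_epoch = checkpoint_epoch(checkpoint)
    if last_epoch is None:
        raise ValueError(f"no epoch number found in {checkpoint!r}")
    remaining = max(total_epochs - (last_epoch + 1), 0)
    # Both flags appear in this thread; --pretrained-ssd is the suggested
    # fallback if the loss misbehaves after resuming.
    flag = "--pretrained-ssd" if use_pretrained_ssd else "--resume"
    args = [
        "python3", "train_ssd.py",
        f"--data={data_dir}",
        f"--model-dir={new_model_dir}",  # a new directory keeps runs separate
        f"{flag}={checkpoint}",
        "--batch-size=2",
        "--workers=0",
        f"--epochs={remaining}",
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    model_dir = Path("models/fruit1000_4")
    newest = latest_checkpoint(model_dir.glob("*.pth"))
    if newest is None:
        print(f"no checkpoints found in {model_dir}")
    else:
        # models/fruit1000_5 is a hypothetical new output directory.
        print(build_resume_command(newest, "data/fruit1000",
                                   "models/fruit1000_5", total_epochs=30))
```

The new run will still number its checkpoints from Epoch-0, so keeping a note of the offset (here, 9) makes it easier to line up the two runs afterwards.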


Great, thanks for the help