I am following the tutorial at jetson-inference/pytorch-ssd.md at master · dusty-nv/jetson-inference · GitHub to retrain an SSD-Mobilenet model on the fruit classes.
When a model fails to complete training, what is the correct way to use --resume to resume from a checkpoint?
For example, my training failed at epoch 8 of 30 with the error "core dumped". This was due to low memory.
I used the following command to resume:
python3 train_ssd.py --data=data/fruit1000 --model-dir=models/fruit1000_4 --resume=models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth --batch-size=2 --workers=0 --epochs=30
Here, “models/fruit1000_4/mb1-ssd-Epoch-8-Loss-5.5926465432937835.pth” is the path to the most recent checkpoint saved before the failure.
However, training restarts from epoch 0: the saved files are numbered from epoch 0 again, and the loss values are similar to those of the first attempt.
How do I set it to resume from epoch 8?
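In case it helps, here is a small sketch of how the epoch number could be read back from the checkpoint file name. The pattern is assumed from the single file name in my command (mb1-ssd-Epoch-<epoch>-Loss-<loss>.pth). The helper names parse_checkpoint and latest_checkpoint are my own and are not part of train_ssd.py. I do not know whether train_ssd.py has any option that would accept this value.

```python
import re
from pathlib import Path

# Assumed file-name pattern, inferred from the checkpoint in the command above:
#   mb1-ssd-Epoch-<epoch>-Loss-<loss>.pth
CHECKPOINT_RE = re.compile(r"Epoch-(\d+)-Loss-([0-9.]+)\.pth$")


def parse_checkpoint(name):
    """Return (epoch, loss) parsed from a checkpoint file name, or None."""
    match = CHECKPOINT_RE.search(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def latest_checkpoint(model_dir):
    """Return (path, epoch) for the highest-epoch checkpoint in model_dir, or None."""
    best = None
    for path in Path(model_dir).glob("*.pth"):
        parsed = parse_checkpoint(path.name)
        if parsed is not None and (best is None or parsed[0] > best[1]):
            best = (path, parsed[0])
    return best


if __name__ == "__main__":
    # Hypothetical usage with the model directory from my command.
    print(latest_checkpoint("models/fruit1000_4"))
```

This only identifies which checkpoint is the latest and which epoch it belongs to. It does not change where training starts.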