Jetson Inference Training resume

nerk · March 22, 2021, 4:57pm

Hello,
I started training my own model, and everything went fine until training crashed during epoch 6 (of 30).
I’m trying to resume:

python3 train_ssd.py --data=data/pedestrian --model-dir=models/pedestrian --resume models/pedestrian/mb1-ssd-Epoch-5-Loss-5.712104982496332.pth

but it is starting from epoch 0:

2021-03-22 17:49:09 - Build network.
2021-03-22 17:49:10 - Resume from the model models/pedestrian/mb1-ssd-Epoch-5-Loss-5.712104982496332.pth
2021-03-22 17:49:10 - Took 0.56 seconds to load the model.
2021-03-22 17:49:25 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-03-22 17:49:25 - Uses CosineAnnealingLR scheduler.
2021-03-22 17:49:25 - Start training from epoch 0.

What am I doing wrong?

dusty_nv · March 22, 2021, 6:45pm

Hi @nerk, the script doesn’t know which epoch it last ended on, so it restarts the epoch counter at 0. However, it did load your partially-trained model, so it has a better starting point than a run that truly started from scratch. What I recommend is specifying a new --model-dir so that you can keep the different training runs separate.
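
Because the epoch counter restarts, it can help to work out by hand how many epochs are still left. The following is a minimal sketch, not part of train_ssd.py: the helper names (`parse_checkpoint`, `remaining_epochs`) and the `models/pedestrian-resumed` directory are hypothetical. It assumes the checkpoint filename follows the `...-Epoch-<n>-Loss-<loss>.pth` pattern seen above and that epoch numbers are zero-based, as the "Start training from epoch 0" log line suggests. The `--epochs` flag in the printed command is also an assumption; check `python3 train_ssd.py --help` for the exact name in your version.

```python
import re
from pathlib import Path

# Hypothetical helpers (not part of train_ssd.py). They only read the
# epoch and loss that the checkpoint filename appears to encode.
CHECKPOINT_RE = re.compile(r"Epoch-(\d+)-Loss-(\d+(?:\.\d+)?)\.pth$")


def parse_checkpoint(path):
    """Return (epoch, loss) parsed from a checkpoint filename."""
    match = CHECKPOINT_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {path}")
    return int(match.group(1)), float(match.group(2))


def remaining_epochs(path, total_epochs=30):
    """Epochs left to train, assuming zero-based epoch numbers."""
    last_epoch, _ = parse_checkpoint(path)
    return max(total_epochs - (last_epoch + 1), 0)


ckpt = "models/pedestrian/mb1-ssd-Epoch-5-Loss-5.712104982496332.pth"
last_epoch, loss = parse_checkpoint(ckpt)
left = remaining_epochs(ckpt, total_epochs=30)

# Print a resume command that writes to a separate (hypothetical) directory.
# The --epochs flag is an assumption; verify it with --help.
print(
    "python3 train_ssd.py --data=data/pedestrian "
    "--model-dir=models/pedestrian-resumed "
    f"--resume={ckpt} --epochs={left}"
)
```

Note that starting a new run also restarts the learning-rate schedule (the log shows a CosineAnnealingLR scheduler), so the resumed run is not an exact continuation of the interrupted one.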

nerk · March 22, 2021, 6:49pm

Thank you, now I understand.