Hi MilesW, PATH should be the path to the last model checkpoint that was saved (plants/checkpoint.pth.tar).
--epoch-start is optional, as the script loads the epoch number from the checkpoint that you pass to --resume (it represents the last epoch that was run). So you should be able to skip the --epoch-start argument (the code sets it automatically).
Note that it will resume training from the epoch where you last left off, up to the number of epochs specified by the --epochs argument (default is 35 epochs). So if you left off training at epoch 15, it would train for 20 more epochs by default (up to 35 epochs). --epochs is the total number of epochs to train for across all training sessions and resumes, not the number of epochs to train for in a single session.
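To make the epoch arithmetic concrete, here is a minimal Python sketch. The function name `remaining_epochs` is made up for illustration and is not part of the training script; it only mirrors the behavior described above, where --epochs is a total across sessions.

```python
def remaining_epochs(total_epochs: int, last_epoch: int) -> int:
    """Epochs still to run when total_epochs counts across all sessions.

    total_epochs: the value given to --epochs (35 by default, per the post above)
    last_epoch:   the last epoch that was run, as stored in the checkpoint
    """
    # Never report a negative count if training already reached the total.
    return max(total_epochs - last_epoch, 0)


print(remaining_epochs(35, 15))  # 20 more epochs, matching the example above
```

So to train longer than the original total after resuming, you would raise --epochs rather than expect it to count from the resume point.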
I’m trying to resume a model called scj and cannot find a checkpoint.pth.tar file anywhere in the data, models, or vision directories. The new model and training have all worked fine; I just wish to add to it. Thanks
Hi @nic_wren, are there other .pth.tar files in the folder of your trained model? For example, best_model.pth.tar. There should be some PyTorch model files in the directory that you saved your model to when you originally trained it.
Thanks Dusty for getting back to me. I can find no tar files anywhere. I can find plenty of files with names like mb1-ssd-Epoch-0-Loss-6.268204510211945.pth, with no .tar, and definitely none named as you state. I followed your tutorial on training tractors to the letter (just not with tractors) and everything worked well. The only difference was that when I ran the model, it took a heck of a lot longer to start up than in your demo, so long that I very nearly turned it off! This was all done on a freshly installed SD card. Wishing you and your family a merry Christmas
Ah ok, that’s normal - the pytorch-ssd code (for object detection training) only saves .pth files (not .pth.tar). The post above that references the checkpoint.pth.tar file is referring to the classification training script. So you would want to --resume from your last mb1-ssd-Epoch-*.pth file (the one with the highest epoch number).
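If there are a lot of checkpoints, picking the highest epoch by eye is error-prone, since a plain alphabetical sort puts Epoch-10 before Epoch-2. Here is a small Python sketch that compares the epoch numbers numerically. The helper name `latest_checkpoint` and the `models/scj` path in the usage note are assumptions for illustration; the regex only assumes the `Epoch-<N>-Loss-<value>.pth` naming shown in the posts above.

```python
import re
from typing import Iterable, Optional

# Matches names such as mb1-ssd-Epoch-12-Loss-3.4567.pth and captures the epoch.
CHECKPOINT_PATTERN = re.compile(r"Epoch-(\d+)-Loss-[\d.]+\.pth$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the file name with the highest epoch number, or None if none match."""
    best_name = None
    best_epoch = -1
    for name in names:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue  # skip labels.txt and other non-checkpoint files
        epoch = int(match.group(1))  # numeric compare, so 10 > 2
        if epoch > best_epoch:
            best_epoch = epoch
            best_name = name
    return best_name
```

For example, `latest_checkpoint(p.name for p in pathlib.Path("models/scj").glob("*.pth"))` would give the file name to pass to --resume, assuming your checkpoints live in that directory.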
I believe that when pytorch-ssd resumes, it will think it started again from epoch 0 (even though in fact it is on whatever epoch + 1 you resumed from), which will lead it to save more models as mb1-ssd-Epoch-0-*.pth, mb1-ssd-Epoch-1-*.pth, etc. So when resuming, you probably want to pick a new --model-dir to keep your models straight.
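As a rough sketch of keeping the resumed run separate, the Python snippet below builds the two arguments mentioned above. The helper name `resume_args` and the `-resumed` directory suffix are made up for illustration; only the --resume and --model-dir flags come from this thread, and any other arguments your original training command used would still need to be added.

```python
from pathlib import PurePosixPath
from typing import List


def resume_args(checkpoint_path: str) -> List[str]:
    """Build --resume/--model-dir arguments that write to a fresh directory."""
    checkpoint = PurePosixPath(checkpoint_path)
    # Hypothetical convention: save resumed checkpoints next to the original
    # directory, in "<original-dir>-resumed", so epoch numbers cannot clash.
    new_model_dir = checkpoint.parent.with_name(checkpoint.parent.name + "-resumed")
    return ["--resume", str(checkpoint), "--model-dir", str(new_model_dir)]


print(" ".join(resume_args("models/scj/mb1-ssd-Epoch-29-Loss-2.5.pth")))
```

With the restarted numbering, the new directory's Epoch-0 file would then correspond to the epoch after the one you resumed from.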
Thank you for the holiday wishes, and wish you and your family a Merry Christmas as well.