How to resume PyTorch training on Jetson Nano?

To stop training at any time, you can press Ctrl+C. You can also restart the training again later using the --resume and --start-epoch flags, so you don’t need to wait for training to complete before testing out the model. (jetson-inference/pytorch-cat-dog.md at master · dusty-nv/jetson-inference · GitHub)

This is the command usage:
usage: train.py [-h] [--model-dir MODEL_DIR] [-a ARCH] [--resolution N] [-j N]
                [--epochs N] [--start-epoch N] [-b N] [--lr LR] [--momentum M]
                [--wd W] [-p N] [--resume PATH] [-e] [--pretrained]
                [--world-size WORLD_SIZE] [--rank RANK] [--dist-url DIST_URL]
                [--dist-backend DIST_BACKEND] [--seed SEED] [--gpu GPU]
                [--multiprocessing-distributed]
                DIR
python3 train.py --model-dir=plants [--resume PATH] [--start-epoch N] ~/datasets/PlantCLEF_Subset

My question is:
--resume PATH: which PATH should I use?
--start-epoch N: what is N?

Hi MilesW, PATH should be the path to the last model checkpoint that was saved (plants/checkpoint.pth.tar).

--start-epoch is optional, because the script loads the epoch number from the checkpoint that you pass to --resume (it represents the last epoch that was run). So you should be able to skip the --start-epoch argument (the code sets it automatically).

So see if you can run this:

python3 train.py --model-dir=plants --resume plants/checkpoint.pth.tar ~/datasets/PlantCLEF_Subset

Note that it will resume training from the epoch where you left off, up to the number of epochs specified by the --epochs argument (default is 35 epochs). So if you left off training at epoch 15, it would train for 20 more epochs by default (up to 35 epochs). --epochs is the total number of epochs to train for across all training sessions and resumes, not the number of epochs to train in a single session.
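
The counting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code taken from train.py; the function name remaining_epochs is hypothetical, and it assumes the checkpoint stores the epoch at which training left off.

```python
def remaining_epochs(resume_epoch: int, total_epochs: int = 35) -> int:
    """Return how many epochs are still to run after resuming.

    resume_epoch: epoch at which training left off (assumed to come from the checkpoint).
    total_epochs: value of --epochs, i.e. the total across all sessions (35 is the
                  default quoted in the post above).
    """
    # Clamp at zero: if training already reached the total, nothing is left to run.
    return max(total_epochs - resume_epoch, 0)


if __name__ == "__main__":
    print(remaining_epochs(15))      # 20: left off at epoch 15, default total of 35
    print(remaining_epochs(15, 50))  # 35: same checkpoint, but with --epochs=50
    print(remaining_epochs(35))      # 0: the total was already reached
```

In other words, to train longer after a resume you raise --epochs rather than passing the number of extra epochs you want.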


Hi dusty_nv

The command works well.
Thanks.

python3 train.py --model-dir=plants --resume plants/checkpoint.pth.tar ~/datasets/PlantCLEF_Subset

I’m trying to resume a model called scj and cannot find a checkpoint.pth.tar file anywhere in the data, models, or vision directories. Training the new model all worked fine; I just wish to add to it. Thanks.

Hi @nic_wren, are there other .pth.tar files in the folder of your trained model? For example, best_model.pth.tar. There should be some PyTorch model files in the directory that you saved your model to when you originally trained it.

Thanks, Dusty, for getting back to me. I can find no .tar files anywhere. I can find plenty of files with names like mb1-ssd-Epoch-0-Loss-6.268204510211945.pth, with no .tar extension, and definitely none named as you describe. I followed your tutorial on training tractors to the letter (just not with tractors) and everything worked well. The only difference was that when I ran the model, it took a heck of a lot longer to start up than in your demo, so long that I very nearly turned it off! This was all done on a freshly installed SD card. Wishing you and your family a merry Christmas.

Ah ok, that’s normal - the pytorch-ssd code (for object detection training) only saves the .pth files (not pth.tar). The post above that references the checkpoint.pth.tar file is referring to the classification training script. So you would want to --resume based on your last mb1-ssd-Epoch-*.pth file (the one with the highest epoch number).
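
Picking the file with the highest epoch number by eye is error-prone, because a plain alphabetical sort puts Epoch-10 before Epoch-9. A minimal sketch of doing it numerically follows. It is not part of pytorch-ssd; the helper name latest_checkpoint is hypothetical, and the filename pattern is assumed from the example name quoted above.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed pattern, based on names such as mb1-ssd-Epoch-0-Loss-6.268204510211945.pth
EPOCH_PATTERN = re.compile(r"-Epoch-(\d+)-Loss-[\d.]+\.pth$")


def latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the .pth file with the highest epoch number, or None if there is none."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return None

    best_epoch, best_path = -1, None
    for path in directory.glob("*.pth"):
        match = EPOCH_PATTERN.search(path.name)
        if match is None:
            continue  # skip .pth files that do not follow the assumed naming scheme
        epoch = int(match.group(1))  # compare as integers, not as strings
        if epoch > best_epoch:
            best_epoch, best_path = epoch, path
    return best_path
```

The returned path is what you would pass to --resume; for example, latest_checkpoint("models/scj"), where models/scj is only a placeholder for wherever your model was saved.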

I believe that when pytorch-ssd resumes, it will think it started again from epoch 0 (even though it is in fact on whatever epoch + 1 you resumed from), which will lead it to save more models as mb1-ssd-Epoch-0-*.pth, mb1-ssd-Epoch-1-*.pth, etc. So when resuming, you probably want to pick a new --model-dir to keep your models straight.

Thank you for the holiday wishes, and wish you and your family a Merry Christmas as well.


Thanks for your help - all the best.