Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) - GPU
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) - Classification
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)- v3.22.05-tf1.15.5-py3
Hello,
I was running TAO image classification training. The train command has an --init_epoch flag for checkpoints. Training starts only when I set it to 1; if I change the value, I get the attached error. I want to know how to save checkpoints in image classification, and how I can resume training from a checkpoint.
Suggestions would be appreciated …
init_epoch is the epoch number from which training resumes.
To resume from a checkpoint, use --init_epoch along with your checkpoint configured in the spec file.
I did not try to reproduce your experiment. You can download the spec file from the Jupyter notebook. Most importantly, as mentioned above, please make sure that the model_path in the spec file is updated to the .tlt file of the corresponding epoch you wish to resume from.
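As a rough illustration of matching the checkpoint file to the epoch, here is a small Python sketch that builds the .tlt path for a given epoch and checks that it exists before you put it in the spec file. The naming pattern (architecture name, underscore, epoch zero-padded to three digits) is only an assumption inferred from the resnet_050.tlt path mentioned in this thread, and checkpoint_path is a hypothetical helper, not part of TAO.

```python
from pathlib import Path


def checkpoint_path(weights_dir: str, arch: str, epoch: int) -> Path:
    """Build the expected .tlt checkpoint path for an epoch.

    Assumption: files are named <arch>_<epoch padded to 3 digits>.tlt,
    as suggested by "resnet_050.tlt" in this thread. Adjust if your
    output/weights directory uses a different pattern.
    """
    if epoch < 1:
        raise ValueError("epoch must be >= 1")
    return Path(weights_dir) / f"{arch}_{epoch:03d}.tlt"


if __name__ == "__main__":
    # Directory taken from the path quoted in this thread.
    weights_dir = "/workspace/tao-experiments/classification/output/weights"
    path = checkpoint_path(weights_dir, "resnet", 50)
    print("Checkpoint to reference in the spec file:", path)
    if not path.exists():
        print("Warning: this file does not exist; list the weights directory to confirm the name.")
```

Run this inside the container where the weights directory is mounted, so the existence check sees the same files that training does.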
I started training and stopped it after 50 epochs had completed. I then set --init_epoch 50 and changed the model path to model_path: "/workspace/tao-experiments/classification/output/weights/resnet_050.tlt", but I am still getting the same error. Do I also need to change pretrained_model_path?