Classification checkpoints

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) - GPU
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) - Classification
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) - v3.22.05-tf1.15.5-py3


I was running TAO image classification training. The train command has an --init_epoch flag for checkpoints. Training starts only when I set it to 1; with any other value it fails with the attached error. How do I save checkpoints in image classification, and how can I resume training from a checkpoint?
Suggestions would be appreciated …

init_epoch is the epoch number from which training resumes.
To resume from a checkpoint, use --init_epoch along with your checkpoint configured in the spec file.
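As a rough illustration of the resume step, the Python sketch below builds a checkpoint path and a train command. It makes several assumptions: the checkpoint naming pattern (architecture name, underscore, three-digit epoch, .tlt) is inferred from the resnet_050.tlt file mentioned in this thread; the -e, -r and -k options follow the classification notebook and are not confirmed in this thread; the spec file location and YOUR_KEY are placeholders. The helper names are hypothetical.

```python
import posixpath


def checkpoint_path(weights_dir, arch, epoch):
    """Return the assumed path of the checkpoint saved for a given epoch.

    Assumption: checkpoints are named <arch>_<epoch, 3 digits>.tlt,
    as in the resnet_050.tlt file from this thread.
    """
    return posixpath.join(weights_dir, f"{arch}_{epoch:03d}.tlt")


def resume_command(spec_file, results_dir, key, init_epoch):
    """Build the argument list for resuming classification training.

    Assumption: -e (spec), -r (results dir) and -k (key) are the options
    used in the classification notebook; only --init_epoch is taken
    from this thread.
    """
    return [
        "tao", "classification", "train",
        "-e", spec_file,
        "-r", results_dir,
        "-k", key,
        "--init_epoch", str(init_epoch),
    ]


# The checkpoint whose path must also be configured in the spec file.
ckpt = checkpoint_path(
    "/workspace/tao-experiments/classification/output/weights", "resnet", 50
)

# Placeholder spec location and key; replace with your own values.
cmd = resume_command(
    "classification_spec.cfg",
    "/workspace/tao-experiments/classification/output",
    "YOUR_KEY",
    50,
)
print(ckpt)
print(" ".join(cmd))
```

The command alone is not enough: the spec file also has to point at the same checkpoint, otherwise the resume fails as described in the question.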

Can you elaborate on what specifically needs to change in the spec file?

Please check below.

Can you share your classification_spec.cfg?

I did not try to reproduce your experiment, so you can download the spec file from the Jupyter notebook. Most importantly, as mentioned above, please make sure that the model_path in the spec file is updated to the .tlt file of the epoch you wish to resume from.

I started training, stopped it after 50 epochs had completed, set --init_epoch 50, and changed the model path to model_path: "/workspace/tao-experiments/classification/output/weights/resnet_050.tlt". I am still getting the same error. Do I also need to change pretrained_model_path?

Please share your latest spec file.

classification_spec.cfg (1.2 KB)

In your training spec, the line below is not expected when resuming.

pretrained_model_path: "/workspace/tao-experiments/classification/pretrained_resnet18/pretrained_classification_vresnet18/resnet_18.hdf5"

Please set it to the .tlt checkpoint you want to resume from:

pretrained_model_path: "your_tlt_model"
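For anyone scripting this change, here is a minimal Python sketch that swaps the pretrained_model_path value in a spec file's text. The function name and the sample spec fragment are hypothetical (the real classification_spec.cfg has more fields than shown), and the sketch assumes the field appears exactly once with a double-quoted value.

```python
import re


def set_pretrained_model_path(spec_text, checkpoint):
    """Return spec_text with pretrained_model_path pointing at checkpoint.

    Assumes exactly one pretrained_model_path line with a double-quoted
    value; raises ValueError otherwise so a bad edit is not silent.
    """
    pattern = re.compile(
        r'^(\s*pretrained_model_path\s*:\s*)"[^"]*"', re.MULTILINE
    )
    # A function replacement avoids backslash-escape surprises in paths.
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{checkpoint}"', spec_text
    )
    if count != 1:
        raise ValueError(f"expected 1 pretrained_model_path line, found {count}")
    return new_text


# Hypothetical, shortened spec fragment for illustration only.
sample_spec = '''train_config {
  pretrained_model_path: "/workspace/tao-experiments/classification/pretrained_resnet18/pretrained_classification_vresnet18/resnet_18.hdf5"
}
'''

updated_spec = set_pretrained_model_path(
    sample_spec,
    "/workspace/tao-experiments/classification/output/weights/resnet_050.tlt",
)
print(updated_spec)
```

After writing the updated text back to the spec file, rerun training with --init_epoch set as described above.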
