Hello everyone, I don’t know if someone has already commented on this, but in case it helps someone, I’d like to explain how to get the “epoch” number from the “checkpoint” number.
I’ve seen some posts where users have “checkpoint_interval = 1” set. In that case, at the end of training you will have as many checkpoints as training epochs, so a checkpoint can be directly associated with an epoch by sorting the checkpoints from lowest to highest step number (ref).
However, when “checkpoint_interval > 1” a direct association is a bit more complex. In that case, the following formula gives the epoch associated with a checkpoint:

epoch = checkpoint_number / ceil(num_training_images × (1 − val_split) / batch_size_per_gpu)

where:
- batch_size_per_gpu: Same value configured in “training_config”
- val_split: Same value configured when generating the tfrecords
- num_training_images: Total number of images in the dataset (trainval)
- checkpoint_number: Number with which each checkpoint is saved during training (the N in “model.step-N.tlt”)
For example, if you have:
- batch_size_per_gpu = 3
- val_split = 0.2
- num_training_images = 1033
Then the correspondence between checkpoints and epochs is as follows:
epoch: 0, checkpoint: trained_models/trained_model_90_0_1.0/model.step-0.tlt
epoch: 5, checkpoint: trained_models/trained_model_90_0_1.0/model.step-1380.tlt
epoch: 10, checkpoint: trained_models/trained_model_90_0_1.0/model.step-2760.tlt
epoch: 15, checkpoint: trained_models/trained_model_90_0_1.0/model.step-4140.tlt
epoch: 20, checkpoint: trained_models/trained_model_90_0_1.0/model.step-5520.tlt
epoch: 25, checkpoint: trained_models/trained_model_90_0_1.0/model.step-6900.tlt
epoch: 30, checkpoint: trained_models/trained_model_90_0_1.0/model.step-8280.tlt
epoch: 35, checkpoint: trained_models/trained_model_90_0_1.0/model.step-9660.tlt
epoch: 40, checkpoint: trained_models/trained_model_90_0_1.0/model.step-11040.tlt
epoch: 45, checkpoint: trained_models/trained_model_90_0_1.0/model.step-12420.tlt
epoch: 50, checkpoint: trained_models/trained_model_90_0_1.0/model.step-13800.tlt
epoch: 55, checkpoint: trained_models/trained_model_90_0_1.0/model.step-15180.tlt
epoch: 60, checkpoint: trained_models/trained_model_90_0_1.0/model.step-16560.tlt
epoch: 65, checkpoint: trained_models/trained_model_90_0_1.0/model.step-17940.tlt
epoch: 70, checkpoint: trained_models/trained_model_90_0_1.0/model.step-19320.tlt
epoch: 75, checkpoint: trained_models/trained_model_90_0_1.0/model.step-20700.tlt
epoch: 80, checkpoint: trained_models/trained_model_90_0_1.0/model.step-22080.tlt
epoch: 85, checkpoint: trained_models/trained_model_90_0_1.0/model.step-23460.tlt
epoch: 90, checkpoint: trained_models/trained_model_90_0_1.0/model.step-24840.tlt
This checks out, since the training was done with “num_epochs = 90” and “checkpoint_interval = 5”. I hope someone finds this information useful. Cheers
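Edit: the mapping above can be reproduced with a short Python sketch. Note the rounding (dropping the validation split, then rounding the steps per epoch up with ceil) is my assumption about what the trainer does internally; it matches the step numbers in this example, but verify against your own checkpoints.

```python
import math

# Values from the example above -- adjust to your own
# training_config / tfrecords settings.
batch_size_per_gpu = 3
val_split = 0.2
num_training_images = 1033  # total images in the dataset (trainval)
num_epochs = 90
checkpoint_interval = 5

# Steps per epoch: training images (after removing the validation split)
# divided by the batch size, rounded up. The exact rounding the trainer
# uses is an assumption; it reproduces the numbers in this example.
steps_per_epoch = math.ceil(num_training_images * (1 - val_split) / batch_size_per_gpu)

def epoch_of(checkpoint_number: int) -> int:
    """Map a model.step-<N>.tlt checkpoint number back to its epoch."""
    return checkpoint_number // steps_per_epoch

# Reproduce the epoch/checkpoint table from the post.
for epoch in range(0, num_epochs + 1, checkpoint_interval):
    step = epoch * steps_per_epoch
    print(f"epoch: {epoch}, checkpoint: model.step-{step}.tlt")
```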