Get epoch associated with a checkpoint in Detectnet_V2

Hello everyone, I don’t know whether this has already been covered, but in case it helps someone, here is how to get the “epoch” number from the “checkpoint” number.

I’ve seen some posts where users have “checkpoint_interval = 1” set. In that case, at the end of training you have as many checkpoints as training epochs, so each checkpoint can be associated directly with an epoch by sorting the checkpoints from lowest to highest (ref).

However, when “checkpoint_interval > 1”, a direct association is a bit more complex. In that case, the following formula can be used to obtain the epoch associated with a checkpoint:

epoch = checkpoint_number * batch_size_per_gpu / ((1 - val_split) * num_training_images)

where:

  • batch_size_per_gpu: Same value configured in “training_config”
  • val_split: Same value configured in the generation of tfrecords
  • num_training_images: Total number of images in the dataset (trainval)
  • checkpoint_number: Number with which each checkpoint is saved during training

For example, if you have:
-batch_size_per_gpu = 3
-val_split = 0.2
-num_training_images = 1033

Then the correspondence between the following checkpoints and epochs is the following:


epoch: 0, checkpoint: trained_models/trained_model_90_0_1.0/model.step-0.tlt
epoch: 5, checkpoint: trained_models/trained_model_90_0_1.0/model.step-1380.tlt
epoch: 10, checkpoint: trained_models/trained_model_90_0_1.0/model.step-2760.tlt
epoch: 15, checkpoint: trained_models/trained_model_90_0_1.0/model.step-4140.tlt
epoch: 20, checkpoint: trained_models/trained_model_90_0_1.0/model.step-5520.tlt
epoch: 25, checkpoint: trained_models/trained_model_90_0_1.0/model.step-6900.tlt
epoch: 30, checkpoint: trained_models/trained_model_90_0_1.0/model.step-8280.tlt
epoch: 35, checkpoint: trained_models/trained_model_90_0_1.0/model.step-9660.tlt
epoch: 40, checkpoint: trained_models/trained_model_90_0_1.0/model.step-11040.tlt
epoch: 45, checkpoint: trained_models/trained_model_90_0_1.0/model.step-12420.tlt
epoch: 50, checkpoint: trained_models/trained_model_90_0_1.0/model.step-13800.tlt
epoch: 55, checkpoint: trained_models/trained_model_90_0_1.0/model.step-15180.tlt
epoch: 60, checkpoint: trained_models/trained_model_90_0_1.0/model.step-16560.tlt
epoch: 65, checkpoint: trained_models/trained_model_90_0_1.0/model.step-17940.tlt
epoch: 70, checkpoint: trained_models/trained_model_90_0_1.0/model.step-19320.tlt
epoch: 75, checkpoint: trained_models/trained_model_90_0_1.0/model.step-20700.tlt
epoch: 80, checkpoint: trained_models/trained_model_90_0_1.0/model.step-22080.tlt
epoch: 85, checkpoint: trained_models/trained_model_90_0_1.0/model.step-23460.tlt
epoch: 90, checkpoint: trained_models/trained_model_90_0_1.0/model.step-24840.tlt


This is correct, since the training was done with “num_epochs = 90” and “checkpoint_interval = 5”. I hope someone finds this information useful. Cheers
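
For anyone who wants to script this, here is a minimal Python sketch of the mapping described in this post. It assumes the relationship epoch = checkpoint_number * batch_size_per_gpu / ((1 - val_split) * num_training_images), a single GPU, and checkpoint files named like model.step-N.tlt as in the listing above. The function names epoch_from_checkpoint and step_from_path are made up for illustration and are not part of the toolkit. Because the result is generally not an integer, it is rounded to the nearest epoch.

```python
import re


def epoch_from_checkpoint(checkpoint_number, batch_size_per_gpu,
                          val_split, num_training_images):
    """Estimate the (possibly fractional) epoch for a checkpoint step number."""
    # Number of images used for training after the validation split is removed.
    images_per_epoch = (1 - val_split) * num_training_images
    return checkpoint_number * batch_size_per_gpu / images_per_epoch


def step_from_path(path):
    """Extract the step number from a file name such as 'model.step-1380.tlt'."""
    match = re.search(r"model\.step-(\d+)\.tlt$", path)
    if match is None:
        raise ValueError(f"no step number found in {path!r}")
    return int(match.group(1))


# Values from the example in this post (assumed single-GPU training).
path = "trained_models/trained_model_90_0_1.0/model.step-1380.tlt"
step = step_from_path(path)
epoch = epoch_from_checkpoint(step, batch_size_per_gpu=3,
                              val_split=0.2, num_training_images=1033)
print(f"epoch: {round(epoch)}, checkpoint: {path}")
```

The raw value is slightly above a whole number (a little over 5 for step 1380), which is why the sketch rounds before printing.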

Thanks for sharing!
But taking your example again,
-batch_size_per_gpu = 3
-val_split = 0.2
-num_training_images = 1033

When I calculate the checkpoint_number for epoch 10 with the formula above, I get 10 * (1 - 0.2) * 1033 / 3 ≈ 2754.67

It is different from model.step-2760.tlt. Am I doing something wrong?

Or is your num_training_images 1035? With 1035, the result is exactly 2760.
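
The comparison in this reply can be reproduced with a short Python sketch. It only evaluates the unrounded expression epoch * (1 - val_split) * num_training_images / batch_size_per_gpu for both candidate dataset sizes; the function name checkpoint_from_epoch is hypothetical.

```python
def checkpoint_from_epoch(epoch, batch_size_per_gpu, val_split,
                          num_training_images):
    """Unrounded checkpoint (step) number expected at the end of `epoch`."""
    return epoch * (1 - val_split) * num_training_images / batch_size_per_gpu


# Compare the two dataset sizes discussed in this reply, for epoch 10.
for num_images in (1033, 1035):
    step = checkpoint_from_epoch(10, batch_size_per_gpu=3,
                                 val_split=0.2, num_training_images=num_images)
    print(f"num_training_images={num_images}: step ~ {step:.2f}")
```

With 1033 images the value falls a few steps short of 2760, while 1035 images gives 2760.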

You’re right about that, but I think this is mainly due to an approximation problem, since both the “epoch” and “checkpoint number” are integer values.

As I understand it, the checkpoint_number is the number of iterations in which the neural network is “fed” with images. In this case my training dataset has 1033 * (1 - 0.2) = 826.4 images (impossible, right?). During one training epoch the model should be “fed” these 826.4 images, but since batch_size = 3, it is “fed” in groups of 3 images (correct me if I’m wrong). Therefore, the model “sees” the entire training dataset (1 epoch) every 826.4 / 3 = 275.47 iterations (impossible too, right?). Finally, if this number of iterations is rounded up to an integer, 276 “feed” iterations are needed for each epoch.

With this rounded value, it is now possible to associate a checkpoint number with an epoch:
Epoch → checkpoint number
0 → 0 * 276 = 0
5 → 5 * 276 = 1380
10 → 10 * 276 = 2760
25 → 25 * 276 = 6900
85 → 85 * 276 = 23460

The rounding from 275.47 to 276 is presumably done by TensorFlow somewhere, but I’m not sure where. Well, that’s my theory. Regards.
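
The theory above can be sketched in a few lines of Python. This assumes the number of iterations per epoch is rounded up (math.ceil) and that a single GPU is used. Where the rounding actually happens in the training code is not confirmed in this thread, and steps_per_epoch is a made-up helper name.

```python
import math


def steps_per_epoch(batch_size_per_gpu, val_split, num_training_images):
    """Iterations per epoch, rounded up to a whole number (assumed behavior)."""
    images_per_epoch = (1 - val_split) * num_training_images
    return math.ceil(images_per_epoch / batch_size_per_gpu)


# Values from the example in this thread.
spe = steps_per_epoch(batch_size_per_gpu=3, val_split=0.2,
                      num_training_images=1033)

# Rebuild the epoch -> checkpoint number table.
for epoch in (0, 5, 10, 25, 85):
    print(f"{epoch} -> {epoch} * {spe} = {epoch * spe}")

# Going the other way: with a whole number of steps per epoch,
# the epoch is an exact integer division of the step number.
print(23460 // spe)
```

Using a whole number of steps per epoch makes the checkpoint numbers exact multiples, which matches the saved file names.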


Thanks for the info. Appreciate it!