DetectNet_v2 checkpoint interval gives unexpected results


I am training a DetectNet_v2 model with ResNet10 as the backbone. In the training configuration I set checkpoint_interval = 1, which should mean that TLT saves a checkpoint at every epoch. But when I check the trained_model directory, the checkpoints are stored as:

root@tlt:/app# ls /project/trained_model
events.out.tfevents.1597134478.tlt  model.step-20.tlt    model.step-4.tlt     model.step-56.tlt
experiment_spec.txt                 model.step-24.ckzip  model.step-40.ckzip  model.step-60.ckzip
graph.pbtxt                         model.step-24.tlt    model.step-40.tlt    model.step-60.tlt
model.step-0.ckzip                  model.step-28.ckzip  model.step-44.ckzip  model.step-64.ckzip
model.step-0.tlt                    model.step-28.tlt    model.step-44.tlt    model.step-64.tlt
model.step-12.ckzip                 model.step-32.ckzip  model.step-48.ckzip  model.step-8.ckzip
model.step-12.tlt                   model.step-32.tlt    model.step-48.tlt    model.step-8.tlt
model.step-16.ckzip                 model.step-36.ckzip  model.step-52.ckzip  monitor.json
model.step-16.tlt                   model.step-36.tlt    model.step-52.tlt    weights
model.step-20.ckzip                 model.step-4.ckzip   model.step-56.ckzip

How can I identify which checkpoint was saved for which epoch? I also tried setting checkpoint_interval to 0 or 10, but the outputs are mixed.

I want to save a checkpoint for each epoch. How can this be achieved?

In the case of SSD, the weight files are stored inside trained_model/weights and are easily accessible. I wanted to do the same with DetectNet_v2.

I believe you have already achieved this. Each file such as xxx-12.tlt or xxx-16.tlt is the tlt model saved for one epoch. Please count the tlt models to double-check.

Can you explain what I have achieved?
What do you mean by the quantity of tlt models?

I trained for 20 epochs to check if I can get the desired model file for every epoch.
So far, no luck.

I ran an experiment as you described, with checkpoint_interval = 1 and num_epochs: 10.
After training finished, I found a tlt model for each of epochs 0, 1, …, 10. That meets your requirement.

# ll -rlt *tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:20 model.step-0.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:22 model.step-403.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:25 model.step-806.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:26 model.step-1209.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:27 model.step-1612.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:28 model.step-2015.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:29 model.step-2418.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:30 model.step-2821.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:31 model.step-3224.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:32 model.step-3627.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:33 model.step-4030.tlt

Can you please explain how you identify the checkpoint for a given epoch?
For example, if I want to select the checkpoint saved for the 5th epoch, how do I identify it?


Which epoch does model.step-1612.tlt belong to?

The checkpoints can be sorted by step number, from smallest to largest.
For example,
The smallest epoch (0th epoch) will correspond to model.step-0.tlt
The largest epoch (10th epoch) will correspond to model.step-4030.tlt
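
The same mapping can also be computed from the step number itself. In the listing above the step grows by 403 per checkpoint, so dividing a step by 403 gives its epoch (1612 / 403 = 4). Below is a minimal sketch, assuming checkpoint_interval = 1 and a constant number of steps per epoch. The helper names step_of and epoch_of are made up for illustration and are not part of TLT.

```python
import re

# File names of the form model.step-<N>.tlt, as seen in the listing above.
STEP_RE = re.compile(r"model\.step-(\d+)\.tlt$")


def step_of(filename):
    """Return the integer step encoded in a model.step-<N>.tlt file name."""
    match = STEP_RE.search(filename)
    if match is None:
        raise ValueError("not a checkpoint file name: {}".format(filename))
    return int(match.group(1))


def epoch_of(filename, steps_per_epoch):
    """Map a checkpoint to its epoch, assuming a fixed number of steps per epoch."""
    step = step_of(filename)
    if step % steps_per_epoch != 0:
        raise ValueError("step {} is not a multiple of {}".format(step, steps_per_epoch))
    return step // steps_per_epoch


names = ["model.step-0.tlt", "model.step-403.tlt",
         "model.step-1612.tlt", "model.step-4030.tlt"]

# Assumption: with checkpoint_interval = 1, the smallest non-zero step is one epoch.
steps_per_epoch = min(s for s in map(step_of, names) if s > 0)

for name in names:
    print("{} -> epoch {}".format(name, epoch_of(name, steps_per_epoch)))
```

Unlike counting files, this does not depend on every earlier checkpoint still being present in the directory.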


Well, if a larger step number always corresponds to a larger epoch number, this hack can be used.
But since a checkpoint is saved at every epoch, it can be hard to determine exactly which checkpoint belongs to a given epoch, especially when training is still running and we want to check older checkpoints.

I hope this can be fixed. I really like how SSD models are saved in a uniform order.

I would like to highlight one more thing here:
DetectNet_v2 saves checkpoints inside the trained_model directory, while SSD saves checkpoints inside trained_model/weights.

Thanks for the hack

Sure, I will sync with the internal team about your request.
In the meantime, end users can

  1. list all the tlt checkpoint files and extract each step number with int(tlt_file.split('.')[1].split('-')[1])
  2. sort the files by step number and enumerate them to find which epoch each one belongs to.

I hope this script will help others

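import glob
import os
import re

# Match only checkpoint files named model.step-<N>.tlt and capture the step number.
step_re = re.compile(r"model\.step-(\d+)\.tlt$")

trained_model = [p for p in glob.glob("trained_model/model*.tlt") if step_re.search(os.path.basename(p))]
# Sort numerically by step; a plain string sort would put step-1209 before step-403.
trained_model.sort(key=lambda p: int(step_re.search(os.path.basename(p)).group(1)))

# With checkpoint_interval = 1, the position in the sorted list is the epoch number.
for epoch, checkpoint in enumerate(trained_model):
    print("epoch: {}, checkpoint: {}".format(epoch, checkpoint))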


epoch: 0, checkpoint: trained_model/model.step-0.tlt
epoch: 1, checkpoint: trained_model/model.step-403.tlt
epoch: 2, checkpoint: trained_model/model.step-806.tlt
epoch: 3, checkpoint: trained_model/model.step-1209.tlt
epoch: 4, checkpoint: trained_model/model.step-1612.tlt
epoch: 5, checkpoint: trained_model/model.step-2015.tlt
epoch: 6, checkpoint: trained_model/model.step-2418.tlt
epoch: 7, checkpoint: trained_model/model.step-2821.tlt
epoch: 8, checkpoint: trained_model/model.step-3224.tlt
epoch: 9, checkpoint: trained_model/model.step-3627.tlt
epoch: 10, checkpoint: trained_model/model.step-4030.tlt

The glob pattern also excludes the events.out.tfevents.1597134478.tlt file, which is present in the same directory.
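
For anyone who wants an epoch-numbered layout closer to what was described for SSD, the same sorting idea can be used to copy each checkpoint under a new name. This is only a sketch, assuming checkpoint_interval = 1 and that no checkpoint has been deleted. The function export_by_epoch, the weights_by_epoch directory and the model_epoch_NNN.tlt naming are all made up for illustration; TLT does not produce them.

```python
import glob
import os
import re
import shutil

# Only files named exactly model.step-<N>.tlt count as checkpoints.
STEP_RE = re.compile(r"^model\.step-(\d+)\.tlt$")


def export_by_epoch(model_dir, subdir="weights_by_epoch"):
    """Copy model.step-<N>.tlt files into model_dir/subdir with epoch-numbered names.

    Assumes one checkpoint per epoch, so the rank of a step is its epoch.
    Returns the list of new paths in epoch order.
    """
    by_step = {}
    for path in glob.glob(os.path.join(model_dir, "model.step-*.tlt")):
        match = STEP_RE.match(os.path.basename(path))
        if match:
            by_step[int(match.group(1))] = path

    target_dir = os.path.join(model_dir, subdir)
    os.makedirs(target_dir, exist_ok=True)

    exported = []
    for epoch, step in enumerate(sorted(by_step)):
        # Hypothetical naming scheme, chosen here only for readability.
        target = os.path.join(target_dir, "model_epoch_{:03d}.tlt".format(epoch))
        shutil.copy2(by_step[step], target)
        exported.append(target)
    return exported
```

Calling export_by_epoch("trained_model") leaves the original files untouched and can be re-run while training is still in progress. A separate directory name is used because the listing at the top of this thread already shows a weights entry inside trained_model.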