Hi,
I am training a DetectNet_v2 model with a ResNet10 backbone. In the training configuration I set checkpoint_interval = 1, which means TLT should save a checkpoint at every epoch. However, when I check the trained_model directory, the checkpoints are stored as:
root@tlt:/app# ls /project/trained_model
events.out.tfevents.1597134478.tlt model.step-20.tlt model.step-4.tlt model.step-56.tlt
experiment_spec.txt model.step-24.ckzip model.step-40.ckzip model.step-60.ckzip
graph.pbtxt model.step-24.tlt model.step-40.tlt model.step-60.tlt
model.step-0.ckzip model.step-28.ckzip model.step-44.ckzip model.step-64.ckzip
model.step-0.tlt model.step-28.tlt model.step-44.tlt model.step-64.tlt
model.step-12.ckzip model.step-32.ckzip model.step-48.ckzip model.step-8.ckzip
model.step-12.tlt model.step-32.tlt model.step-48.tlt model.step-8.tlt
model.step-16.ckzip model.step-36.ckzip model.step-52.ckzip monitor.json
model.step-16.tlt model.step-36.tlt model.step-52.tlt weights
model.step-20.ckzip model.step-4.ckzip model.step-56.ckzip
How can I identify which checkpoint was saved for which epoch? I also tried setting checkpoint_interval to 0 and 10, but the outputs are mixed together.
I want to save a checkpoint for each epoch; how can that be achieved?
In the case of SSD, the weight files are stored inside trained_model/weights
and are easily accessible. I wanted to do the same with DetectNet_V2.
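For context, the relevant part of my training spec looks roughly like this (other fields omitted; the batch size here is just a placeholder):

```
training_config {
  batch_size_per_gpu: 4
  num_epochs: 20
  checkpoint_interval: 1
}
```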
I am afraid you have already achieved this. The xxx-12.tlt or xxx-16.tlt files are the tlt models saved every 1 epoch. Please check the quantity of the tlt models to double-check.
Can you explain what I have achieved?
What do you mean by the quantity of tlt models?
I trained for 20 epochs to check if I can get the desired model file for every epoch.
So far no luck.
@NitinRai
I ran an experiment as you mentioned. I set checkpoint_interval = 1 and num_epochs: 10.
After training was done, I could find all the tlt models for epochs 0, 1, … 10. That meets your requirement.
# ll -rlt *tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:20 model.step-0.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:22 model.step-403.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:25 model.step-806.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:26 model.step-1209.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:27 model.step-1612.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:28 model.step-2015.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:29 model.step-2418.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:30 model.step-2821.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:31 model.step-3224.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:32 model.step-3627.tlt
-rw-r--r-- 1 root root 45022576 Aug 17 06:33 model.step-4030.tlt
Can you please explain how you identify the checkpoint for a given epoch.
For example, if I want to select the checkpoint saved at the 5th epoch, how do I identify it?
Or:
model.step-1612.tlt
which epoch does this checkpoint belong to?
The checkpoints can be sorted from smallest to largest by their step number:
int(tlt_file.split('.')[1].split('-')[1])
For example,
The smallest epoch (0th epoch) will correspond to model.step-0.tlt
The largest epoch (10th epoch) will correspond to model.step-4030.tlt
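If the number of steps per epoch is constant (403 in my run above; it depends on your dataset size and batch size), the epoch can also be computed directly from the step number. A quick sketch:

```python
def epoch_of(step, steps_per_epoch=403):
    """Map a checkpoint's step number to its 0-indexed epoch,
    assuming a constant number of steps per epoch."""
    return step // steps_per_epoch

print(epoch_of(0))     # epoch 0
print(epoch_of(1612))  # epoch 4
print(epoch_of(4030))  # epoch 10
```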
Well, if a larger step number corresponds to a larger epoch number, this hack can be used.
Since we are saving every single epoch, it can be challenging to determine exactly which checkpoint belongs to a given epoch, especially while training is live and we want to check older checkpoints.
I hope this can be fixed; I really like how SSD models are saved in a uniform order.
I would like to highlight one more thing here: detectnet_v2 saves checkpoints inside the trained_model dir, while ssd saves checkpoints inside trained_model/weights.
Thanks for the hack.
Sure, I will sync with internal team about your request.
Temporarily, end users can:
- check all the tlt checkpoint files and sort them by int(tlt_file.split('.')[1].split('-')[1])
- enumerate them to find which epoch each checkpoint corresponds to.
I hope this script will help others
import os, glob

# collect all .tlt checkpoints and sort them by their step number
trained_model = glob.glob("trained_model/model*.tlt")
trained_model = sorted(trained_model, key=lambda x: int(x.rsplit("-", 1)[1].rstrip(".tlt")))
for epoch, checkpoint in enumerate(trained_model):
    # checkpoint = os.path.basename(checkpoint)  # uncomment to drop the directory prefix
    print("epoch: {}, checkpoint: {}".format(epoch, checkpoint))
Output
epoch: 0, checkpoint: trained_model/model.step-0.tlt
epoch: 1, checkpoint: trained_model/model.step-403.tlt
epoch: 2, checkpoint: trained_model/model.step-806.tlt
epoch: 3, checkpoint: trained_model/model.step-1209.tlt
epoch: 4, checkpoint: trained_model/model.step-1612.tlt
epoch: 5, checkpoint: trained_model/model.step-2015.tlt
epoch: 6, checkpoint: trained_model/model.step-2418.tlt
epoch: 7, checkpoint: trained_model/model.step-2821.tlt
epoch: 8, checkpoint: trained_model/model.step-3224.tlt
epoch: 9, checkpoint: trained_model/model.step-3627.tlt
epoch: 10, checkpoint: trained_model/model.step-4030.tlt
This also filters out the events.out.tfevents.1597134478.tlt file, which is present in the same directory.
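If you want to jump straight to the checkpoint for a specific epoch, a slightly more robust variant extracts the step number with a regex instead of rsplit/rstrip (checkpoint_for_epoch is a hypothetical helper name, and it assumes one checkpoint per epoch):

```python
import re

def checkpoint_for_epoch(paths, epoch):
    """Return the .tlt checkpoint for a 0-indexed epoch by sorting
    the given paths on the step number embedded in their filenames."""
    step = lambda p: int(re.search(r"step-(\d+)", p).group(1))
    return sorted(paths, key=step)[epoch]

ckpts = ["trained_model/model.step-806.tlt",
         "trained_model/model.step-0.tlt",
         "trained_model/model.step-403.tlt"]
print(checkpoint_for_epoch(ckpts, 2))  # trained_model/model.step-806.tlt
```

Passing the path list explicitly (rather than calling glob inside) makes it easy to reuse the same function on a live training directory or on a saved listing.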