How to configure tlt-train to save the best-performing model

Hi,

I am running the detectnet_v2 example notebook that comes with the TLT toolkit container tlt-streamanalytics:v2.0_py3. I found that at the end of training, the model from the last epoch is saved as experiment_dir_unpruned/weights/resnet18_detector.tlt. Is there any way to configure tlt-train to save the model with the best mAP?

Thanks,

If you want to check every epoch's mAP, set the following in your spec file.

evaluation_config {
  validation_period_during_training: 1
  first_validation_epoch: 1
}
Then you will find the mAP result for each .tlt model.
All the .tlt models are saved in experiment_dir_unpruned/

For how to check which .tlt model corresponds to the nth epoch, refer to

Hi, thanks for your reply. I wish the tool could pick the model with the best mAP and save it as experiment_dir_unpruned/weights/resnet18_detector.tlt. Right now I have to go through the long output and do that manually.
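The manual step above could be scripted. Here is a minimal sketch that scans the training log for validation results and reports the best epoch. The log-line format is an assumption here (I am assuming tlt-evaluate/tlt-train prints something like "Mean average_precision (in %): 72.31" per validation run); adjust MAP_PATTERN to match your actual output.

```python
import re

# Assumed log format -- one line per validation run, e.g.:
#   Mean average_precision (in %): 72.31
# Adjust this regex to match your actual tlt-train/tlt-evaluate output.
MAP_PATTERN = re.compile(r"Mean average_precision \(in %\):\s*([0-9.]+)")

def best_epoch(log_text):
    """Return (validation_run_index, mAP) of the best result, or None."""
    scores = [float(m.group(1)) for m in MAP_PATTERN.finditer(log_text)]
    if not scores:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]
```

With validation_period_during_training set to 1, the run index maps directly to the epoch, so the returned index tells you which saved .tlt checkpoint to copy to experiment_dir_unpruned/weights/.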

In the spec, there is one setting as below.

checkpoint_interval: 10

TLT will save a tlt model every 10 epochs.

If you change to

checkpoint_interval: 1

TLT will save a tlt model every epoch.
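For context, checkpoint_interval sits inside the training_config block of the spec file. A sketch of the relevant fragment (the num_epochs value is just an illustrative placeholder):

training_config {
  num_epochs: 120
  checkpoint_interval: 1
}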

BTW, after training is done, you can also run tlt-evaluate to check the mAP of each .tlt model.

import os
import glob

# Collect all saved checkpoints and sort them by their step suffix.
trained_model = glob.glob("your_result_folder/model*.tlt")
trained_model = sorted(trained_model, key=lambda x: int(x.rsplit("-", 1)[1].rstrip(".tlt")))

# Evaluate each checkpoint in turn.
for epoch, checkpoint in enumerate(trained_model):
    print("****************")
    print("epoch: {}, checkpoint: {}".format(epoch, checkpoint))
    os.system("tlt-evaluate detectnet_v2 -e your_spec.txt -m %s -k your_ngc_key" % (checkpoint))

Hi Morganh,

Thanks for your reply. That looks like a good alternative to picking the best-performing model by hand. It works in a Python shell, but executing the os.system("tlt-evaluate ...") command in the notebook does not print any output (only zero is returned). Do you know why?

Thanks

Okay, I'll answer my own question:
in a Jupyter notebook, use the following instead:
print(os.popen("tlt-evaluate ...").read())
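For what it's worth, a more robust alternative to os.popen is the subprocess module, which also captures stderr and surfaces the command's output in a notebook cell. A minimal sketch (using "echo" as a stand-in for the real tlt-evaluate invocation):

```python
import subprocess

# Run the command as a subprocess and capture its output as text.
# Replace the "echo" command with your actual tlt-evaluate call.
result = subprocess.run(
    ["echo", "mAP results here"],
    capture_output=True,
    text=True,
)
print(result.stdout)
print(result.stderr)
```

Note that capture_output requires Python 3.7+; on older versions, pass stdout=subprocess.PIPE and stderr=subprocess.PIPE instead.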

Good. Thanks for the info.