How to configure tlt-train to save the best-performing model


I am running the detectnet_v2 example notebook that comes with the TLT container tlt-streamanalytics:v2.0_py3. At the end of training, the model from the last epoch is saved as experiment_dir_unpruned/weights/resnet18_detector.tlt. Is there any way to configure tlt-train to save the model with the best mAP instead?


To check the mAP after every epoch, set the following in the spec file.

evaluation_config {
validation_period_during_training: 1
first_validation_epoch: 1

You will then find the mAP result for each tlt model in the training output.
All the tlt models are saved in experiment_dir_unpruned/

To check which tlt model belongs to the nth epoch, refer to

Hi, thanks for your reply. I wish the tool could pick the model with the best mAP and save it as experiment_dir_unpruned/weights/resnet18_detector.tlt. Right now I have to go through the long output and do that manually.

The spec has the following setting.

checkpoint_interval: 10

TLT will save a tlt model every 10 epochs.

If you change to

checkpoint_interval: 1

TLT will save a tlt model every epoch.

BTW, after training is done, you can also run tlt-evaluate to check the mAP of each tlt model.

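import glob
import os
import re

def checkpoint_number(path):
    # Read the number after the last "-" in the file name (not the folder path).
    return int(re.search(r"-(\d+)\.tlt$", os.path.basename(path)).group(1))

# Collect the saved checkpoints and sort them numerically, not alphabetically.
trained_models = sorted(glob.glob("your_result_folder/model*.tlt"), key=checkpoint_number)

# With checkpoint_interval: 1 there is one checkpoint per epoch,
# so the position in the sorted list tracks the epoch.
for epoch, checkpoint in enumerate(trained_models):
    print("epoch: {}, checkpoint: {}".format(epoch, checkpoint))
    os.system("tlt-evaluate detectnet_v2 -e your_spec.txt -m %s -k your_ngc_key" % checkpoint)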
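Once you have one mAP value per checkpoint, picking the best one and copying it to the weights path can be scripted as well. The following is only a minimal sketch, not a tlt-train feature: the names pick_best and save_best, the checkpoint file names, and the mAP numbers are all hypothetical, and the values are assumed to have been read by hand from the evaluation output.

```python
import os
import shutil


def pick_best(map_by_checkpoint):
    """Return the checkpoint path with the highest mAP."""
    if not map_by_checkpoint:
        raise ValueError("no checkpoints given")
    return max(map_by_checkpoint, key=map_by_checkpoint.get)


def save_best(map_by_checkpoint, destination):
    """Copy the best checkpoint to destination and return its source path."""
    best = pick_best(map_by_checkpoint)
    dest_dir = os.path.dirname(destination)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)
    shutil.copyfile(best, destination)
    return best


# Hypothetical mAP values, filled in by hand from the evaluation output.
example_scores = {
    "your_result_folder/model-1.tlt": 0.41,
    "your_result_folder/model-2.tlt": 0.57,
    "your_result_folder/model-3.tlt": 0.52,
}
best_checkpoint = pick_best(example_scores)
print("best checkpoint:", best_checkpoint)
```

Calling save_best(example_scores, "experiment_dir_unpruned/weights/resnet18_detector.tlt") would then overwrite the last-epoch model with the best one, so keep a copy first if you still need it.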

Hi Morganh,

Thanks for your reply. This looks like a good alternative to picking the best-performing model manually. It works in a Python shell, but running the os.system(tlt-evaluate…) command in the notebook does not print any output (only zero is returned). Do you know why?


Okay, answering my own question:
In the Jupyter notebook, use the following instead:

Good. Thanks for the info.
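
For readers hitting the same notebook issue: os.system returns only the exit status, and the command's output most likely goes to the standard output of the process that started the notebook server rather than to the cell. A minimal sketch of one common workaround, not necessarily the one the poster used, is to capture the output with subprocess and print it. The helper name run_and_show is hypothetical, and the child command is a stand-in so the sketch runs without TLT installed.

```python
import subprocess
import sys


def run_and_show(command):
    """Run command (a list of arguments), print its combined output, return it."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,     # capture standard output
        stderr=subprocess.STDOUT,   # merge error messages into the same stream
        universal_newlines=True,    # decode bytes to str (works on Python 3.6)
    )
    print(result.stdout, end="")
    return result.stdout


# Stand-in command; replace it with the tlt-evaluate argument list.
output = run_and_show([sys.executable, "-c", "print('hello from child')"])
```

In the evaluation loop, the stand-in would be replaced by the same arguments used earlier in the thread, for example ["tlt-evaluate", "detectnet_v2", "-e", "your_spec.txt", "-m", checkpoint, "-k", "your_ngc_key"].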