TAO SSD training saves model weights after each epoch

Hi, I am training an SSD model within the TAO docker container nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3.
I run the model training inside the container like this:

ssd train --gpus 1 --gpu_index=0 -e specs/ssd_train_resnet18_kitti.txt -r output/unpruned -k mykey

My specs file is attached: ssd_train_resnet18_kitti.txt (1.4 KB)

The SSD training saves the model weights after each epoch, and with each file at ~100 MB, the required disk space grows quickly. Is there any way to change that?
I would prefer to keep only the best model weights, based on the validation mAP metric.

You can run ssd evaluate xxx against each tlt model in order to get its mAP.
Then write a script to keep only the best model.
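For example, such a script could invoke `ssd evaluate` for each saved checkpoint via subprocess. This is a minimal sketch, assuming the `-e`/`-m`/`-k` flags mirror the `ssd train` invocation above (verify with `ssd evaluate --help` for your TAO version); parsing the mAP out of the command's log output is left as a further step:

```python
import glob
import os
import subprocess

def evaluate_command(spec, model, key):
    """Build the `ssd evaluate` command line for one .tlt checkpoint.

    The flag names are an assumption based on the `ssd train` call above;
    confirm them with `ssd evaluate --help`.
    """
    return ["ssd", "evaluate", "-e", spec, "-m", model, "-k", key]

def evaluate_all(spec, weights_dir, key):
    """Run evaluation for every checkpoint and collect the raw stdout logs."""
    logs = {}
    for model in sorted(glob.glob(os.path.join(weights_dir, "*.tlt"))):
        result = subprocess.run(evaluate_command(spec, model, key),
                                capture_output=True, text=True)
        logs[model] = result.stdout  # the mAP still has to be parsed from this text
    return logs
```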

The ssd train command does log the validation mAP, loss, and other metrics for each epoch to a ssd_training_log_resnet18.csv file. I use that to prune the model files I do not want to keep, but that feels backwards to me. The SSD training is implemented in TensorFlow, so it should be fairly simple to allow different ModelCheckpoint strategies. I would like this to become a configurable option for all TAO training jobs in future TAO versions.
Is there a platform where I could submit a feature request to the TAO backlog?

P.S. Attached is my implementation for pruning the extra SSD model weights: SSD_KeepOnlyTheLatestOrGratest.py (1.6 KB)

Yes, you can delete the extra tlt models according to the mAP results in the ssd_training_log_resnet18.csv file.
I will sync with the internal team about this feature request.
Currently, the forum is the platform. I will also ask the internal team whether there is a channel for end users to submit feature requests.
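A minimal sketch of such a cleanup script, assuming the training log has `epoch` and `mAP` columns (column names may differ in your TAO version) and that checkpoints are named with an `epoch_NNN` suffix as in the training output folder:

```python
import csv
import os

def best_epoch(csv_path, metric="mAP"):
    """Return (epoch, score) for the highest validation metric in the log.

    Assumes an 'epoch' column and a column named by `metric`; adjust the
    names to match the CSV your TAO version actually writes.
    """
    best = None
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = row.get(metric, "")
            if not value:  # metric may be blank on epochs without validation
                continue
            epoch, score = int(row["epoch"]), float(value)
            if best is None or score > best[1]:
                best = (epoch, score)
    return best

def prune_checkpoints(weights_dir, keep_epoch, pattern="epoch_{:03d}"):
    """Delete every .tlt file in weights_dir except the kept epoch.

    The 'epoch_NNN' naming is an assumption; check your output folder.
    """
    keep_tag = pattern.format(keep_epoch)
    for name in os.listdir(weights_dir):
        if name.endswith(".tlt") and keep_tag not in name:
            os.remove(os.path.join(weights_dir, name))
```

Run `best_epoch` on the CSV first, then pass the winning epoch to `prune_checkpoints`.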
