Retraining with pretrained TLT models

The NVIDIA forums don’t allow zip files, only jpeg, pdf, and log file types. What error are you experiencing when you try to click the OneDrive link? Here is a Google Drive link with the same file:

I’ll test out that soft_start parameter this afternoon, although in other training frameworks that use a similar learning-rate scheduler, we still don’t see such a drastic change in model performance at the beginning of re-training.
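For reference, here’s a rough sketch of how I understand these soft-start + annealing schedulers to behave. This is not taken from the TLT source; the formula and the parameter values are my assumptions based on similar frameworks:

```python
# Rough sketch of a soft-start + annealing LR schedule (my assumption of the
# behavior, not TLT's actual implementation; all values are illustrative).
def soft_start_annealing_lr(progress, min_lr=5e-6, max_lr=5e-4,
                            soft_start=0.1, annealing=0.7):
    """progress = fraction of total training completed, in [0, 1]."""
    if soft_start > 0 and progress < soft_start:
        # Exponential ramp from min_lr up to max_lr during the soft-start phase.
        return min_lr * (max_lr / min_lr) ** (progress / soft_start)
    if progress < annealing:
        # Plateau at the maximum learning rate.
        return max_lr
    # Exponential decay from max_lr back down to min_lr.
    return max_lr * (min_lr / max_lr) ** ((progress - annealing) / (1 - annealing))
```

If this is roughly what TLT does, then setting soft_start = 0.0 simply skips the ramp and trains at max_learning_rate from the very first step.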

I also disagree with your statement that my graph “proves that pretrained model helps”. You state that at epoch 50 the pre-trained model is higher than the non-pretrained model, which is true, but at epochs 80 and 85 the pre-trained model is LOWER than the non-pretrained model. I think these fluctuations are merely a result of random weight initialization.

I cannot access the drive file either. Never mind, it is not necessary to check it now.
As for the pre-trained model not loading, I am syncing with the internal team about this behavior and will update you if there are any findings.

Thanks @Morganh and I appreciate the quick responses from you!

Here is the graph where I’ve performed the same 3 experiments, but this time with soft_start = 0.0.

Not much difference at all, it still takes just as long to reach the plateau:

Hi harryhsl8c,
To update my previous comment: please set load_graph to true during retraining.
Whether the model is unpruned or pruned, if the entire model needs to be loaded during retraining, load_graph must be set to true.
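For example, the relevant snippet in the detectnet_v2 retrain spec would look something like this (the path below is just a placeholder):

```
model_config {
  # Point to the model whose entire graph should be loaded.
  pretrained_model_file: "/workspace/experiments/your_pruned_or_unpruned_model.tlt"
  load_graph: true
}
```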

Prior to posting on these forums, I had already tried multiple times with load_graph = true and with load_graph = false to see if there was a difference; there was not.

@Morganh I performed another 200 epochs of training using my pretrained tlt model as the pretrained_model_file, this time with load_graph = true. The results are the same and have been added to the graph. I attempted to use load_graph = true for the .h5 NGC model, which results in an error. There is no point in setting load_graph = true for the scenario where I use no model for pretrained_model_file.

Yesterday I was able to have a 1:1 discussion with Subhashree Radhakrishnan during the NGC “Meet with the Experts” session on the Transfer Learning Toolkit. I shared this issue and sent her a link to this page. She was unsure why TLT was behaving like this and said she would look into it.

Can we un-mark this forum post as Solved? Unless this bug is specific to me, which we haven’t proven yet, I don’t believe the original question has been sufficiently answered.

Hi harryhsl8c,
Sorry for the late reply. After checking, we have unfortunately found an issue affecting detectnet_v2 only: it cannot load the pretrained model correctly during retraining. This issue will be fixed in the next TLT release.
We appreciate your hard work and contribution.


Thanks for the update! Do you know when the next TLT release might be? Weeks? Months?

I’m not sure. Maybe several weeks.

Hi harryhsl8c,
TLT 2.0 has been released. Please use it. Thanks.

Thanks @Morganh, I plan on getting into it today.

Is the mAP per epoch stored somewhere? I’d like to make this graph as well.

The mAP results are shown during training.
You can draw this graph based on those results.
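For example, if you capture the training output to a file (e.g. tlt-train detectnet_v2 ... 2>&1 | tee training.log), a short script like the one below can extract and plot the values. The log path and the regex are only a guess at the output format; please adjust them to match what your log actually prints:

```python
# Sketch: extract per-validation mAP values from a captured training log and
# plot them. The regex is an assumption about the log format -- adjust it to
# match the exact mAP line your TLT version prints.
import re
import matplotlib.pyplot as plt

map_values = []
with open("training.log") as f:  # hypothetical path to the captured log
    for line in f:
        match = re.search(r"[Mm]ean average precision.*?([\d.]+)", line)
        if match:
            map_values.append(float(match.group(1)))

plt.plot(range(1, len(map_values) + 1), map_values, marker="o")
plt.xlabel("Validation round")
plt.ylabel("mAP (%)")
plt.title("mAP over the course of training")
plt.show()
```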