Retraining with pretrained TLT models

The NVIDIA forums don’t allow zip files, only jpeg, pdf, and log file types. What error are you experiencing when you try to click the OneDrive link? Here is a Google Drive link with the same file:

I’ll test out that soft_start parameter this afternoon, although in other training frameworks that use a similar learning-rate scheduler, we still don’t see such a drastic change in model performance at the beginning of re-training.
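For reference, here’s a rough sketch of how I understand these soft-start + annealing schedulers to behave. This is not taken from the TLT source; the formula and the parameter values are my assumptions based on similar frameworks:

```python
# Rough sketch of a soft-start + annealing LR schedule (my assumption of the
# behavior, not TLT's actual implementation; all values are illustrative).
def soft_start_annealing_lr(progress, min_lr=5e-6, max_lr=5e-4,
                            soft_start=0.1, annealing=0.7):
    """progress = fraction of total training completed, in [0, 1]."""
    if soft_start > 0 and progress < soft_start:
        # Exponential ramp from min_lr up to max_lr during the soft-start phase.
        return min_lr * (max_lr / min_lr) ** (progress / soft_start)
    if progress < annealing:
        # Plateau at the maximum learning rate.
        return max_lr
    # Exponential decay from max_lr back down to min_lr.
    return max_lr * (min_lr / max_lr) ** ((progress - annealing) / (1 - annealing))
```

If this is roughly what TLT does, then setting soft_start = 0.0 simply skips the ramp and trains at max_learning_rate from the very first step.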

I also disagree with your statement that my graph “proves that pretrained model helps”. You state that at epoch 50 the pre-trained model is higher than the non-pretrained model, which is true, but at epochs 80 and 85 the pre-trained model is LOWER than the non-pretrained model. I think these fluctuations are merely a result of random weight initialization.

I cannot access the drive file either. Never mind, it is not necessary to check it now.
As for the pre-trained model not loading, I am syncing with the internal team about this behavior and will update you if there are any findings.

Thanks @Morganh and I appreciate the quick responses from you!

Here is the graph where I’ve performed the same 3 experiments, but this time with soft_start = 0.0.

Not much difference at all, it still takes just as long to reach the plateau:

Hi harryhsl8c,
To update my previous comment: please set load_graph to true during retraining.
Whether the model is unpruned or pruned, if the entire model needs to be loaded during retraining, load_graph must be set to true.
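For example, the relevant snippet in the detectnet_v2 retrain spec would look something like this (the path below is just a placeholder):

```
model_config {
  # Point to the model whose entire graph should be loaded.
  pretrained_model_file: "/workspace/experiments/your_pruned_or_unpruned_model.tlt"
  load_graph: true
}
```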

Prior to posting on these forums, I had already tried multiple times with load_graph = true and with load_graph = false to see if there was a difference; there was not.

@Morganh I performed another 200 epochs of training using my pretrained tlt model as the pretrained_model_file, this time with load_graph = true. The results are the same and have been added to the graph. I attempted to use load_graph = true for the .h5 NGC model, which results in an error. There is no point in setting load_graph = true for the scenario where I use no model for pretrained_model_file.

Yesterday I was able to have a 1:1 discussion with Subhashree Radhakrishnan during the NGC “Meet with the Experts” session on the Transfer Learning Toolkit. I shared this issue and sent her a link to this page. She was unsure why TLT was behaving like this and said she would look into it.

Can we un-mark this forum post as Solved? Unless this bug is specific to me, which we haven’t proven yet, I don’t believe the original question has been sufficiently answered.

Hi harryhsl8c,
Sorry for the late reply. After checking, we have unfortunately found an issue affecting detectnet_v2 only: it cannot load the pretrained model correctly during retraining. This issue will be fixed in the next TLT release.
We appreciate your hard work and contribution.


Thanks for the update! Do you know when the next TLT release might be? Weeks? Months?

I’m not sure. Maybe several weeks.

Hi harryhsl8c,
TLT 2.0 has been released. Please use it. Thanks.

Thanks @Morganh, I plan on getting into it today.

Is the mAP per epoch stored somewhere? I’d like to make this graph as well.

The mAP results are shown during training.
You can draw this graph based on those results.
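For example, if you capture the training output to a file (e.g. tlt-train detectnet_v2 ... 2>&1 | tee training.log), a short script like the one below can extract and plot the values. The log path and the regex are only a guess at the output format; please adjust them to match what your log actually prints:

```python
# Sketch: extract per-validation mAP values from a captured training log and
# plot them. The regex is an assumption about the log format -- adjust it to
# match the exact mAP line your TLT version prints.
import re
import matplotlib.pyplot as plt

map_values = []
with open("training.log") as f:  # hypothetical path to the captured log
    for line in f:
        match = re.search(r"[Mm]ean average precision.*?([\d.]+)", line)
        if match:
            map_values.append(float(match.group(1)))

plt.plot(range(1, len(map_values) + 1), map_values, marker="o")
plt.xlabel("Validation round")
plt.ylabel("mAP (%)")
plt.title("mAP over the course of training")
plt.show()
```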