How to choose the best epoch for the final trained model

• Hardware (RTX 3080)
• Network Type (Detectnet_v2)
• TLT Version v3.22.05-tf1.15.5-py3

Hi,

I have trained a Resnet50_detectnet_v2 network that performs pretty well, based on the final epoch of my most successful experiment. However, there is a more useful result earlier in the same training run. I am pretty sure that there is a way to choose the epoch on which the final model is based, but I cannot find any reference to this in the current documentation.

Is this possible?
Do I need anything other than the unpruned model?
Please tell me where the documentation for this can be found.

Thank you

The documentation seems to have changed.
Can you use the link below to search for detectnet_v2?
https://docs.nvidia.com/search/index.html?page=1&sort=relevance&term=TAO%20Detectnet

You’re not wrong on that. There are 921+ documents, many of which are extremely similar and most run to at least 150 lines.

A site search for a string would be rather useful (head in hands emoji).

Please check whether "How to config tlt-train to save the best performed model - #3 by Morganh" works for you.

Thank you Morganh.

This is a great help.

Hi again,

Further to this, I am still having some issues.

Am I right to think that each of the .tlt files here should correspond to a saved epoch? I ask because there were 50 epochs for this experiment and there are 51 .tlt files. Is the first one, “model.step-0.tlt”, the pre-trained weights?

Secondly, should I be able to substitute one of the model.step-*.tlt files for the “resnet18_detector.tlt” in the Jupyter notebook evaluate cell?

I ask because I have not been able to make this work yet.
I have also been unable to make your code in the link work.
I know which epoch I am looking for.

Please advise

Can you run 1 epoch and double check? Use a new result folder.

Can you share the log?


The log says I haven’t supplied the path to the model file.

Here’s a screenshot of my tao_mounts.json. It shows the path to the detectnet_v2 folder where the “experiment_pruned_dir” is located.
Inside the “experiment_pruned_dir” are the 51 .tlts, amongst other things.
I intend to run the “model.step-6552.tlt”, which I suspect corresponds to epoch 14, once I have the cell finding the model.
I have accomplished this part of the notebook in an earlier experiment, where I was using the “experiment_dir_pruned/weights/resnet50_detector.tlt”.
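
The relationship between checkpoint file names and epochs described above can be checked with a short script. This is only a sketch under assumptions: it assumes one checkpoint was saved per epoch, that “model.step-0.tlt” is the state before any training (the poster's theory, not confirmed in this thread), and that there are 468 steps per epoch (inferred from 6552 / 14). The helper names are hypothetical.

```python
import re

# Matches names such as "model.step-6552.tlt" (pattern taken from the thread).
STEP_PATTERN = re.compile(r"^model\.step-(\d+)\.tlt$")


def steps_from_filenames(filenames):
    """Return the sorted step numbers found in checkpoint file names."""
    steps = []
    for name in filenames:
        match = STEP_PATTERN.match(name)
        if match:  # skip files such as resnet50_detector.tlt
            steps.append(int(match.group(1)))
    return sorted(steps)


def epoch_of_step(steps, step):
    """Return the epoch of a step, assuming one checkpoint per epoch
    and that step 0 is the state before training (epoch 0)."""
    return steps.index(step)


# Hypothetical listing: 50 epochs plus the step-0 file, 468 steps per epoch
# (an assumption derived from 6552 / 14).
names = [f"model.step-{468 * epoch}.tlt" for epoch in range(51)]
steps = steps_from_filenames(names)
print(len(steps))                  # 51
print(epoch_of_step(steps, 6552))  # 14
```

With a real results folder, the list of names could instead be built with `pathlib.Path(folder).glob("model.step-*.tlt")`, taking `.name` of each match.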

Please add a space after
image

Thanks for that. Sadly the issue persists.

Please try to delete this line.

Morganh, you are an absolute star.

Also that is the correct .tlt, so my earlier theory was correct: “model.step-0.tlt” is probably the pre-trained weights…

I did not realise that a commented-out line of code could still influence a notebook cell to that extent. Many thanks.
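
For anyone hitting the same symptom, here is one plausible mechanism, sketched in Python. The thread does not show the offending line, so this assumes the cell held a single shell command spread over several lines with trailing backslashes. In that case a `#` in the middle comments out everything after it, including the model argument. The command `some_tool` and its `--spec` and `--model` flags are hypothetical placeholders, and the function is a simplified model rather than the exact behaviour of IPython or any particular shell.

```python
import shlex


def effective_args(cell_text):
    """Simplified model of how a backslash-continued command is parsed."""
    # A backslash followed by a newline joins the next line onto this one.
    joined = cell_text.replace("\\\n", "")
    # comments=True drops everything from '#' to the end of the (joined) line.
    return shlex.split(joined, comments=True)


# Hypothetical cell with a commented-out line in the middle of the command.
broken = (
    "some_tool evaluate --spec spec.txt \\\n"
    "    # --model old_model.tlt \\\n"
    "    --model model.step-6552.tlt\n"
)

# The same cell with the commented-out line deleted.
fixed = (
    "some_tool evaluate --spec spec.txt \\\n"
    "    --model model.step-6552.tlt\n"
)

print(effective_args(broken))  # the --model argument never reaches the command
print(effective_args(fixed))   # the --model argument is present
```

Under this model the command in the first cell receives no model path at all, which would match a log complaining that the path to the model file was not supplied.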
