Poor metric results after retraining maskrcnn using TLT notebook

@ghazni
After checking, your previous learning rate (0.005) is fine.

But you need to increase total_steps.
total_steps = total_images * total_epochs / batch_size / nGPUs

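Your batch size is 2, the same as the blog’s.
You are training on 2 GPUs, while the blog uses 8 GPUs.
Your total_steps is 100k, and the blog’s total_steps is also 100k.

With 4x fewer GPUs, the same number of steps covers only a quarter of the training images, so please increase total_steps to 400k.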
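
The arithmetic behind the 400k figure can be sketched as follows. This is a minimal illustration, assuming the goal is to keep the total number of training images seen (steps x per-GPU batch size x number of GPUs) the same as in the blog's run. The function name is hypothetical and not part of TLT.

```python
def scale_total_steps(ref_steps, ref_batch_size, ref_gpus, batch_size, gpus):
    """Return the step count that processes the same number of images.

    Assumption: batch_size is per GPU, so one step consumes
    batch_size * gpus images (this follows the total_steps formula above).
    """
    total_images_seen = ref_steps * ref_batch_size * ref_gpus
    return total_images_seen // (batch_size * gpus)


# Blog run: 100k steps, batch size 2, 8 GPUs.
# This thread: batch size 2, 2 GPUs.
new_steps = scale_total_steps(100_000, 2, 8, batch_size=2, gpus=2)
print(new_steps)  # 400000
```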

Thanks. I’ll run the training again.

To double-check my calculations, what is the value of total_epochs, and where do I get it from?

I haven’t found this anywhere (paper, spec file or blog). Thank you.

For MaskRCNN, total_epochs is virtual. You can treat total_images * total_epochs as the total number of training images seen during your training.

Thanks. What about learning_rate_steps: "[60000, 80000, 90000]"?

In the blog they are designed for 100K iterations. What should their values be for 400K iterations?

You can try 4x.
"[240000, 320000, 360000]"
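
As a small sketch of the 4x suggestion: the decay milestones are multiplied by the same factor as total_steps, so each learning-rate drop happens at the same relative point in training. The helper name is hypothetical.

```python
def scale_milestones(milestones, factor):
    """Multiply each learning-rate decay step by an integer factor."""
    return [step * factor for step in milestones]


blog_milestones = [60000, 80000, 90000]   # designed for 100k total steps
factor = 400_000 // 100_000               # new total_steps / blog total_steps
print(scale_milestones(blog_milestones, factor))  # [240000, 320000, 360000]
```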

I have done the training for 400K iterations. The log file and the spec file are shared under:

https://drive.google.com/drive/folders/1DkZjYIu1TmUZAuqIZZuiBrtS9a6SDWyw?usp=sharing

The key summary from the log.txt (produced by TLT system after training for 400k iterations):

AP: 0.221518874
AP50: 0.388619840
AP75: 0.225120232

This is 66% of the figures published in the blog (link in my original post), which is the same result as for 100k iterations. It looks like going from 100k to 400k iterations because the GPUs were reduced from 8 to 2 might not be the best course of action. If results converged in the blog at 100k iterations (8 GPU setup), then they should converge at 100k in my setup as well (2 GPU setup). The only difference should be that if the blog’s training finished in 4 hours, mine would finish in 16 hours.

Changing init_learning_rate from 0.02 (8 GPUs) to 0.005 (2 GPUs) made sense, but sadly it didn’t have much impact on the outcome.
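
For reference, the 0.02 to 0.005 change matches the common linear scaling heuristic, where the learning rate is scaled in proportion to the global batch size. The sketch below assumes that heuristic and a per-GPU batch size of 2 in both setups; it is an illustration, not a rule documented by TLT, and the function name is hypothetical.

```python
def scale_learning_rate(ref_lr, ref_gpus, gpus, ref_batch_size=2, batch_size=2):
    """Scale the learning rate linearly with the global batch size."""
    ref_global_batch = ref_batch_size * ref_gpus   # 2 * 8 = 16 in the blog
    global_batch = batch_size * gpus               # 2 * 2 = 4 in this thread
    return ref_lr * global_batch / ref_global_batch


lr_2gpu = scale_learning_rate(0.02, ref_gpus=8, gpus=2)
print(lr_2gpu)  # approximately 0.005
```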

Could you please collect and provide feedback on how the results in blog can be reproduced? Thanks.

Regards,
Ghazni

PS: listing only AP values from log.txt for 40 evaluation rounds (after each 10000 iterations)

AP: 0.053609539 (after 10k iterations)
AP: 0.099530540 (after 20k iterations)
AP: 0.127319917 (after 30k iterations)
AP: 0.142913073 (after 40k iterations)
AP: 0.155241147 (after 50k iterations)
AP: 0.166217625 (after 60k iterations)
AP: 0.171844959 (after 70k iterations)
AP: 0.177070886 (after 80k iterations)
AP: 0.177705094 (after 90k iterations)
AP: 0.184269711 (after 100k iterations)
AP: 0.183399275 (after 110k iterations)
AP: 0.185981736 (after 120k iterations)
AP: 0.187121913 (after 130k iterations)
AP: 0.188551322 (after 140k iterations)
AP: 0.190065354 (after 150k iterations)
AP: 0.190615743 (after 160k iterations)
AP: 0.192069024 (after 170k iterations)
AP: 0.187974632 (after 180k iterations)
AP: 0.191580266 (after 190k iterations)
AP: 0.189789638 (after 200k iterations)
AP: 0.194854811 (after 210k iterations)
AP: 0.190808654 (after 220k iterations)
AP: 0.192894056 (after 230k iterations)
AP: 0.197109401 (after 240k iterations)
AP: 0.222581357 (after 250k iterations)
AP: 0.222711533 (after 260k iterations)
AP: 0.223115027 (after 270k iterations)
AP: 0.221604973 (after 280k iterations)
AP: 0.222837299 (after 290k iterations)
AP: 0.223410621 (after 300k iterations)
AP: 0.220184788 (after 310k iterations)
AP: 0.220856011 (after 320k iterations)
AP: 0.221117750 (after 330k iterations)
AP: 0.221274659 (after 340k iterations)
AP: 0.221165419 (after 350k iterations)
AP: 0.221567094 (after 360k iterations)
AP: 0.221825302 (after 370k iterations)
AP: 0.222074255 (after 380k iterations)
AP: 0.221849650 (after 390k iterations)
AP: 0.221518874 (after 400k iterations)
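
A short script can summarize the list above. It is a sketch only: the AP values are copied from the log lines above, the blog AP is the 0.334154785 figure quoted in this thread, and the 240k milestone assumes the run used the suggested "[240000, 320000, 360000]" learning_rate_steps.

```python
# AP values copied from the list above, one per evaluation (every 10k steps).
ap_values = [
    0.053609539, 0.099530540, 0.127319917, 0.142913073,
    0.155241147, 0.166217625, 0.171844959, 0.177070886,
    0.177705094, 0.184269711, 0.183399275, 0.185981736,
    0.187121913, 0.188551322, 0.190065354, 0.190615743,
    0.192069024, 0.187974632, 0.191580266, 0.189789638,
    0.194854811, 0.190808654, 0.192894056, 0.197109401,
    0.222581357, 0.222711533, 0.223115027, 0.221604973,
    0.222837299, 0.223410621, 0.220184788, 0.220856011,
    0.221117750, 0.221274659, 0.221165419, 0.221567094,
    0.221825302, 0.222074255, 0.221849650, 0.221518874,
]
steps = [10_000 * (i + 1) for i in range(len(ap_values))]
blog_ap = 0.334154785  # blog's AP as quoted in this thread

# Best checkpoint and how the final AP compares with the blog's figure.
best_ap, best_step = max(zip(ap_values, steps))
final_ratio = ap_values[-1] / blog_ap

# AP gain across the assumed first learning-rate decay milestone (240k).
idx_240k = steps.index(240_000)
jump = ap_values[idx_240k + 1] - ap_values[idx_240k]

print(f"best AP {best_ap:.4f} at step {best_step}")
print(f"final AP is {final_ratio:.0%} of the blog's AP")
print(f"AP gain from 240k to 250k: {jump:.4f}")
```

On these numbers, most of the late improvement arrives in the single evaluation right after 240k steps, and AP stays flat around 0.22 afterwards.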

Thanks for the update. I will dig into this further.

Update: After changing warmup_steps in the blog’s spec from 0 to 1000, the AP is 0.31344375, which is close to the blog’s AP (0.334154785).
Note that I am training on 8 GPUs (V100).

Spec file uploaded for reference.
maskrcnn_blog_finetune.txt (2.1 KB)

Thank you. I’ll do another training run with this spec. The only difference now is "init_learning_rate: 0.005", because my setup has 2 GPUs.

I have added more info to my previous comment: I am training on 8 GPUs (V100).
I will continue to work out a spec for 2 GPUs.

Many thanks for clarifying.

I hope I now understand correctly: you are getting 0.30+ results on an 8 GPU (V100) setup, presumably on a single-class problem, but haven’t yet tried it with 2 GPUs.

At the end of the day these are calculations, so accuracy should not depend on the number of GPUs. The time to perform those calculations would definitely change, and it taking 4 times longer is understandable.

Ok I’ll wait for your comments. Thanks again for looking into this.

For 2 GPUs, please try training with the spec below. Per the latest result from the NVIDIA internal team, training with 2 GPUs (V100) reaches an AP of 33.2 in the end.

seed: 123
use_amp: False
warmup_steps: 50000
checkpoint: "/workspace/tlt-experiments/mask_rcnn/resnet50.hdf5"
learning_rate_steps: "[360000, 540000]"
learning_rate_decay_levels: "[0.1, 0.01]"
total_steps: 720000
train_batch_size: 2
eval_batch_size: 8
num_steps_per_eval: 60000
momentum: 0.9
l2_weight_decay: 0.00002
warmup_learning_rate: 0.00001
init_learning_rate: 0.005
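
To see what learning rate this spec implies at a given step, here is a rough sketch. It assumes a linear warmup from warmup_learning_rate to init_learning_rate over warmup_steps, and that each entry of learning_rate_decay_levels multiplies init_learning_rate once the matching learning_rate_steps entry is reached. Both are assumptions about how the schedule behaves, not confirmed TLT internals, and the function is hypothetical.

```python
def learning_rate_at(step,
                     init_lr=0.005,
                     warmup_lr=0.00001,
                     warmup_steps=50_000,
                     milestones=(360_000, 540_000),
                     decay_levels=(0.1, 0.01)):
    """Approximate learning rate at a training step for the spec above."""
    # Assumed: linear ramp from warmup_lr to init_lr during warmup.
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_lr + frac * (init_lr - warmup_lr)
    # Assumed: piecewise-constant decay, each level relative to init_lr.
    lr = init_lr
    for milestone, level in zip(milestones, decay_levels):
        if step >= milestone:
            lr = init_lr * level
    return lr


for s in (0, 25_000, 50_000, 360_000, 540_000, 719_999):
    print(s, learning_rate_at(s))
```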

Many thanks for sharing the results.

Could you please share the logs produced by tlt-train in the internal run? Training this spec on my setup is going to take about a week (6-7 days), so the logs will help me understand and align the convergence. Thank you.

Unfortunately, the training log was not saved on his side. Below is a tensorboard graph.

joblog.zip (1.3 MB) Sharing one log from a run on my side.

Many thanks Morganh

Last night my run also completed with AP around 0.33


Thanks for the info. I will close this topic.

Could you please define these parameters for 1 GPU?

Hi p.vahidinia,

Please open a new topic with more details of your issue. Thanks.
