Poor metric results after retraining maskrcnn using TLT notebook

Operating System: Ubuntu 18.04
TLT: nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3
GPUs: 2
GPU Spec: RTX 2080 Ti

I have followed the following guide, as well as all the instructions in the notebook, as is:

My only changes are:

From (original notebook: specs/maskrcnn_train_resnet50.txt):
learning_rate_steps: "[10000, 15000, 20000]"
total_steps: 25000
init_learning_rate: 0.01

To (used for transfer learning: specs/maskrcnn_train_resnet50.txt):
learning_rate_steps: "[60000, 80000, 100000]"
total_steps: 120000
init_learning_rate: 0.005

However, the TLT metric results are quite poor compared to the ones published in the article (https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/).

[MaskRCNN] INFO : AP: 0.204600319
[MaskRCNN] INFO : AP50: 0.369061261
[MaskRCNN] INFO : AP75: 0.203879535
[MaskRCNN] INFO : APl: 0.276150644
[MaskRCNN] INFO : APm: 0.210601568
[MaskRCNN] INFO : APs: 0.121632718
[MaskRCNN] INFO : ARl: 0.480189115
[MaskRCNN] INFO : ARm: 0.361170918
[MaskRCNN] INFO : ARmax1: 0.213508740
[MaskRCNN] INFO : ARmax10: 0.346155792
[MaskRCNN] INFO : ARmax100: 0.361603230
[MaskRCNN] INFO : ARs: 0.210817903
[MaskRCNN] INFO : mask_AP: 0.195189938
[MaskRCNN] INFO : mask_AP50: 0.337781459
[MaskRCNN] INFO : mask_AP75: 0.193039805
[MaskRCNN] INFO : mask_APl: 0.277761608
[MaskRCNN] INFO : mask_APm: 0.201220781
[MaskRCNN] INFO : mask_APs: 0.104112484
[MaskRCNN] INFO : mask_ARl: 0.446928442
[MaskRCNN] INFO : mask_ARm: 0.345584035
[MaskRCNN] INFO : mask_ARmax1: 0.202073544
[MaskRCNN] INFO : mask_ARmax10: 0.321329713
[MaskRCNN] INFO : mask_ARmax100: 0.332932264
[MaskRCNN] INFO : mask_ARs: 0.184002846

How do I get better results that are closer to the ones published in the article?


Could you paste your training spec? Thanks.

Also, if possible, could you try a learning rate of 0.02 or 0.01?

Just to make sure we are on the same page: the key goal of TLT is to train and fine-tune a model using the user’s own dataset, but at the moment I am not using any dataset of my own. I am trying to reproduce the results published in the article. I have rerun the training to make sure I am running the exact setup. Unfortunately, I have only achieved 66% of the performance published in the blog (https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/).

I could try other learning rates, but we need to be sure that this is the best thing to do next. For now it makes sense to follow the author’s instructions, so I followed the exact steps. The 0.02 learning rate is for 8 GPUs (Nvidia’s setup). My setup has 2 GPUs, so the learning rate must be 0.005, i.e. (0.02/8)*2, following a linear scaling rule.
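
The linear scaling arithmetic above can be sketched in a few lines of Python. The function name and arguments are hypothetical and are not part of TLT; the sketch only keeps the learning rate per GPU constant when the GPU count changes.

```python
def scale_learning_rate(reference_lr: float, reference_gpus: int, target_gpus: int) -> float:
    """Linear scaling rule: keep the learning rate per GPU constant."""
    per_gpu_lr = reference_lr / reference_gpus
    return per_gpu_lr * target_gpus


# Blog setup: init_learning_rate 0.02 on 8 GPUs. This setup: 2 GPUs.
lr_2gpu = scale_learning_rate(reference_lr=0.02, reference_gpus=8, target_gpus=2)
print(f"init_learning_rate for 2 GPUs: {lr_2gpu}")
```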

Final metrics I have after 100K iterations are:
[MaskRCNN] INFO : AP: 0.211550236
[MaskRCNN] INFO : AP50: 0.383451581
[MaskRCNN] INFO : AP75: 0.209822893
[MaskRCNN] INFO : APl: 0.291358769
[MaskRCNN] INFO : APm: 0.219301566
[MaskRCNN] INFO : APs: 0.096314505
[MaskRCNN] INFO : ARl: 0.535470128
[MaskRCNN] INFO : ARm: 0.410021156
[MaskRCNN] INFO : ARmax1: 0.225859717
[MaskRCNN] INFO : ARmax10: 0.370030344
[MaskRCNN] INFO : ARmax100: 0.388801515
[MaskRCNN] INFO : ARs: 0.209091455
[MaskRCNN] INFO : mask_AP: 0.202502176
[MaskRCNN] INFO : mask_AP50: 0.355926216
[MaskRCNN] INFO : mask_AP75: 0.206854612
[MaskRCNN] INFO : mask_APl: 0.293817312
[MaskRCNN] INFO : mask_APm: 0.210412666
[MaskRCNN] INFO : mask_APs: 0.082449332
[MaskRCNN] INFO : mask_ARl: 0.514913499
[MaskRCNN] INFO : mask_ARm: 0.385058582
[MaskRCNN] INFO : mask_ARmax1: 0.220792443
[MaskRCNN] INFO : mask_ARmax10: 0.347017348
[MaskRCNN] INFO : mask_ARmax100: 0.362540752
[MaskRCNN] INFO : mask_ARs: 0.185705140

I have also noticed that the results did not oscillate and appear to have converged asymptotically, e.g. below are the AP values after every 10000 iterations:

[MaskRCNN] INFO : AP: 0.057683785 (10K iterations)
[MaskRCNN] INFO : AP: 0.098032616 (20K iterations)
[MaskRCNN] INFO : AP: 0.125270441 (30K iterations)
[MaskRCNN] INFO : AP: 0.148156375 (40K iterations)
[MaskRCNN] INFO : AP: 0.160243452 (50K iterations)
[MaskRCNN] INFO : AP: 0.169225559 (60K iterations)
[MaskRCNN] INFO : AP: 0.203981400 (70K iterations)
[MaskRCNN] INFO : AP: 0.207072541 (80K iterations)
[MaskRCNN] INFO : AP: 0.211082965 (90K iterations)
[MaskRCNN] INFO : AP: 0.211550236 (100K iterations)
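
To see how the gain per evaluation changes over the run, the AP lines can be pulled out of the log and differenced. This is a minimal sketch, assuming the log keeps the `[MaskRCNN] INFO : AP:` line format shown above and that evaluations happen every `num_steps_per_eval` steps (10000 in my spec). The regular expression and variable names are my own.

```python
import re

# AP lines copied from the run above (step annotations dropped), plus one
# mask_AP line to show that it is ignored by the pattern.
LOG_TEXT = """\
[MaskRCNN] INFO : AP: 0.057683785
[MaskRCNN] INFO : AP: 0.098032616
[MaskRCNN] INFO : AP: 0.125270441
[MaskRCNN] INFO : AP: 0.148156375
[MaskRCNN] INFO : AP: 0.160243452
[MaskRCNN] INFO : AP: 0.169225559
[MaskRCNN] INFO : AP: 0.203981400
[MaskRCNN] INFO : AP: 0.207072541
[MaskRCNN] INFO : AP: 0.211082965
[MaskRCNN] INFO : AP: 0.211550236
[MaskRCNN] INFO : mask_AP: 0.202502176
"""

NUM_STEPS_PER_EVAL = 10000  # num_steps_per_eval from the spec file

# Match only the box "AP:" lines, not "AP50:", "APl:", "mask_AP:", etc.
AP_LINE = re.compile(r"^\[MaskRCNN\]\s+INFO\s*:\s*AP:\s*([0-9.]+)", re.MULTILINE)

aps = [float(value) for value in AP_LINE.findall(LOG_TEXT)]

# gains[step] is the AP change between the previous evaluation and `step`.
gains = {}
for i in range(1, len(aps)):
    step = (i + 1) * NUM_STEPS_PER_EVAL
    gains[step] = aps[i] - aps[i - 1]

for step, gain in gains.items():
    print(f"{step:>7d}  AP={aps[step // NUM_STEPS_PER_EVAL - 1]:.4f}  gain={gain:+.4f}")
```

In this run the gain between 60K and 70K is noticeably larger than the one before it, which lines up with the first entry of learning_rate_steps (60000) in my spec.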

The metric published in the blog:
AP: 0.334154785

The contents of spec file (maskrcnn_train_resnet50.txt) are:

seed: 123
use_amp: False
warmup_steps: 0
checkpoint: "/workspace/tlt-experiments/maskrcnn/model/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"

learning_rate_steps: "[60000, 80000, 90000]"
learning_rate_decay_levels: "[0.1, 0.01, 0.001]"
total_steps: 100000
train_batch_size: 2
eval_batch_size: 8
num_steps_per_eval: 10000
momentum: 0.9
l2_weight_decay: 0.00002
warmup_learning_rate: 0.0001
init_learning_rate: 0.005

image_size: "(832, 1344)"
augment_input_data: True
eval_samples: 5000
training_file_pattern: "/workspace/tlt-experiments/maskrcnn/data/train*.tfrecord"
validation_file_pattern: "/workspace/tlt-experiments/maskrcnn/data/val*.tfrecord"
val_json_file: "/workspace/tlt-experiments/maskrcnn/data/annotations/instances_val2017.json"

# dataset specific parameters
num_classes: 91
skip_crowd_during_training: True


maskrcnn_config {
nlayers: 50
arch: "resnet"
freeze_bn: True
freeze_blocks: "[0,1]"
gt_mask_size: 112

# Region Proposal Network
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_batch_size_per_im: 256
rpn_fg_fraction: 0.5
rpn_min_size: 0.

# Proposal layer.
batch_size_per_im: 512
fg_fraction: 0.25
fg_thresh: 0.5
bg_thresh_hi: 0.5
bg_thresh_lo: 0.

# Faster-RCNN heads.
fast_rcnn_mlp_head_dim: 1024
bbox_reg_weights: "(10., 10., 5., 5.)"

# Mask-RCNN heads.
include_mask: True
mrcnn_resolution: 28

# training
train_rpn_pre_nms_topn: 2000
train_rpn_post_nms_topn: 1000
train_rpn_nms_threshold: 0.7

# evaluation
test_detections_per_image: 100
test_nms: 0.5
test_rpn_pre_nms_topn: 1000
test_rpn_post_nms_topn: 1000
test_rpn_nms_thresh: 0.7

# model architecture
min_level: 2
max_level: 6
num_scales: 1
aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
anchor_scale: 8

# localization loss
rpn_box_loss_weight: 1.0
fast_rcnn_box_loss_weight: 1.0
mrcnn_weight_loss_mask: 1.0
}


After checking, your previous learning rate (0.005) is fine.

But you need to increase total_steps.
total_steps = total_images * total_epochs / batch_size / nGPUs

Your batch size is 2, the same as the blog’s.
You are training on 2 GPUs, while the blog used 8 GPUs.
Your total_steps is 100k, and the blog’s total_steps is also 100k.

So, please increase total_steps to 400k.
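
The reasoning behind the 400k figure can be sketched as follows: keep the total number of training images seen the same as in the reference run, then divide by the images consumed per step in the new setup. This sketch assumes train_batch_size is a per-GPU value, as the formula above implies. The function and argument names are hypothetical.

```python
def scaled_total_steps(reference_steps: int, reference_gpus: int, reference_batch: int,
                       target_gpus: int, target_batch: int) -> int:
    """Return the step count that sees as many images as the reference run."""
    # Images processed over the whole reference run (per-GPU batch assumed).
    images_seen = reference_steps * reference_gpus * reference_batch
    # Images processed per step in the target setup.
    images_per_step = target_gpus * target_batch
    return images_seen // images_per_step


# Blog: 100k steps, 8 GPUs, batch size 2. This thread: 2 GPUs, batch size 2.
total_steps = scaled_total_steps(100_000, 8, 2, target_gpus=2, target_batch=2)
print(f"total_steps: {total_steps}")
```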

Thanks. I’ll run the training again.

To double-check my calculations, what is the value of total_epochs? Where do I get the total_epochs value from?

I haven’t found this anywhere (paper, spec file or blog). Thank you.

For MaskRCNN, total_epochs is a virtual quantity. You can consider total_images * total_epochs as the total number of training images seen during your training.

Thanks. What about learning_rate_steps: "[60000, 80000, 90000]"?

In the blog they are designed for 100K iterations. What should their values be for 400K iterations?

You can try 4x.
"[240000, 320000, 360000]"
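
The 4x scaling can also be applied mechanically to the string stored in the spec file. This is a small sketch with a hypothetical helper name. It only assumes the value is a Python-style list literal, as in the spec above.

```python
import ast


def scale_steps(spec_value: str, factor: int) -> str:
    """Multiply every boundary in a learning_rate_steps string by `factor`."""
    steps = ast.literal_eval(spec_value)  # safely parse "[60000, 80000, 90000]"
    return str([step * factor for step in steps])


scaled = scale_steps("[60000, 80000, 90000]", 4)
print(f'learning_rate_steps: "{scaled}"')
```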

I have done the training for 400K iterations. The log file and the spec file are shared under:


The key summary from the log.txt (produced by TLT system after training for 400k iterations):

AP: 0.221518874
AP50: 0.388619840
AP75: 0.225120232

This is 66% of the figures published in the blog (link in my original post), which is about the same result as for the 100k iterations. It seems that going from 100k to 400k iterations because the GPU count dropped from 8 to 2 might not be the best course of action. If the results converged in 100k iterations in the blog (8 GPU setup), then they should converge in 100k iterations in my setup as well (2 GPU setup). The only difference should be that if the blog’s training finished in 4 hours, then mine will finish in 16 hours.

Changing init_learning_rate from 0.02 (8 GPUs) to 0.005 (2 GPUs) made sense, but sadly it didn’t have much impact on the outcome.

Could you please look into this and provide feedback on how the results in the blog can be reproduced? Thanks.


PS: listing only AP values from log.txt for 40 evaluation rounds (after each 10000 iterations)

AP: 0.053609539 (after 10k iterations)
AP: 0.099530540 (after 20k iterations)
AP: 0.127319917 (after 30k iterations)
AP: 0.142913073 (after 40k iterations)
AP: 0.155241147 (after 50k iterations)
AP: 0.166217625 (after 60k iterations)
AP: 0.171844959 (after 70k iterations)
AP: 0.177070886 (after 80k iterations)
AP: 0.177705094 (after 90k iterations)
AP: 0.184269711 (after 100k iterations)
AP: 0.183399275 (after 110k iterations)
AP: 0.185981736 (after 120k iterations)
AP: 0.187121913 (after 130k iterations)
AP: 0.188551322 (after 140k iterations)
AP: 0.190065354 (after 150k iterations)
AP: 0.190615743 (after 160k iterations)
AP: 0.192069024 (after 170k iterations)
AP: 0.187974632 (after 180k iterations)
AP: 0.191580266 (after 190k iterations)
AP: 0.189789638 (after 200k iterations)
AP: 0.194854811 (after 210k iterations)
AP: 0.190808654 (after 220k iterations)
AP: 0.192894056 (after 230k iterations)
AP: 0.197109401 (after 240k iterations)
AP: 0.222581357 (after 250k iterations)
AP: 0.222711533 (after 260k iterations)
AP: 0.223115027 (after 270k iterations)
AP: 0.221604973 (after 280k iterations)
AP: 0.222837299 (after 290k iterations)
AP: 0.223410621 (after 300k iterations)
AP: 0.220184788 (after 310k iterations)
AP: 0.220856011 (after 320k iterations)
AP: 0.221117750 (after 330k iterations)
AP: 0.221274659 (after 340k iterations)
AP: 0.221165419 (after 350k iterations)
AP: 0.221567094 (after 360k iterations)
AP: 0.221825302 (after 370k iterations)
AP: 0.222074255 (after 380k iterations)
AP: 0.221849650 (after 390k iterations)
AP: 0.221518874 (after 400k iterations)

Thanks for the update. I will dig further.

Update: After changing the blog’s warmup_steps from 0 to 1000, the AP is 0.31344375, which is close to the blog’s AP (0.334154785).
Note that I am training on 8 GPUs (V100).

I have uploaded the spec file for reference.
maskrcnn_blog_finetune.txt (2.1 KB)

Thank you. I’ll do another training run using this spec. The only difference will be “init_learning_rate: 0.005”, because my setup has 2 GPUs.

I have added more info to my previous comment: I am training on 8 GPUs (V100).
I will continue to work out a spec for 2 GPUs.

Many thanks for clarifying.

I hope I understand you correctly now: you are getting 0.30+ results in an 8 GPU (V100) setup, presumably on a single-class problem, but haven’t yet tried it with 2 GPUs.

At the end of the day these are calculations, so accuracy should not depend on the number of GPUs. The time to perform those calculations would of course change and be about 4 times longer, which is understandable.

Ok I’ll wait for your comments. Thanks again for looking into this.

For 2 GPUs, please try training with the spec below. Per the latest result from the Nvidia internal team, when training with 2 GPUs (V100) the AP reaches 33.2 in the end.

seed: 123
use_amp: False
warmup_steps: 50000
checkpoint: "/workspace/tlt-experiments/mask_rcnn/resnet50.hdf5"
learning_rate_steps: "[360000, 540000]"
learning_rate_decay_levels: "[0.1, 0.01]"
total_steps: 720000
train_batch_size: 2
eval_batch_size: 8
num_steps_per_eval: 60000
momentum: 0.9
l2_weight_decay: 0.00002
warmup_learning_rate: 0.00001
init_learning_rate: 0.005
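
To get a feel for what this spec does to the learning rate over the 720000 steps, here is a rough sketch of the schedule. It rests on two assumptions that are not confirmed anywhere in this thread: that warmup is a linear ramp from warmup_learning_rate to init_learning_rate, and that each entry of learning_rate_decay_levels multiplies init_learning_rate once the matching step boundary is reached. The function itself is hypothetical and is not TLT code.

```python
def learning_rate_at(step: int,
                     init_lr: float = 0.005,
                     warmup_lr: float = 0.00001,
                     warmup_steps: int = 50000,
                     decay_steps: tuple = (360000, 540000),
                     decay_levels: tuple = (0.1, 0.01)) -> float:
    """Approximate learning rate at `step` under the assumptions stated above."""
    if step < warmup_steps:
        # ASSUMPTION: linear ramp from warmup_lr up to init_lr.
        fraction = step / warmup_steps
        return warmup_lr + fraction * (init_lr - warmup_lr)
    lr = init_lr
    # ASSUMPTION: piecewise-constant decay, each level relative to init_lr.
    for boundary, level in zip(decay_steps, decay_levels):
        if step >= boundary:
            lr = init_lr * level
    return lr


for step in (0, 25000, 50000, 360000, 540000, 719999):
    print(f"step {step:>6d}: lr = {learning_rate_at(step):.7f}")
```

If the actual schedule in TLT differs, only the two marked branches would need to change.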

Many thanks for sharing the results.

Could you please share the logs produced by tlt-train in internal run? Training in my setup for this spec is going to take about a week (6-7 days) so logs will help me in understanding/aligning the convergence. Thank you.

Unfortunately, the training log was not saved on his side. Below is a TensorBoard graph.

joblog.zip (1.3 MB) Sharing one log from a run on my side.

Many thanks Morganh

Last night my run also completed with AP around 0.33
