Maskrcnn.ipynb - followed the notebook instructions and ended up with a poor (almost untrained) network

Please provide the following information when requesting support.

• Hardware 1080ti
• Network Type MaskRCNN
• TLT Version 3.0
• Training spec file: resnet50 in maskrcnn examples
• How to reproduce the issue ? Followed the steps in the example using the COCO dataset and the supposedly pre-trained MRCNN ResNet50 model from NVIDIA, from tlt_cv_samples_v1.1.0

Any assistance on getting this basic training example to work?

One would expect that running TLT on an existing model already trained on COCO would yield a network not much different from the original. I followed NVIDIA's existing example, per the TLT quick-start.

From that notebook I ran the cell:

!tlt mask_rcnn evaluate -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                        -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/model.step-$NUM_STEP.tlt \
                        -k $KEY

The eval step produced this mAP:

[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Evaluation Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #

[MaskRCNN] INFO : Average throughput: 4.0 samples/sec
[MaskRCNN] INFO : Total processed steps: 125
[MaskRCNN] INFO : Total processing time: 0.0h 08m 12s
[MaskRCNN] INFO : ==================== Metrics ====================
[MaskRCNN] INFO : AP: 0.067873947
[MaskRCNN] INFO : AP50: 0.127887100
[MaskRCNN] INFO : AP75: 0.066672139
[MaskRCNN] INFO : APl: 0.102755368
[MaskRCNN] INFO : APm: 0.091018304
[MaskRCNN] INFO : APs: 0.028439207
[MaskRCNN] INFO : ARl: 0.303018242
[MaskRCNN] INFO : ARm: 0.195692867
[MaskRCNN] INFO : ARmax1: 0.102624930
[MaskRCNN] INFO : ARmax10: 0.187838435
[MaskRCNN] INFO : ARmax100: 0.193126634
[MaskRCNN] INFO : ARs: 0.066572912
[MaskRCNN] INFO : mask_AP: 0.063729405
[MaskRCNN] INFO : mask_AP50: 0.116767384
[MaskRCNN] INFO : mask_AP75: 0.061803028
[MaskRCNN] INFO : mask_APl: 0.101339318
[MaskRCNN] INFO : mask_APm: 0.082119964
[MaskRCNN] INFO : mask_APs: 0.020874247
[MaskRCNN] INFO : mask_ARl: 0.275262952
[MaskRCNN] INFO : mask_ARm: 0.182158113
[MaskRCNN] INFO : mask_ARmax1: 0.094126195
[MaskRCNN] INFO : mask_ARmax10: 0.168800414
[MaskRCNN] INFO : mask_ARmax100: 0.172687247
[MaskRCNN] INFO : mask_ARs: 0.052210771

See Poor metric results after retraining maskrcnn using TLT notebook - #17 by Morganh; it needs longer training. Refer to the spec and the log I shared in that forum topic.

I read that. It makes no sense that transfer learning from a good model should require that much additional training on the very dataset it was trained on.

Can you tell me how to get the pre-trained MRCNN model? I actually do not want to train COCO from scratch.

Thanks

Let me clarify. See NVIDIA TAO Documentation: the pre-trained weights are trained on a subset of the Google OpenImages dataset and are used as a starting point.

Currently we cannot release pre-trained weights trained on the COCO dataset, due to legal reasons.

Thanks - NVIDIA should really fix this issue, either by licensing or by investing in its own COCO-comparable datasets. I’m surprised at how bad the transfer-learned model is on COCO; it is not a refinement but a significant retraining. I found a blog describing a rather time-expensive approach of training on ImageNet first and then training Mask R-CNN again. Also, I am confused, as I just found this NGC Catalog Entry, which looks like a partially trained model on COCO2017 in the model catalog. Are there instructions on how to use this checkpoint as a starting point for TLT training?

The NGC Catalog Entry mentioned by you is not compatible with TLT.

In TLT, there is a purpose-built model, PeopleSegNet. See NVIDIA TAO Documentation. The model detects one or more “person” objects within an image and returns a box around each object, as well as a segmentation mask for each object.
This model is based on MaskRCNN with ResNet50 as its feature extractor.

I’m trying your recipe on COCO with MRCNN ResNet50:

seed: 123
use_amp: False
warmup_steps: 50000
checkpoint: "/workspace/tlt-experiments/mask_rcnn/resnet50.hdf5"
learning_rate_steps: "[360000, 540000]"
learning_rate_decay_levels: "[0.1, 0.01]"
total_steps: 720000
train_batch_size: 2
eval_batch_size: 8
num_steps_per_eval: 60000
momentum: 0.9
l2_weight_decay: 0.00002
warmup_learning_rate: 0.00001
init_learning_rate: 0.005
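
As a quick sanity check on what total_steps means in epochs, here is a back-of-the-envelope Python sketch. It assumes COCO train2017 (118,287 images) and a 2-GPU setup with the per-GPU batch size from the spec; adjust num_gpus to your actual hardware:

# Back-of-the-envelope: how many COCO epochs does total_steps cover?
coco_train2017_images = 118_287  # size of the COCO 2017 train split
train_batch_size = 2             # per-GPU batch size, from the spec above
num_gpus = 2                     # assumption - set to your actual GPU count

effective_batch = train_batch_size * num_gpus
steps_per_epoch = coco_train2017_images / effective_batch
epochs = 720_000 / steps_per_epoch  # total_steps from the spec

print(f"~{steps_per_epoch:,.0f} steps/epoch -> ~{epochs:.0f} epochs")

At 4 images per step that works out to roughly 24 epochs, which is in line with common from-scratch COCO schedules and helps explain why the much shorter notebook run looked almost untrained.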

It is making progress…

Is there a way to emit loss and other data from TLT train for TensorBoard? When I ran TensorBoard on the log dir, it only seems to pick up the AP-metric data from the eval intervals. Thanks

The loss can only be found in the training log.
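
If you want the loss curves in TensorBoard anyway, one workaround is to scrape the training log and re-emit the losses as scalar summaries. Below is a minimal Python sketch; the log path and the regex are assumptions and will need adjusting to the actual format of the TLT log lines:

import re
import tensorflow as tf

LOG_FILE = "train.log"       # path to the captured TLT training log (assumption)
TB_DIR = "tb_scraped_loss"   # directory for TensorBoard to watch

# Hypothetical pattern - adapt it to whatever the real loss lines look like.
LOSS_RE = re.compile(r"step\D+(\d+).*?loss\D+([0-9.]+)", re.IGNORECASE)

writer = tf.summary.create_file_writer(TB_DIR)
with writer.as_default(), open(LOG_FILE) as f:
    for line in f:
        m = LOSS_RE.search(line)
        if m:
            tf.summary.scalar("train/loss", float(m.group(2)), step=int(m.group(1)))
writer.flush()

Then point TensorBoard at the output: tensorboard --logdir tb_scraped_loss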

Thanks. I have tried the spec provided elsewhere for retraining MaskRCNN using TLT. So far, after several experiments on hyperparameters, I am stuck at AP ~ 0.22; manually cycling the LR and other tweaks have not induced much further progress beyond 100K steps. I have two GPUs, batch size 2. If you have a more detailed spec for training up Mask R-CNN on COCO per the Jupyter notebook setup, please let me know.

BTW - the TLT training mechanism is not very transparent. Is it purely SGD, and can one plug in some sort of adaptive algorithm to adjust the LR, etc.?

For the spec and log, please see Poor metric results after retraining maskrcnn using TLT notebook - #17 by Morganh and Poor metric results after retraining maskrcnn using TLT notebook - #20 by Morganh

For the learning rate, you can track it in the training log. There are two phases: first, a linear warmup up to init_learning_rate; second, the learning rate is reduced according to the specified learning_rate_steps and learning_rate_decay_levels.
If AMP is not used, MomentumOptimizer is used.
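
For illustration, a minimal Python sketch of that two-phase schedule, using the values from the spec above (the exact warmup interpolation inside TLT is an assumption here):

def lr_at(step,
          warmup_learning_rate=1e-5,
          init_learning_rate=0.005,
          warmup_steps=50_000,
          learning_rate_steps=(360_000, 540_000),
          learning_rate_decay_levels=(0.1, 0.01)):
    # Phase 1: linear ramp from warmup_learning_rate up to init_learning_rate.
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_learning_rate + frac * (init_learning_rate - warmup_learning_rate)
    # Phase 2: step decay - scale init LR by the level of the last boundary passed.
    lr = init_learning_rate
    for boundary, level in zip(learning_rate_steps, learning_rate_decay_levels):
        if step >= boundary:
            lr = init_learning_rate * level
    return lr

for s in (0, 25_000, 50_000, 300_000, 400_000, 600_000):
    print(s, lr_at(s))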