Please provide the following information when requesting support.
• Hardware: GTX 1080 Ti
• Network type: MaskRCNN
• TLT version: 3.0
• Training spec file: the ResNet-50 spec from the MaskRCNN examples
• How to reproduce the issue: followed the steps in the example (from tlt_cv_samples_v1.1.0) using the COCO dataset and what I understood to be the pre-trained MaskRCNN ResNet-50 model from NVIDIA.
Any assistance in getting this basic training example to work would be appreciated.
One would expect that running TLT on the existing model against COCO would produce a network not much different from the original. I followed the existing NVIDIA example, step by step, per the TLT quick-start.
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Evaluation Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Average throughput: 4.0 samples/sec
[MaskRCNN] INFO : Total processed steps: 125
[MaskRCNN] INFO : Total processing time: 0.0h 08m 12s
[MaskRCNN] INFO : ==================== Metrics ====================
[MaskRCNN] INFO : AP: 0.067873947
[MaskRCNN] INFO : AP50: 0.127887100
[MaskRCNN] INFO : AP75: 0.066672139
[MaskRCNN] INFO : APl: 0.102755368
[MaskRCNN] INFO : APm: 0.091018304
[MaskRCNN] INFO : APs: 0.028439207
[MaskRCNN] INFO : ARl: 0.303018242
[MaskRCNN] INFO : ARm: 0.195692867
[MaskRCNN] INFO : ARmax1: 0.102624930
[MaskRCNN] INFO : ARmax10: 0.187838435
[MaskRCNN] INFO : ARmax100: 0.193126634
[MaskRCNN] INFO : ARs: 0.066572912
[MaskRCNN] INFO : mask_AP: 0.063729405
[MaskRCNN] INFO : mask_AP50: 0.116767384
[MaskRCNN] INFO : mask_AP75: 0.061803028
[MaskRCNN] INFO : mask_APl: 0.101339318
[MaskRCNN] INFO : mask_APm: 0.082119964
[MaskRCNN] INFO : mask_APs: 0.020874247
[MaskRCNN] INFO : mask_ARl: 0.275262952
[MaskRCNN] INFO : mask_ARm: 0.182158113
[MaskRCNN] INFO : mask_ARmax1: 0.094126195
[MaskRCNN] INFO : mask_ARmax10: 0.168800414
[MaskRCNN] INFO : mask_ARmax100: 0.172687247
[MaskRCNN] INFO : mask_ARs: 0.052210771
Let me clarify. See the NVIDIA TAO documentation: the pre-trained weights are trained on a subset of the Google OpenImages dataset, and are intended only as a starting point.
Currently we cannot release pre-trained weights trained on the COCO dataset, due to legal reasons.
Thanks - NVIDIA should really fix this, either by licensing COCO-trained weights or by investing in a comparable dataset of its own. I’m surprised at how poor the transfer-learned model is on COCO - this is not a refinement but a significant re-training. I found a blog describing a rather time-expensive approach: training on ImageNet first, then training Mask R-CNN again. I am also confused because I just found an NGC Catalog entry that looks like a partially trained model on COCO2017 in the model catalog. Are there instructions on how to use this checkpoint as a starting point for TLT training?
In TLT, there is a purpose-built model, PeopleSegNet. See the NVIDIA TAO documentation. The model detects one or more “person” objects within an image and returns a bounding box around each object, as well as a segmentation mask for each object.
This model is based on MaskRCNN with ResNet50 as its feature extractor.
Is there a way to emit loss and other data out of TLT train for TensorBoard? When I point TensorBoard at the logdir, it only seems to see the AP-metric data tagged at the eval intervals. Thanks
Thanks. I have tried the spec provided elsewhere for retraining MaskRCNN using TLT. So far, after several hyperparameter experiments, I am stuck at AP ≈ 0.22, and manually cycling the learning rate has not induced much further progress beyond 100K steps. I have two GPUs and batch size 2. If you have a more detailed spec for training MaskRCNN on COCO per the Jupyter notebook setup provided, please let me know.
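For context, this is the shape of the training portion of the spec I have been varying (field names follow the TLT MaskRCNN spec format; the values are just from my current experiment, not recommendations):

```
seed: 123
use_amp: False
warmup_steps: 1000
init_learning_rate: 0.005
learning_rate_steps: "[60000, 80000, 100000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.002]"
total_steps: 120000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 5000
```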
BTW - the TLT training mechanism is not very transparent. Is it purely SGD, and can one plug in some sort of adaptive algorithm to adjust the learning rate, etc.?
For the learning rate, you can track it in the training log. There are two phases: first, a linear warmup up to init_learning_rate; second, the learning rate is reduced according to the specified learning_rate_steps and learning_rate_decay_levels.
If AMP is not used, MomentumOptimizer is used.
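The resulting schedule can be sketched as follows. This is a minimal illustration of warmup-then-step-decay, under the assumptions that warmup ramps linearly from zero and that each learning_rate_decay_levels entry is a multiplier applied to init_learning_rate once the corresponding learning_rate_steps boundary is passed (it is not the actual TLT implementation):

```python
def lr_at_step(step, init_lr, warmup_steps, lr_steps, decay_levels):
    """Piecewise learning-rate schedule: linear warmup, then step decay."""
    if step < warmup_steps:
        # Phase 1: ramp linearly from 0 up to init_lr over warmup_steps.
        return init_lr * step / warmup_steps
    lr = init_lr
    for boundary, level in zip(lr_steps, decay_levels):
        if step >= boundary:
            # Phase 2: drop to init_lr * level after each boundary step.
            lr = init_lr * level
    return lr

# e.g. with init_learning_rate=0.01, warmup_steps=1000,
# learning_rate_steps=[60000, 80000], learning_rate_decay_levels=[0.1, 0.01]:
print(lr_at_step(500, 0.01, 1000, [60000, 80000], [0.1, 0.01]))    # halfway through warmup
print(lr_at_step(70000, 0.01, 1000, [60000, 80000], [0.1, 0.01]))  # after the first decay
```

Plotting this function over the full step range is a quick way to sanity-check a spec before launching a multi-day training run.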