Poor performance of MaskRCNN on images

Over the last few weeks I have trained the model and achieved the following results (which are in line with the blog published earlier on MaskRCNN):

AP: 0.330607116
AP50: 0.532350004
AP75: 0.354983419
APl: 0.450090349
APm: 0.347090811
APs: 0.174515933
ARl: 0.655892372
ARm: 0.527796328
ARmax1: 0.294067711
ARmax10: 0.472680926
ARmax100: 0.498188853
ARs: 0.307220429
mask_AP: 0.303248495
mask_AP50: 0.500997007
mask_AP75: 0.322716862
mask_APl: 0.430260897
mask_APm: 0.319809288
mask_APs: 0.143792033
mask_ARl: 0.623010099
mask_ARm: 0.487232208
mask_ARmax1: 0.277731419
mask_ARmax10: 0.433856368
mask_ARmax100: 0.455417335
mask_ARs: 0.262938172

However, when I ran inference the results were quite disappointing. I ran inference using the following command in the notebook:

!tlt-infer mask_rcnn --trt \
-i $DATA_DOWNLOAD_DIR/raw-data/test2017 \
-o $USER_EXPERIMENT_DIR/maskrcnn_annotated_images \
-e $SPECS_DIR/maskrcnn_train_resnet50.txt \
-m $USER_EXPERIMENT_DIR/export/model.step-720000.engine \
-l $USER_EXPERIMENT_DIR/maskrcnn_annotated_labels \
-c $SPECS_DIR/coco_labels.txt \
-t 0.5

Please see the link below for the results:


My requests are:

1: Are there any settings/options that I can additionally apply to get better segmentation results?

2: Is there a demo application that can be used to test segmentation of pictures in the DeepStream environment? I am aware of the demo application for video (I am already using it for analysis at the moment). I am just wondering whether there is any app that can be used, or tweaked, for segmentation of pictures.

Many thanks.

  1. Firstly, could you give more details about what is “disappointing” in the result pictures?
  2. See Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation

My apologies. I should have been clearer.

1: I have now placed two images in the link: https://drive.google.com/drive/folders/1ElF9NwIvUtD94QNhMEmz76h2bG5ArQ_t?usp=sharing


I have also marked the file with black crosses (X) in the areas that are not segmented (but should have been). Generally this is true for most cases, especially around edges (e.g. part of the head, arm or legs does not get segmented).

2: Many thanks for the link. Yes, I have used that link and was able to process video (i.e. provide video as input, and show the output as OSD or write it to a file, etc.). My request was how to provide a picture as input (e.g. input.jpg) and get a picture as output (e.g. output_segmented.jpg).

3: I forgot to ask something earlier. tlt-infer also generates another file (a JSON file, in this case 14.json; also shared at the above link). This file has some segmentation numbers:

"area": 4917250,
"bbox": [
"category_id": 1,
"id": 0,
"image_id": "14.jpg",
"is_crowd": 0,
"segmentation": [

What do area and segmentation mean?

Thank you.

  1. I will check further
  2. Please try GitHub - NVIDIA-AI-IOT/deepstream_tao_apps: Sample apps to demonstrate how to deploy models trained with TAO on DeepStream

For 1), is the 14_original.jpg from the coco2017/test folder? I just checked several jpg files under the coco2017 test folder. Mostly they get segmented.

BTW, I ran tlt-infer against the tlt model instead of the TRT engine. See below.

tlt-infer mask_rcnn -i /workspace/tlt-experiments/mask_rcnn/coco/raw-data/test2017_small_part \
-o ./maskrcnn_annotated_images \
-e /workspace/tlt-experiments/mask_rcnn/spec_blog_2gpu.txt \
-m /workspace/tlt-experiments/mask_rcnn/result_blog_2gpu/model.step-720000.tlt \
-l ./coco_labels.txt \
-t 0.5 \
-b 2 \
-k key

1: No, it is not in the coco2017/test folder. It is a simple input file without a complicated background. If we are doing well only on MS COCO but not on other images, then somehow we have overfitted the model on MS COCO by doing 720k iterations,

unless of course I am missing something.

2: OK, I’ll try model.step-720000.tlt as well in case that makes a difference in accuracy. Apparently the TRT engine is more optimized from a performance point of view.

3: tlt-infer also generates another file (a JSON file, in this case 14.json; also shared at the above link). This file has some area and segmentation numbers:

"area": 4917250,
"bbox": [
"category_id": 1,
"id": 0,
"image_id": "14.jpg",
"is_crowd": 0,
"segmentation": [

What do area and segmentation mean?

  1. Your model is trained with the coco2017 training folder, and I am currently running inference against the coco2017 test folder. It is not overfitted, since the inference result seems to be fine. For images outside the coco dataset, I suggest running a new training against the new dataset.
  2. The area means width * height. bbox[3988,1070,1513,3250] means the bbox’s [x1, y1, height, width]. So, actually, 4917250 = 1513 x 3250.
    The segmentation is the polygon pixel value of the segmentation.
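
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the dictionary copies the two fields quoted from 14.json in this thread, and bbox_area is a hypothetical helper name, not something provided by tlt-infer. It simply multiplies the last two bbox entries, so it does not depend on which of them is the width and which is the height.

```python
# Fields copied from the 14.json excerpt quoted in this thread.
annotation = {
    "area": 4917250,
    "bbox": [3988, 1070, 1513, 3250],
}


def bbox_area(bbox):
    """Return the box area as the product of the two size entries.

    Hypothetical helper: bbox[0] and bbox[1] are the corner position,
    bbox[2] and bbox[3] are the two side lengths in pixels.
    """
    _, _, side_a, side_b = bbox
    return side_a * side_b


computed = bbox_area(annotation["bbox"])
print(computed, computed == annotation["area"])
```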

Many thanks for your comments. I’ll run the extended inference on the coco2017 test folder as well.

I am a little bit confused though, so your additional expert opinion will be much appreciated.

1: If the model is already trained for a particular class, e.g. person, then what is the reason for additional training for the same class? I can understand the rationale for additional training if the class is new, e.g. if the model is trained for car and we need to do inference on person, then yes, additional training is required.

In this particular case of person, the coco2017 training folder has a wide range of pictures of persons, so I am wondering what more can be done.

2: Thanks. I am assuming every single value in the array corresponds to one pixel, e.g. 5205.0 is for one pixel in the polygon? I am wondering how 5205.0 maps to the 3 bytes of RGB (if the image is a JPG) or the 4 bytes of RGBA (if the image is a PNG with an alpha channel). The maximum value each byte can hold is 255 (unsigned).

Thank you.

  1. Sorry for the confusion. What I mean is that it is necessary to trigger a new full training with your own dataset instead of coco2017. As we know, a model trained on dataset_1 may not give the expected inference result against dataset_2.
    Also, you can run an additional experiment to check whether something is wrong in tlt-infer. Try running “tlt-infer mask_rcnn” instead of “tlt-infer mask_rcnn --trt” and compare the results.

  2. Sorry, my bad. It is not a pixel value. It is the coordinate value of the segmentation.
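
To illustrate why a value such as 5205.0 can exceed 255: it is a position in the image, not a colour. The sketch below assumes the JSON follows the usual COCO polygon convention, a flat list [x1, y1, x2, y2, ...]; that tlt-infer writes exactly this layout is an assumption, and the example coordinates and helper names are made up for illustration.

```python
def polygon_points(flat):
    """Group a flat [x1, y1, x2, y2, ...] list into (x, y) tuples."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of coordinate values")
    return list(zip(flat[0::2], flat[1::2]))


def polygon_area(points):
    """Area enclosed by the polygon (shoelace formula), in square pixels."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Made-up polygon: a 100 x 50 pixel rectangle with its corner at (10, 20).
flat = [10.0, 20.0, 110.0, 20.0, 110.0, 70.0, 10.0, 70.0]
points = polygon_points(flat)
print(points)
print(polygon_area(points))
```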

Thank you for further comments on this.

Earlier we did MaskRCNN training on coco2017 for 720K iterations (image resolutions of 640x520, 640x420, etc.) and the spec file says

image_size: "(832, 1344)"

I am assuming this means the input to the network must be 832x1344, which means that during training the images were rescaled from 640x425 (the size in coco2017) to 1344x832 (the size required by the network being trained), so the images are resized and the annotations are changed accordingly by the TLT framework. Is this a correct understanding of what happens behind the scenes when training the MaskRCNN network?

In the context of training for a brand new dataset:
Suppose I want to do instance-segmentation inference for the class person on high-quality larger images, say 4K (resolution 4096 × 2160). What should be the minimum resolution of the images in the new dataset? I understand that the more images, the better. However, what is the minimum number of training images required to get some decent results?

Thank you.

Yes, during training, there is resizing. But the input does not have to be 1344x832.
See below from the tlt user guide.


  • Input size: C * W * H (where C = 3, W >= 128, H >= 128 and W, H are multiples of 32)
  • Image format: JPG
  • Label format: COCO detection

For training images, it is hard to determine the minimum number. I suggest you try a small subset for an initial training experiment; if the result is not as expected, then increase it and run more experiments.

Normally, you can set it to 4096x2144 or 4096x2176. But since there is an issue (see Error running MaskRCNN inference after custom training - #5 by Morganh), please set it to 4096x2176.
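
The two candidate heights come from the multiple-of-32 rule quoted above: 2160 is not a multiple of 32, so it has to be rounded down or up. A minimal sketch of that arithmetic follows; the helper names are hypothetical and only the rule itself (at least 128, multiple of 32) comes from the user guide excerpt.

```python
def nearest_multiples(value, step=32):
    """Return the closest multiples of `step` at or below and at or above `value`."""
    lower = (value // step) * step
    upper = lower if lower == value else lower + step
    return lower, upper


def is_valid_input_size(width, height, step=32, minimum=128):
    """Check the rule quoted from the user guide: W, H >= 128 and multiples of 32."""
    return all(d >= minimum and d % step == 0 for d in (width, height))


print(nearest_multiples(2160))          # candidate heights for a 4096 x 2160 image
print(nearest_multiples(4096))          # 4096 is already a multiple of 32
print(is_valid_input_size(4096, 2160))  # the raw 4K size does not satisfy the rule
print(is_valid_input_size(4096, 2176))
```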

I have been trying to figure out the best way to do training. Here is what I am trying to do:

  1. My inputs:
    Input 1: COCO Instance Segmentation Baselines with Mask R-CNN (already available from Nvidia model repo)
    Input 2: My new custom dataset of ‘cat’ class (high resolution images)

  2. Expected outputs:
    Transfer learn the baseline model “COCO Instance Segmentation Baselines with Mask R-CNN” with new custom high-quality/high-resolution data for one particular class, e.g. ‘cat’, which is already in cocodataset. However, the goal is to retain the ability to segment the rest of the classes as well. That is, the transfer-learned model should perform better on the ‘cat’ class thanks to the new custom training images, while retaining the ability to segment the other classes at the quality of the baseline model.

The class label for this new transfer learning is the same as one in cocodataset, e.g. ‘cat’. What is the best way to do this transfer learning using TLT?

Option 1: develop a new dataset in coco format (with the label ‘cat’, and do transfer learning or retraining using the methods in the TLT documentation). Would there be any conflict with what has already been learned, since that was learned at a different resolution and the new training images will have a higher resolution?

Option 2: develop a new dataset in coco format with a NEW LABEL, ‘mycat’, which will be a new class. In this case there is no conflict with existing classes, but I don’t know whether, in the process of retraining, the new model might lose some of what has already been learned.

Could you please advise which of the options is best for refining one of the existing classes in cocodataset using transfer learning via TLT?
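
For reference, a dataset “in coco format” for Option 1 could look roughly like the sketch below. It follows the public COCO annotation layout (images, annotations, categories) and reuses COCO's existing ‘cat’ category id (17 in the official 2017 annotations). The file name, image size, coordinates and helper name are made up, the area is simplified to the box area, and whether TLT also needs the other COCO categories listed is not covered in this thread.

```python
import json

# Assumption: reuse COCO's existing "cat" category rather than adding a new label.
CAT_CATEGORY = {"id": 17, "name": "cat", "supercategory": "animal"}


def make_annotation(ann_id, image_id, bbox, polygon):
    """Build one COCO-style instance annotation (hypothetical helper)."""
    x, y, w, h = bbox  # COCO convention: [x, y, width, height]
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": CAT_CATEGORY["id"],
        "bbox": [x, y, w, h],
        "area": w * h,  # simplification: COCO normally stores the mask area
        "iscrowd": 0,
        "segmentation": [polygon],  # flat [x1, y1, x2, y2, ...] list
    }


dataset = {
    "images": [
        # Made-up example image entry.
        {"id": 1, "file_name": "my_cat_0001.jpg", "width": 1920, "height": 1080},
    ],
    "annotations": [
        make_annotation(
            ann_id=1,
            image_id=1,
            bbox=[100, 200, 400, 300],
            polygon=[100, 200, 500, 200, 500, 500, 100, 500],
        ),
    ],
    "categories": [CAT_CATEGORY],
}

print(json.dumps(dataset, indent=2))
```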

Many thanks.

Firstly, one question about your expected output. You want it to “perform better on ‘cat’ class”. Does the ‘cat’ class mean
1) new cat dataset with high quality/resolution
or 2) existing coco cat dataset
or 3) existing coco cat dataset + new cat dataset with high quality/resolution

Which dataset will you care about most when running inference?

The overall objective is to improve the segmentation of the ‘cat’ class. By improve I mean improving the quality of the generated masks (as a result of instance segmentation).

1: Ideally I am envisioning the existing ‘cat’ class of cocodataset,

meaning: use the existing ‘cat’ class of cocodataset and refine the instance segmentation of the ‘cat’ class without losing the capability of the learned weights for all the other classes in cocodataset.

I care most about cocodataset. The intention is to improve the results for specific classes by adding more high-resolution images, and a greater variety of images, for those specific classes and using transfer learning.

I am approaching this problem as:

1: Take the existing network already trained on cocodataset (train2017)
2: Take the pretrained weights
3: Develop a new dataset for the ‘cat’ label/class (this label is the same as in the original cocodataset) with a greater variety of images at higher resolution

Retrain the network using TLT to do transfer learning and get better results, perhaps thanks to the variety of images, but not thanks to the resolution, because the resolution will be reduced to the level at which the network was originally trained. Is that correct?

I understand that the network has to be trained from scratch for a specific resolution such as 1080p if the expectation is that inference should also happen at 1080p without reducing the resolution. In that case I’ll lose all the other classes of cocodataset, if I am correct.

So basically, is there any way I could keep the transfer learning element but extend the transfer learning training for one particular class to a higher resolution, e.g. 1920x1080, as opposed to the originally trained resolution of 1344x832? My initial thought was that it is not possible, but then I have limited knowledge.

Actually, you can refine the cat class with more of your own images. But I think it is not related to the resolution; it is just due to the additional images you add to the cat class, because the cat class inside the coco dataset already has a variety of resolutions. So, even if you add some 1920x1080 images, it is still training a 1344x832 network. Yes, you can take the existing network you have trained as the pretrained tlt model and do transfer learning.
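
To make the point about resolution concrete, here is a small sketch of how a 1920x1080 image might be scaled to fit the 1344x832 network input. It assumes an aspect-ratio-preserving resize (with the remainder padded), which is a common choice in Mask R-CNN pipelines but is not confirmed in this thread; TLT's actual preprocessing may differ, and fit_to_network is a hypothetical helper.

```python
def fit_to_network(img_w, img_h, net_w=1344, net_h=832):
    """Scale (img_w, img_h) to fit inside (net_w, net_h), keeping the aspect ratio.

    Assumption: the smaller of the two scale factors is applied to both sides,
    and any leftover space in the network input would be padded.
    """
    scale = min(net_w / img_w, net_h / img_h)
    return round(img_w * scale), round(img_h * scale), scale


for w, h in [(1920, 1080), (640, 425)]:
    new_w, new_h, scale = fit_to_network(w, h)
    print(f"{w}x{h} -> {new_w}x{new_h} (scale {scale:.3f})")
```

Under this assumption the 1920x1080 image is shrunk before it reaches the network, while a typical 640x425 COCO image is enlarged, so both end up at a comparable size.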