MaskRCNN engine generates poor results when changing the number of anchor aspect_ratios


I am using the TAO framework to train a Mask R-CNN model. My objective is to reduce the inference latency. I am already using a resnet10 backbone and fp16 precision, and I have reduced the number of proposals and outputs of the model. After export, the TensorRT engine generates very good predictions, corresponding to high AP values (above 80%).

To further reduce the latency, I changed the number of anchors generated by the model by changing the list of possible aspect ratios. By default, the model uses 3 different ratios: aspect_ratios: “[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]”. Keeping only one ratio in my configuration file, “[(1.0, 1.0)]”, I get almost the same metrics (AP still above 80%) during training and evaluation. However, after exporting the model and using the engine file instead of model.tlt, I get very poor qualitative results (AP would probably be below 30%). This is unfortunate, since removing these aspect ratios reduces the latency by approximately 20%.
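For readers wondering why the number of aspect ratios affects latency: the number of anchors per feature-map location is proportional to the number of ratios, so dropping from 3 ratios to 1 divides the anchor count by 3. The sketch below illustrates that arithmetic only. The pyramid levels (2 to 6), num_scales of 1, and the 512x512 input size are assumptions chosen for illustration and are not taken from the attached configuration files.

```python
from ast import literal_eval


def count_anchors(image_hw, aspect_ratios, min_level=2, max_level=6, num_scales=1):
    """Count anchors over all pyramid levels for one image.

    min_level, max_level and num_scales are assumed values; check your own spec.
    """
    height, width = image_hw
    per_location = num_scales * len(aspect_ratios)
    total = 0
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        # Feature-map size at this level (ceiling division).
        feat_h = -(-height // stride)
        feat_w = -(-width // stride)
        total += feat_h * feat_w * per_location
    return total


three_ratios = literal_eval("[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]")
one_ratio = literal_eval("[(1.0, 1.0)]")
size = (512, 512)  # hypothetical input size

print(count_anchors(size, three_ratios), count_anchors(size, one_ratio))
```

Fewer anchors means fewer candidate boxes to score and filter before the detection head, which is consistent with the latency gain reported above.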

To reproduce this issue, I am using the mask_rcnn docker provided in nvidia-tao version 0.1.19. I don’t know if it is relevant here but my GPU and drivers are the following:
GPU Type : RTX3060
Nvidia Driver Version : 495.29.05
CUDA Version : 11.5
The only difference between the working configuration and the failing one is the line defining the anchor aspect ratios in the configuration file: aspect_ratios: “[(1.0, 1.0), (3.0, 0.3), (0.3, 3.0)]” is replaced by aspect_ratios: “[(1.0, 1.0)]”.
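One way to confirm that two spec files really differ only in that line is to diff their text. This is a minimal sketch using Python's standard difflib. The two spec fragments are hypothetical stand-ins, not the contents of the attached files, and the file text is assumed to use straight double quotes.

```python
import difflib


def changed_lines(spec_a, spec_b):
    """Return only the lines that were removed ("- ") or added ("+ ")."""
    diff = difflib.ndiff(spec_a.splitlines(), spec_b.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical fragments standing in for the two configuration files.
spec_a = 'num_scales: 1\naspect_ratios: "[(1.0, 1.0), (3.0, 0.3), (0.3, 3.0)]"\n'
spec_b = 'num_scales: 1\naspect_ratios: "[(1.0, 1.0)]"\n'

for line in changed_lines(spec_a, spec_b):
    print(line)
```

If anything other than the aspect_ratios line shows up in the output, the two experiments are not directly comparable.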

Is it possible to export a TensorRT engine using a different number of aspect ratios than the 3 provided in the configuration file?

We recommend raising this query in the TLT forum for better assistance.


Could you please run pruning and retraining to improve the inference speed?
Refer to MaskRCNN — TAO Toolkit 3.21.11 documentation

Can you double-check the AP result with “[(1.0, 1.0)]”?
Your result:
tlt model: AP is above 80%
trt engine: AP is below 30%

Thank you for your reply. I already considered pruning, and I can improve the inference speed by about 10% with stable AP. To further reduce the latency, another option is int8 inference, but I could not export an engine with satisfactory inference speed. My guess is that some of the Mask R-CNN layers are not available with int8 precision. I could also reduce the input size, but AP decreases significantly.

I observed that there is about a 20% latency difference between an empty/random image generated by trtexec and inference on “real” images. This can be explained by the number of proposals to filter out with this kind of architecture, and that’s why I tried to reduce the number of anchor ratios.

To be honest, I checked and reproduced the results many times to be sure that it was not a configuration problem. Using “[(1.0, 1.0)]”, the training / evaluation AP measured by “tao mask_rcnn train / evaluate” is very good, similar to the one obtained with “[(1.0, 1.0), (3.0, 0.3), (0.3, 3.0)]” (respectively 87% and 88% for AP75). However, the engine exported from the model with fewer ratios generates very bad predictions, unusable in practice (I would say AP below 30). I am not able to measure AP using the engine directly: “tao mask_rcnn evaluate” outputs the error “The pruned model must be retrained first” even though I did not prune the model.

That is not expected. Usually the .tlt model should have an AP similar to that of the .trt engine.
Can you set a lower threshold and retry?
Also, how did you check the AP of the .trt engine? Can you share the full command and full log?

That’s not expected. Please share the full command and full log as well.

The lower threshold (I tried 0.01) does not change anything; most predictions have very low confidence values. To check the AP of the .tlt model, I use the command line

!tao mask_rcnn evaluate -e $CONFIG_DIR/$EXP_NAME.txt \
                        -m $RESULT_DIR/$EXP_NAME/model.step-$NUM_STEP.tlt \
                        -k $KEY

where NUM_STEP is the number of training iterations and EXP_NAME the base name of the configuration file used for the experiment. Please find the configuration files for the two models (only difference is the aspect_ratios values), as well as the full log obtained when evaluating the model with only one aspect_ratio.
resnet18.txt (1.9 KB)
resnet18-b.txt (1.9 KB)
resnet18-b-log.txt (379.1 KB)

I export the models using the following command

!tao mask_rcnn export -k $KEY \
                      -e $CONFIG_DIR/$EXP_NAME.txt \
                      -m $RESULT_DIR/$EXP_NAME/model.step-$NUM_STEP.tlt \
                      -o $EXPORT_DIR/$EXP_NAME/$EXP_NAME.etlt \
                      --engine_file $EXPORT_DIR/$EXP_NAME/$EXP_NAME.engine \
                      --batch_size 2 \
                      --data_type fp16

However, in the case of the second model, the .engine outputs bad predictions. The behaviour persists when I change data_type to fp32, change the batch size, or set other aspect_ratios values. The only way to get correct predictions is to set exactly 3 aspect ratios.

Finally, to measure AP using the generated engines I am using the following command

!tao mask_rcnn evaluate -e $CONFIG_DIR/$EXP_NAME.txt \
                        -m $EXPORT_DIR/$EXP_NAME/$EXP_NAME.engine \
                        -k $KEY

The -k parameter is described as unnecessary, but the command fails if it is omitted. I have also attached the full output of this command, which is the same for every engine I am testing. The output is the same if I replace the .engine with the .etlt model and, as said earlier, I did not prune these models at all.
eval_engine.txt (1.6 KB)

Firstly, may I know how you checked the inference speed?
Can you share your method, command, etc.?

In the early tests I used trtexec with the following command

trtexec --loadEngine=resnet18.engine --batch=2

Now I measure inference using TensorRT directly in my project:

    auto start = high_resolution_clock::now();
    bool status = mContext->execute(batchSize, mBuffer->getDeviceBindings().data());
    auto stop = high_resolution_clock::now();

I measure the same GPU latency with both methods, but I have to feed the engine with empty images in my own code since inference speed is slower on real images. If I measure inference speed on the engine generated with fewer anchor ratios, the measured time decreases from about 26 ms to 20 ms. However, I am not able to use the engine outputs since the predictions are not accurate any more.
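Since the latency depends on the image content, a single timed call can be misleading. Below is a minimal sketch of a repeatable measurement with warm-up iterations and a median over several runs. The callable run_once is a hypothetical wrapper around one synchronous inference call on a real image; it is not part of TensorRT or TAO. With an asynchronous call, the stream would have to be synchronised before the timer is stopped.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=10, runs=100):
    """Time a synchronous callable and return (median_ms, all_samples_ms).

    run_once is a hypothetical zero-argument callable that performs
    exactly one blocking inference.
    """
    # Warm-up iterations are executed but not timed.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), samples


# Usage with a placeholder workload instead of a real engine.
median_ms, samples_ms = measure_latency_ms(lambda: sum(range(1000)), warmup=2, runs=5)
print(f"median latency: {median_ms:.3f} ms over {len(samples_ms)} runs")
```

Running the same measurement on empty images and on real images, for both engines, would give four comparable numbers instead of a single before/after figure.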

I need to try to reproduce your error. May I know if you are using a public dataset or not?

Unfortunately, I am using a private dataset. However, it would help if you could check on your side whether you can reproduce this behavior. If changing the number of aspect ratios on your side does not impact the engine accuracy after export, there may be something I am missing.

If the .engine or .etlt model does not contain ‘step’, it will raise this error.

Also, may I know how you got the result of “I would say below AP below 30”? Did you run tao inference and then check with your own script?

To change aspect_ratios, please modify TensorRT/tlt_mrcnn_config.h at main · NVIDIA/TensorRT ( and then build a new . Then replace the current one.
Reference: YOLOv4 — TAO Toolkit 3.21.11 documentation

I ran tao inference and looked at the generated images and masks. Since there is at most one correct segmentation mask per image, and there are about 3 objects per image on average, I am quite sure that the corresponding AP is below 30. But the exact value is not relevant here; it was only to emphasize that I was confident in the qualitative and quantitative difference between both models.

Thank you for your help. I will give it a try when I have some time. If I understand correctly, I should directly modify the hard-coded aspect_ratios in tlt_mrcnn_config? Does this mean that the value set in the training config file is not correctly loaded/adjusted at runtime?

Yes, for the case of running inference against tensorrt engine.
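Given that answer, a simple guard before export is to compare the number of aspect ratios in the training spec with the number the TensorRT plugin was built for. The sketch below only parses the spec text. The expected count of 3 comes from the observation earlier in this thread that only configurations with exactly 3 ratios work; the actual constants must be read from tlt_mrcnn_config.h and are not reproduced here. The function names are hypothetical, and the spec file is assumed to use straight double quotes.

```python
import ast
import re


def read_aspect_ratios(spec_text):
    """Extract the aspect_ratios list from the text of a training spec."""
    match = re.search(r'aspect_ratios\s*:\s*"([^"]*)"', spec_text)
    if match is None:
        raise ValueError("no aspect_ratios entry found in spec")
    return ast.literal_eval(match.group(1))


def matches_plugin(spec_text, plugin_ratio_count=3):
    """Check the spec against the ratio count assumed for the plugin build.

    plugin_ratio_count is an assumption to be replaced by whatever the
    plugin header in use actually defines.
    """
    return len(read_aspect_ratios(spec_text)) == plugin_ratio_count


spec = 'aspect_ratios: "[(1.0, 1.0)]"'
if not matches_plugin(spec):
    print("Spec and plugin disagree on the number of aspect ratios; "
          "rebuild the plugin before trusting the engine output.")
```

A count check does not catch the case where the count matches but the values differ, so the ratio values themselves should also be compared once they are known.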