Lower FPS compared to the unpruned model for the pruned MaskRCNN model

• Hardware : A5000
• Network Type : Mask_rcnn
• TLT Version : tao 3.22.05

I tested the FPS of my pruned MaskRCNN model and it was lower than that of the unpruned model. The test was done on an A5000 machine, and the GPU compute time increased from ~20 ms to ~28 ms; the engine file was generated with batch size 1. I have done similar experiments with different models trained on the same version of the TAO Toolkit and haven't noticed this behaviour before. The pruned model is roughly 60% of the original model's size, and my assumption is that the lower the number of parameters, the lower the inference time should be.

Could you please double check? If possible, please share the command, log, etc. Thanks.

Following are the commands and logs

trtexec --loadEngine=/opt/paralaxiom/vast/platform/nvast/ds_vast_pipeline/coco_2017_maskrcnn_02_09_24_step_660000.etlt_b1_gpu0_int8.engine

coco_2017_maskrcnn_02_09_24_step_660000.txt (9.0 KB)

trtexec --loadEngine=/opt/paralaxiom/vast/platform/nvast/ds_vast_pipeline/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.etlt_b1_gpu0_int8.engine

coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.txt (8.6 KB)
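For completeness, per-layer timing can show where the pruned engine loses time relative to the unpruned one. A minimal sketch using trtexec's profiling flags (assuming TensorRT 8.x; engine paths shortened):

trtexec --loadEngine=coco_2017_maskrcnn_02_09_24_step_660000.etlt_b1_gpu0_int8.engine --dumpProfile --separateProfileRun
trtexec --loadEngine=coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.etlt_b1_gpu0_int8.engine --dumpProfile --separateProfileRun

Comparing the two per-layer tables shows which layers, if any, became slower after pruning.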

After pruning, did you retrain it?
Retraining is needed. You can check the retraining log to confirm that the number of parameters is reduced.
Then, generate the TensorRT engine again (https://docs.nvidia.com/tao/tao-toolkit-archive/tao-30-2205/text/instance_segmentation/mask_rcnn.html#exporting-the-maskrcnn-model). Please make sure the pruned TensorRT engine is smaller than the unpruned TensorRT engine.
You can also generate fp32 and fp16 engines to narrow down.
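For reference, a minimal sketch of building fp16 and fp32 engines from the .etlt with tao-converter (the key, file names, and the output node names generate_detections and mask_fcn_logits/BiasAdd are assumptions based on the TAO MaskRCNN documentation, not taken from this thread; the 3,832,1344 input size matches the resolution used later in this thread):

$ # fp16 engine from the pruned .etlt (placeholder file names)
$ tao-converter -k $KEY -d 3,832,1344 -o generate_detections,mask_fcn_logits/BiasAdd -t fp16 -m 1 -e pruned_fp16.engine pruned_model.etlt
$ # repeat with -t fp32 for the fp32 engine, then benchmark
$ trtexec --loadEngine=pruned_fp16.engine

Running the same two builds on the unpruned .etlt gives four engines whose trtexec throughput can be compared directly.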

Yes, the model was retrained. The pruned engine file is smaller than the unpruned one (37 MB vs 51 MB), so the model was definitely pruned.

Please generate fp32 engine and fp16 engine to narrow down. Thanks.

Following are the logs for the unpruned and pruned fp16 and fp32 engines, respectively:
coco_2017_maskrcnn_02_09_24_v8_step_660000_fp16.txt (9.2 KB)

coco_2017_maskrcnn_02_09_24_v8_step_660000_fp32.txt (8.4 KB)
coco_2017_maskrcnn_02_09_24_v8_step_660000_pruned_1_step_660000_fp16.etlt_b1_gpu0_fp16.txt (8.6 KB)
coco_2017_maskrcnn_02_09_24_v8_step_660000_pruned_1_step_660000_fp32.etlt_b1_gpu0_fp32.txt (8.3 KB)

The same trend holds: the unpruned model is faster.
Another interesting thing I noticed is that the fp16 engine files are slightly faster than the original int8 engine files, by around 1 FPS.

I have another observation: I ran a different training with only the backbone changed from resnet50 to resnet18, and that model was not pruned. I expected a higher FPS for the int8 engine file, but interestingly this model was slower by around 12 FPS compared to the original unpruned resnet50 model. Following is the trtexec log:
coco_2017_maskrcnn_02_09_24_v10_step_660000.txt (8.4 KB)

Let's give the original problem higher priority; we can solve this issue afterwards.

Hi,
May I know how you generated the engine file? Inside the 3.22.05 docker?
I will first check whether I can reproduce this on the latest docker.

No, on bare metal; the TensorRT version is 8.5.1-1+cuda11.8.

I ran with the docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 and cannot reproduce your result. I ran training → pruning → retraining → generate TensorRT engine → run trtexec to check FPS.
Attaching the log for your reference.
20241007_mask_rcnn_forum_307832.txt (543.0 KB)
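For anyone reproducing this, that pipeline corresponds roughly to the following TAO 3.22.05 commands (a sketch only; the spec files, key, checkpoint names, and pruning threshold are placeholders, and the exact flags should be verified against the MaskRCNN documentation):

$ tao mask_rcnn train -e train_spec.txt -d results/unpruned -k $KEY
$ tao mask_rcnn prune -m results/unpruned/model.step-660000.tlt -o results/pruned/model_pruned.tlt -pth 0.5 -k $KEY
$ # retrain, pointing the spec at the pruned model
$ tao mask_rcnn train -e retrain_spec.txt -d results/retrain -k $KEY
$ tao mask_rcnn export -m results/retrain/model.step-660000.tlt -e retrain_spec.txt -k $KEY --data_type int8
$ trtexec --loadEngine=<generated engine>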

Appreciate that you did the training and FPS testing from your end. We are currently trying to add MaskRCNN to our set of models for future deployments. We have refrained from using the latest 5.x TAO containers because, as mentioned in other forum posts from me and my team, the tfrecords generated in 5.x have issues; this is why we still stick to the 3.22.05 containers, which have served us well so far. Kindly help us understand whether a model implementation difference between the 3.x and 5.x containers could cause this difference in results; I cannot think of any other reason for this behaviour.

One difference is the TensorRT version. In 5.0.0-tf1.15.5, it is 8.5.3.1-1+cuda11.8.

I will test on this TensorRT version and share the results. Does NGC contain TensorRT containers that I can pull and use?

You can go to https://developer.nvidia.com/nvidia-tensorrt-8x-download and find the tar.gz file for the expected version of TensorRT.

For example, for 8.6.1.6, run the following inside the tao docker:

$ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/8.6.1/tars/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
$ tar zxvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
$ pip install TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp38-none-linux_x86_64.whl
$ export LD_LIBRARY_PATH=/home/morganh/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH
$ mask_rcnn export xxx
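To confirm the export picks up the newly installed TensorRT (assuming the wheel was installed as above):

$ python -c "import tensorrt; print(tensorrt.__version__)"
$ echo $LD_LIBRARY_PATH   # the TensorRT-8.6.1.6/lib directory should be listed first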

I am trying to run the trtexec command with the onnx file, after converting it from .etlt following this reference: Lower FPS for engine file with higher batch size vs engine file with lower batch size - #5 by Morganh. I am getting the following error:

[10/09/2024-05:40:18] [I] [TRT] Model version:    0
[10/09/2024-05:40:18] [I] [TRT] Doc string:
[10/09/2024-05:40:18] [I] [TRT] ----------------------------------------------------------------
[10/09/2024-05:40:18] [I] Finished parsing network model. Parse time: 0.15425
[10/09/2024-05:40:18] [E] Cannot find input tensor with name "Input" in the network inputs! Please make sure the input tensor names are correct.
[10/09/2024-05:40:18] [E] Network And Config setup failed
[10/09/2024-05:40:18] [E] Building engine failed
[10/09/2024-05:40:18] [E] Failed to create engine from model or file.
[10/09/2024-05:40:18] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=/workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.onnx --calib=/workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.cal --int8 --saveEngine=/workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.onnx_b1_gpu0_int8.engine --maxShapes=Input:1x3x832x1344 --minShapes=Input:1x3x832x1344 --optShapes=Input:1x3x832x1344

You can find the command in the last line of the error. I checked the documentation (MaskRCNN - NVIDIA Docs), which gives that name for the input layer used in the command.
I am running the command in the tao5.x:tf1.15 docker container.
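The error suggests the input tensor in the converted ONNX is not actually named "Input". One way to check the real name before retrying (an assumption on my side that polygraphy is available; it ships with recent TensorRT releases and can also be pip-installed):

$ polygraphy inspect model /workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.onnx

Whatever input name it reports is what --minShapes/--optShapes/--maxShapes should reference.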

Currently, mask_rcnn can only export to a .uff file, not an .onnx file.
Please use mask_rcnn export xxx to generate the TensorRT engine. You can find the command in 20241007_mask_rcnn_forum_307832.txt (543.0 KB).
BTW, you can also use the command mentioned in TRTEXEC with Mask RCNN - NVIDIA Docs to generate the TensorRT engine.
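For reference, the trtexec route from that doc looks roughly like this for the .uff produced by mask_rcnn export (a sketch; the output node names and 832x1344 input size are assumptions based on the TAO MaskRCNN docs and the resolution used earlier in this thread):

trtexec --uff=model.uff --uffInput=Input,3,832,1344 --output=generate_detections,mask_fcn_logits/BiasAdd --maxBatch=1 --int8 --calib=cal.cache --saveEngine=model_int8.engine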

I am getting "Segmentation fault (core dumped)" when I run the trtexec command inside the docker container. Is it because of a permissions issue?

Can you share the full command and full log?

trtexec --loadEngine=/workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_b1.tlt_gpu0_int8.engine

trtexec_log.txt (4.9 KB)
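One thing worth ruling out (an assumption on my part; the attached log may already show the cause): a serialized TensorRT engine can generally only be deserialized by the TensorRT version that built it, so an engine built on bare metal with TensorRT 8.5.1 may fail or crash when loaded inside a container carrying a different TensorRT. A quick check of both environments:

$ python -c "import tensorrt; print(tensorrt.__version__)"
$ dpkg -l | grep -i tensorrt   # if TensorRT was installed from deb packages

If the versions differ, rebuild the engine in the same environment where trtexec will load it.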