Lower FPS compared to the unpruned model for the pruned MaskRCNN model

• Hardware : A5000
• Network Type : Mask_rcnn
• TLT Version : tao 3.22.05

I tested the FPS of my pruned MaskRCNN model and it was lower than that of the unpruned model. The test was done on an A5000 machine, and the GPU compute time increased from ~20 ms to ~28 ms; the engine file was generated with batch size 1. I have done similar experiments with different models trained on the same version of the TAO Toolkit and haven't noticed this behaviour before. The pruned model is roughly 60% of the original model's size, and my assumption is that the lower the number of parameters, the lower the inference time should be.

Could you please double check? If possible, please share the command, log, etc. Thanks.

Following are the commands and logs

trtexec --loadEngine=/opt/paralaxiom/vast/platform/nvast/ds_vast_pipeline/coco_2017_maskrcnn_02_09_24_step_660000.etlt_b1_gpu0_int8.engine

coco_2017_maskrcnn_02_09_24_step_660000.txt (9.0 KB)

trtexec --loadEngine=/opt/paralaxiom/vast/platform/nvast/ds_vast_pipeline/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.etlt_b1_gpu0_int8.engine

coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.txt (8.6 KB)
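For completeness, per-layer timing can show where the pruned engine loses time relative to the unpruned one. A minimal sketch using trtexec's profiling flags (assuming TensorRT 8.x; engine paths shortened):

trtexec --loadEngine=coco_2017_maskrcnn_02_09_24_step_660000.etlt_b1_gpu0_int8.engine --dumpProfile --separateProfileRun
trtexec --loadEngine=coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.etlt_b1_gpu0_int8.engine --dumpProfile --separateProfileRun

Comparing the two per-layer tables shows which layers, if any, became slower after pruning.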

After pruning, did you retrain it?
Retraining is needed. You can check the retraining log to confirm that the number of parameters is reduced.
Then, generate the TensorRT engine again (https://docs.nvidia.com/tao/tao-toolkit-archive/tao-30-2205/text/instance_segmentation/mask_rcnn.html#exporting-the-maskrcnn-model). Please make sure the pruned TensorRT engine is smaller than the unpruned TensorRT engine.
You can also generate fp32 and fp16 engines to narrow down.
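For reference, a minimal sketch of building fp16 and fp32 engines from the .etlt with tao-converter (the key, file names, and the output node names generate_detections and mask_fcn_logits/BiasAdd are assumptions based on the TAO MaskRCNN documentation, not taken from this thread; the 3,832,1344 input size matches the resolution used later in this thread):

$ # fp16 engine from the pruned .etlt (placeholder file names)
$ tao-converter -k $KEY -d 3,832,1344 -o generate_detections,mask_fcn_logits/BiasAdd -t fp16 -m 1 -e pruned_fp16.engine pruned_model.etlt
$ # repeat with -t fp32 for the fp32 engine, then benchmark
$ trtexec --loadEngine=pruned_fp16.engine

Running the same two builds on the unpruned .etlt gives four engines whose trtexec throughput can be compared directly.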

Yes, the model was retrained. The pruned engine file is smaller than the unpruned one (37 MB vs 51 MB), so the model was definitely pruned.

Please generate fp32 engine and fp16 engine to narrow down. Thanks.

Following are the logs for the unpruned and pruned fp16 and fp32 engines, respectively:
coco_2017_maskrcnn_02_09_24_v8_step_660000_fp16.txt (9.2 KB)

coco_2017_maskrcnn_02_09_24_v8_step_660000_fp32.txt (8.4 KB)
coco_2017_maskrcnn_02_09_24_v8_step_660000_pruned_1_step_660000_fp16.etlt_b1_gpu0_fp16.txt (8.6 KB)
coco_2017_maskrcnn_02_09_24_v8_step_660000_pruned_1_step_660000_fp32.etlt_b1_gpu0_fp32.txt (8.3 KB)

The same trend holds: the unpruned model is faster.
Another interesting thing I noticed is that the fp16 engine files are slightly faster than the original int8 engine files, by around 1 FPS.

I have another observation: I ran a different training with only the backbone changed from resnet50 to resnet18, and that model was not pruned. I expected a higher FPS for the int8 engine file, but interestingly this model was slower by around 12 FPS compared to the original unpruned resnet50 model. Following is the trtexec log:
coco_2017_maskrcnn_02_09_24_v10_step_660000.txt (8.4 KB)

Let's give the original problem higher priority; we can solve this issue afterwards.

Hi,
May I know how you generated the engine file? Inside the 3.22.05 docker?
I will first check whether I can reproduce this on the latest docker.

No, on bare metal; the TensorRT version is 8.5.1-1+cuda11.8.

I ran with the docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 and cannot reproduce your result. I ran training → pruning → retraining → generate TensorRT engine → run trtexec to check FPS.
Attaching the log for your reference.
20241007_mask_rcnn_forum_307832.txt (543.0 KB)
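For anyone reproducing this, that pipeline corresponds roughly to the following TAO 3.22.05 commands (a sketch only; the spec files, key, checkpoint names, and pruning threshold are placeholders, and the exact flags should be verified against the MaskRCNN documentation):

$ tao mask_rcnn train -e train_spec.txt -d results/unpruned -k $KEY
$ tao mask_rcnn prune -m results/unpruned/model.step-660000.tlt -o results/pruned/model_pruned.tlt -pth 0.5 -k $KEY
$ # retrain, pointing the spec at the pruned model
$ tao mask_rcnn train -e retrain_spec.txt -d results/retrain -k $KEY
$ tao mask_rcnn export -m results/retrain/model.step-660000.tlt -e retrain_spec.txt -k $KEY --data_type int8
$ trtexec --loadEngine=<generated engine>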

Appreciate that you did the training and FPS testing from your end. We are currently trying to add MaskRCNN to our set of models for future deployments. We have refrained from using the latest 5.x TAO containers because, as mentioned in other forum posts from me and my team, the tfrecords generated in 5.x have issues; this is why we still stick to the 3.22.05 containers, which have served us well so far. Kindly help us understand whether a model implementation difference between the 3.x and 5.x containers could cause this difference in results; I cannot think of any other reason for this behaviour.

One difference is the TensorRT version. In 5.0.0-tf1.15.5, it is 8.5.3.1-1+cuda11.8.

I will test on this TensorRT version and share the results. Does NGC contain TensorRT containers that I can pull and use?

You can go to https://developer.nvidia.com/nvidia-tensorrt-8x-download and find the tar.gz file for the expected version of TensorRT.

For example, for 8.6.1.6, run the following inside the tao docker:

$ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/8.6.1/tars/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
$ tar zxvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
$ pip install TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp38-none-linux_x86_64.whl
$ export LD_LIBRARY_PATH=/home/morganh/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH
$ mask_rcnn export xxx
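To confirm the export picks up the newly installed TensorRT (assuming the wheel was installed as above):

$ python -c "import tensorrt; print(tensorrt.__version__)"
$ echo $LD_LIBRARY_PATH   # the TensorRT-8.6.1.6/lib directory should be listed first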

I am trying to run the trtexec command with the onnx file, after converting it from .etlt following this reference: Lower FPS for engine file with higher batch size vs engine file with lower batch size - #5 by Morganh. I am getting the following error:

[10/09/2024-05:40:18] [I] [TRT] Model version:    0
[10/09/2024-05:40:18] [I] [TRT] Doc string:
[10/09/2024-05:40:18] [I] [TRT] ----------------------------------------------------------------
[10/09/2024-05:40:18] [I] Finished parsing network model. Parse time: 0.15425
[10/09/2024-05:40:18] [E] Cannot find input tensor with name "Input" in the network inputs! Please make sure the input tensor names are correct.
[10/09/2024-05:40:18] [E] Network And Config setup failed
[10/09/2024-05:40:18] [E] Building engine failed
[10/09/2024-05:40:18] [E] Failed to create engine from model or file.
[10/09/2024-05:40:18] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=/workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.onnx --calib=/workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.cal --int8 --saveEngine=/workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.onnx_b1_gpu0_int8.engine --maxShapes=Input:1x3x832x1344 --minShapes=Input:1x3x832x1344 --optShapes=Input:1x3x832x1344

You can find the command in the last line of the error. I checked the documentation (MaskRCNN - NVIDIA Docs), which gives that name for the input layer used in the command.
I am running the command in the tao5.x:tf1.15 docker container.
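The error suggests the input tensor in the converted ONNX is not actually named "Input". One way to check the real name before retrying (an assumption on my side that polygraphy is available; it ships with recent TensorRT releases and can also be pip-installed):

$ polygraphy inspect model /workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_pruned_1_step_660000.onnx

Whatever input name it reports is what --minShapes/--optShapes/--maxShapes should reference.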

Currently, mask_rcnn can only export to a .uff file, not an .onnx file.
Please use mask_rcnn export xxx to generate the TensorRT engine. You can find the command in 20241007_mask_rcnn_forum_307832.txt (543.0 KB).
BTW, you can also use the command mentioned in TRTEXEC with Mask RCNN - NVIDIA Docs to generate the TensorRT engine.
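For reference, the trtexec route from that doc looks roughly like this for the .uff produced by mask_rcnn export (a sketch; the output node names and 832x1344 input size are assumptions based on the TAO MaskRCNN docs and the resolution used earlier in this thread):

trtexec --uff=model.uff --uffInput=Input,3,832,1344 --output=generate_detections,mask_fcn_logits/BiasAdd --maxBatch=1 --int8 --calib=cal.cache --saveEngine=model_int8.engine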

I am getting "Segmentation fault (core dumped)" when I run the trtexec command inside the docker container. Is it because of a permissions issue?

Can you share the full command and full log?

trtexec --loadEngine=/workspace/tao-experiments/output/etlt/coco_2017_maskrcnn_02_09_24_step_660000_b1.tlt_gpu0_int8.engine

trtexec_log.txt (4.9 KB)
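One thing worth ruling out (an assumption on my part; the attached log may already show the cause): a serialized TensorRT engine can generally only be deserialized by the TensorRT version that built it, so an engine built on bare metal with TensorRT 8.5.1 may fail or crash when loaded inside a container carrying a different TensorRT. A quick check of both environments:

$ python -c "import tensorrt; print(tensorrt.__version__)"
$ dpkg -l | grep -i tensorrt   # if TensorRT was installed from deb packages

If the versions differ, rebuild the engine in the same environment where trtexec will load it.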