Description
We are trying to deploy multiple real-time instance segmentation models for robotic systems, processing inputs from multiple camera feeds simultaneously. We are testing the deployment on an NVIDIA RTX 4090 GPU with CUDA 12.4. We are running two models at the same time, one with 11.8M parameters and the other with 23.7M parameters. Each model performs inference on 4 live camera streams, using batched inference to predict on all 4 streams at once. The two models run in two separate ROS 2 nodes. When either model runs individually, predictions come in at around 160 FPS for the small model and around 90-100 FPS for the large one. But when both models run together, the large model drops to roughly 50-60 FPS and the small one to roughly 70-80 FPS. This hurts the overall performance of our pipeline, and once we account for the postprocessing we do on the predictions, our 60 FPS requirement becomes hard to meet.
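To clarify what I mean by batched inference, the pattern per model is roughly like this sketch (the input resolution and normalization here are placeholders, not our exact preprocessing):

```python
import cv2
import numpy as np
import torch

def build_batch(frames, size=(640, 640)):
    """Stack one frame from each of the 4 cameras into a single NCHW batch.

    `frames` is a list of 4 HxWx3 uint8 images; `size` is a placeholder
    input resolution, not necessarily the one we use.
    """
    resized = [cv2.resize(f, size) for f in frames]
    batch = np.stack(resized)                                    # 4 x H x W x 3
    batch = torch.from_numpy(batch).permute(0, 3, 1, 2).float()  # 4 x 3 x H x W
    return (batch / 255.0).contiguous().cuda()                   # one forward pass covers all 4 streams
```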
GPU utilization when running both models together is around 74%. We believe the drop may be due to context switching on the GPU when the two models run in parallel. At the same time, I'm worried that if I use CUDA streams or manually partition the GPU, each model's individual performance will drop because it will have fewer GPU resources to work with.
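For context, this is roughly the single-process, two-stream variant I'm hesitant to switch to (engine file names are placeholders, and the sketch assumes static batch-4 engines with FP32 I/O bindings, which may not match your setup):

```python
import tensorrt as trt
import torch

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    # Deserialize a prebuilt TensorRT engine (the file names used below are placeholders).
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

class TrtModel:
    """One engine + one execution context + one dedicated CUDA stream."""

    def __init__(self, engine_path):
        self.engine = load_engine(engine_path)
        self.context = self.engine.create_execution_context()
        self.stream = torch.cuda.Stream()
        self.buffers = {}
        # Assumes the engine was built with a static batch size of 4 (one image per
        # camera) and FP32 bindings, so all tensor shapes are known up front.
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = tuple(self.engine.get_tensor_shape(name))
            buf = torch.empty(shape, dtype=torch.float32, device="cuda")
            self.buffers[name] = buf
            self.context.set_tensor_address(name, buf.data_ptr())
        # Assumes tensor 0 is the (single) input binding.
        self.input_name = self.engine.get_tensor_name(0)

    def infer_async(self, batch):
        # Enqueue the copy and the inference on this model's own stream so the two
        # models' kernels can overlap instead of serializing on the default stream.
        with torch.cuda.stream(self.stream):
            self.buffers[self.input_name].copy_(batch, non_blocking=True)
            self.context.execute_async_v3(self.stream.cuda_stream)
        return self.stream  # caller synchronizes this stream when it needs the outputs

# Both contexts would live in one process instead of two separate ROS 2 nodes.
small = TrtModel("small_11M.engine")
large = TrtModel("large_23M.engine")
```

The other option I've seen suggested for keeping the two ROS 2 nodes as separate processes is NVIDIA MPS (started with `nvidia-cuda-mps-control -d`), which is meant to let kernels from both processes share the GPU without full context switches, but I haven't measured whether it preserves the single-model throughput.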
Environment
TensorRT Version: 10.5
GPU Type: RTX 4090
Nvidia Driver Version: 540.45
CUDA Version: 12.4
CUDNN Version: not used
Operating System + Version: Ubuntu 22.04 with ROS 2
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.1
Baremetal or Container (if container which image + tag):
If anyone has experienced something like this, has general experience optimizing multi-model inference on GPUs, or has suggestions for configurations, tools, or techniques to improve real-time performance, I would greatly appreciate your insights. Thank you in advance for your help!