Description
We are trying to deploy multiple real-time instance segmentation models for robotic systems, processing inputs from multiple camera feeds simultaneously. We are testing the deployment on an NVIDIA RTX 4090 GPU with CUDA 12.4. We are running two models at the same time, one with 11.8M parameters and the other with 23.7M parameters. Each model performs inference on 4 live camera streams, using batched inference to predict on all 4 streams at once. The two models run in two separate ROS 2 nodes. When either model runs individually, predictions come in at around 160 FPS for the small model and around 90-100 FPS for the large one. But when both models run together, the large model drops to roughly 50-60 FPS and the small one to roughly 70-80 FPS. This hurts the overall performance of our pipeline, and once we account for the postprocessing we do on the predictions, our 60 FPS requirement becomes hard to meet.
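To clarify what I mean by batched inference, the pattern per model is roughly like this sketch (the input resolution and normalization here are placeholders, not our exact preprocessing):

```python
import cv2
import numpy as np
import torch

def build_batch(frames, size=(640, 640)):
    """Stack one frame from each of the 4 cameras into a single NCHW batch.

    `frames` is a list of 4 HxWx3 uint8 images; `size` is a placeholder
    input resolution, not necessarily the one we use.
    """
    resized = [cv2.resize(f, size) for f in frames]
    batch = np.stack(resized)                                    # 4 x H x W x 3
    batch = torch.from_numpy(batch).permute(0, 3, 1, 2).float()  # 4 x 3 x H x W
    return (batch / 255.0).contiguous().cuda()                   # one forward pass covers all 4 streams
```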
GPU utilization when running both models together is around 74%. We believe the drop may be due to context switching on the GPU when the two models run in parallel. At the same time, I'm worried that if I use CUDA streams or manually partition the GPU, each model's individual performance will drop because it will have fewer GPU resources to work with.
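For context, this is roughly the single-process, two-stream variant I'm hesitant to switch to (engine file names are placeholders, and the sketch assumes static batch-4 engines with FP32 I/O bindings, which may not match your setup):

```python
import tensorrt as trt
import torch

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    # Deserialize a prebuilt TensorRT engine (the file names used below are placeholders).
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

class TrtModel:
    """One engine + one execution context + one dedicated CUDA stream."""

    def __init__(self, engine_path):
        self.engine = load_engine(engine_path)
        self.context = self.engine.create_execution_context()
        self.stream = torch.cuda.Stream()
        self.buffers = {}
        # Assumes the engine was built with a static batch size of 4 (one image per
        # camera) and FP32 bindings, so all tensor shapes are known up front.
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = tuple(self.engine.get_tensor_shape(name))
            buf = torch.empty(shape, dtype=torch.float32, device="cuda")
            self.buffers[name] = buf
            self.context.set_tensor_address(name, buf.data_ptr())
        # Assumes tensor 0 is the (single) input binding.
        self.input_name = self.engine.get_tensor_name(0)

    def infer_async(self, batch):
        # Enqueue the copy and the inference on this model's own stream so the two
        # models' kernels can overlap instead of serializing on the default stream.
        with torch.cuda.stream(self.stream):
            self.buffers[self.input_name].copy_(batch, non_blocking=True)
            self.context.execute_async_v3(self.stream.cuda_stream)
        return self.stream  # caller synchronizes this stream when it needs the outputs

# Both contexts would live in one process instead of two separate ROS 2 nodes.
small = TrtModel("small_11M.engine")
large = TrtModel("large_23M.engine")
```

The other option I've seen suggested for keeping the two ROS 2 nodes as separate processes is NVIDIA MPS (started with `nvidia-cuda-mps-control -d`), which is meant to let kernels from both processes share the GPU without full context switches, but I haven't measured whether it preserves the single-model throughput.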
Environment
TensorRT Version: 10.5
GPU Type: RTX 4090
Nvidia Driver Version: 540.45
CUDA Version: 12.4
CUDNN Version: not used
Operating System + Version: Ubuntu 22.04 with ROS 2
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.1
Baremetal or Container (if container which image + tag):
If anyone has experienced something like this, has general experience optimizing multi-model inference on GPUs, or has suggestions for configurations, tools, or techniques to improve real-time performance, I would greatly appreciate your insights. Thank you in advance for your help!