Optimize .NET Real-Time Video Pipeline with Multiple TensorRT Models — Low GPU Utilization & Throughput Bottleneck

We have a .NET pipeline that reads real-time video streams and passes them through the following models sequentially:

### Pipeline Flow

```
Video Stream
     │
     ▼
┌─────────────────┐
│  YOLO Model 1   │  (Detection)
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────┐
│  YOLO Model 2a  │ ──► │  YOLO Model 2b  │  (Cascaded Detection)
└─────────────────┘     └────────┬────────┘
                                 │  The four models below run sequentially
         ┌───────────────────────┼───────────────────────┬─────────────────────┐
         ▼                       ▼                       ▼                     ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│    Feature      │   │   Regression    │   │   Classifier    │   │    Keypoint     │
│   Extraction    │   │     Model       │   │     Model       │   │   Detection     │  (Lightweight)
└─────────────────┘   └─────────────────┘   └─────────────────┘   └─────────────────┘
```
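In code terms, each frame walks through the stages strictly one after another. A stubbed-out sketch of that control flow (all names here are illustrative placeholders; the real stages each do OpenCvSharp preprocessing plus a TensorRT inference call):

```csharp
using System;

class FlowSketch
{
    // Stub standing in for "preprocess + TensorRT inference" of one stage.
    static string Run(string stage, string input) => input + ">" + stage;

    static void Main()
    {
        string frame = "frame";
        string det1  = Run("yolo1", frame);    // Detection
        string det2a = Run("yolo2a", det1);    // Cascade, stage a
        string det2b = Run("yolo2b", det2a);   // Cascade, stage b

        // The four downstream models each consume det2b, strictly in sequence:
        foreach (var stage in new[] { "feature", "regression", "classifier", "keypoint" })
            Console.WriteLine(Run(stage, det2b));
    }
}
```

The point of the sketch: nothing overlaps, so while the CPU preprocesses, the GPU is idle, and vice versa.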

### Technical Stack

| Component | Technology |
|-----------|------------|
| Runtime | .NET 8 |
| Inference | TensorRT (FP16) |
| Preprocessing | OpenCvSharp |
| Batch Processing | ✅ Implemented |


---

The models run in FP16 using TensorRT, and each model needs preprocessing before inference (usually done with OpenCvSharp).
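The preprocessing for each model ends with repacking OpenCvSharp's interleaved HWC buffer into the planar CHW layout that TensorRT engines typically expect. That repack step in isolation (pure arrays; the OpenCvSharp resize/normalize calls are omitted) looks roughly like this:

```csharp
using System;

class Preprocess
{
    // Repack an interleaved HWC float image (as OpenCvSharp stores it)
    // into planar CHW, the layout TensorRT engines typically expect.
    static float[] HwcToChw(float[] hwc, int height, int width, int channels)
    {
        var chw = new float[hwc.Length];
        for (int c = 0; c < channels; c++)
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                    chw[c * height * width + y * width + x] =
                        hwc[(y * width + x) * channels + c];
        return chw;
    }

    static void Main()
    {
        // 1x2 image, 3 channels: pixel0 = (1,2,3), pixel1 = (4,5,6)
        var hwc = new float[] { 1, 2, 3, 4, 5, 6 };
        var chw = HwcToChw(hwc, height: 1, width: 2, channels: 3);
        Console.WriteLine(string.Join(",", chw));   // 1,4,2,5,3,6
    }
}
```

Per frame and per model this copy runs on the CPU, which is part of why I suspect the bottleneck is outside the GPU.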

I was able to modify all the models to process images in batches. However, I noticed that GPU utilization is low and the frame rate is below the expected throughput.
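By "batches" I mean accumulating preprocessed tensors and issuing one inference call per full batch. A minimal self-contained sketch of that pattern (illustrative names; the actual TensorRT execution is stubbed out):

```csharp
using System;
using System.Collections.Generic;

class Batcher
{
    const int BatchSize = 4;
    static readonly List<float[]> Pending = new();

    // Stub standing in for one TensorRT execution over a full batch.
    static int RunBatch(List<float[]> batch) => batch.Count;

    static int Submit(float[] tensor)
    {
        Pending.Add(tensor);
        if (Pending.Count < BatchSize) return 0;   // wait for more frames
        int processed = RunBatch(Pending);
        Pending.Clear();
        return processed;
    }

    static void Main()
    {
        int total = 0;
        for (int i = 0; i < 10; i++)
            total += Submit(new float[8]);          // 10 frames, batches of 4
        Console.WriteLine(total);                   // prints 8 (last 2 still pending)
    }
}
```

Note that frames sit in the buffer until the batch fills, so batching alone doesn't keep the GPU busy between flushes.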

I believe there is a bottleneck in my flow that should be eliminated.

I tried using Triton Inference Server, but in my tests it turned out slower than loading and serving the models directly.

The issue is especially apparent on bigger GPUs. On my laptop's RTX 4050, GPU usage was only about 50% across more than 6 concurrent streams, with a total throughput of 12.5 FPS; on a larger L4 GPU, the same pipeline yielded roughly 10% GPU usage at approximately the same total throughput.

What improvements can be made?