Optimize .NET Real-Time Video Pipeline with Multiple TensorRT Models — Low GPU Utilization & Throughput Bottleneck

We have a .NET pipeline that reads real-time video streams and passes them through the following models sequentially:

### Pipeline Flow

```
Video Stream
     │
     ▼
┌─────────────────┐
│  YOLO Model 1   │  (Detection)
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────┐
│  YOLO Model 2a  │ ──► │  YOLO Model 2b  │  (Cascaded Detection)
└─────────────────┘     └────────┬────────┘
                                 │  The four models below run sequentially
         ┌───────────────────────┼───────────────────────┬─────────────────────┐
         ▼                       ▼                       ▼                     ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│    Feature      │   │   Regression    │   │   Classifier    │   │    Keypoint     │
│   Extraction    │   │     Model       │   │     Model       │   │   Detection     │  (Lightweight)
└─────────────────┘   └─────────────────┘   └─────────────────┘   └─────────────────┘
```
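In code terms, each frame walks through the stages strictly one after another. A stubbed-out sketch of that control flow (all names here are illustrative placeholders; the real stages each do OpenCvSharp preprocessing plus a TensorRT inference call):

```csharp
using System;

class FlowSketch
{
    // Stub standing in for "preprocess + TensorRT inference" of one stage.
    static string Run(string stage, string input) => input + ">" + stage;

    static void Main()
    {
        string frame = "frame";
        string det1  = Run("yolo1", frame);    // Detection
        string det2a = Run("yolo2a", det1);    // Cascade, stage a
        string det2b = Run("yolo2b", det2a);   // Cascade, stage b

        // The four downstream models each consume det2b, strictly in sequence:
        foreach (var stage in new[] { "feature", "regression", "classifier", "keypoint" })
            Console.WriteLine(Run(stage, det2b));
    }
}
```

The point of the sketch: nothing overlaps, so while the CPU preprocesses, the GPU is idle, and vice versa.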

### Technical Stack

| Component | Technology |
|-----------|------------|
| Runtime | .NET 8 |
| Inference | TensorRT (FP16) |
| Preprocessing | OpenCvSharp |
| Batch Processing | ✅ Implemented |


---

The models run in FP16 using TensorRT, and each model needs preprocessing before inference (usually done with OpenCvSharp).
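The preprocessing for each model ends with repacking OpenCvSharp's interleaved HWC buffer into the planar CHW layout that TensorRT engines typically expect. That repack step in isolation (pure arrays; the OpenCvSharp resize/normalize calls are omitted) looks roughly like this:

```csharp
using System;

class Preprocess
{
    // Repack an interleaved HWC float image (as OpenCvSharp stores it)
    // into planar CHW, the layout TensorRT engines typically expect.
    static float[] HwcToChw(float[] hwc, int height, int width, int channels)
    {
        var chw = new float[hwc.Length];
        for (int c = 0; c < channels; c++)
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                    chw[c * height * width + y * width + x] =
                        hwc[(y * width + x) * channels + c];
        return chw;
    }

    static void Main()
    {
        // 1x2 image, 3 channels: pixel0 = (1,2,3), pixel1 = (4,5,6)
        var hwc = new float[] { 1, 2, 3, 4, 5, 6 };
        var chw = HwcToChw(hwc, height: 1, width: 2, channels: 3);
        Console.WriteLine(string.Join(",", chw));   // 1,4,2,5,3,6
    }
}
```

Per frame and per model this copy runs on the CPU, which is part of why I suspect the bottleneck is outside the GPU.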

I was able to modify all the models to process images in batches. However, I noticed that GPU utilization is low and the frame rate is below the expected throughput.
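By "batches" I mean accumulating preprocessed tensors and issuing one inference call per full batch. A minimal self-contained sketch of that pattern (illustrative names; the actual TensorRT execution is stubbed out):

```csharp
using System;
using System.Collections.Generic;

class Batcher
{
    const int BatchSize = 4;
    static readonly List<float[]> Pending = new();

    // Stub standing in for one TensorRT execution over a full batch.
    static int RunBatch(List<float[]> batch) => batch.Count;

    static int Submit(float[] tensor)
    {
        Pending.Add(tensor);
        if (Pending.Count < BatchSize) return 0;   // wait for more frames
        int processed = RunBatch(Pending);
        Pending.Clear();
        return processed;
    }

    static void Main()
    {
        int total = 0;
        for (int i = 0; i < 10; i++)
            total += Submit(new float[8]);          // 10 frames, batches of 4
        Console.WriteLine(total);                   // prints 8 (last 2 still pending)
    }
}
```

Note that frames sit in the buffer until the batch fills, so batching alone doesn't keep the GPU busy between flushes.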

I believe there is a bottleneck in my flow that should be eliminated.

I tried using Triton Inference Server, but in my tests it turned out slower than loading and serving the models directly.

The issue is especially apparent on bigger GPUs. On my laptop's RTX 4050, GPU usage was only about 50% across more than 6 concurrent streams, with a total throughput of 12.5 FPS; on a larger L4 GPU, the same pipeline yielded roughly 10% GPU usage at approximately the same total throughput.

What improvements can be made?