Just Sharing!
For those planning to use DeepStream/Triton Server with YOLOv9, I highly recommend quantizing the model (with fine-tuning, i.e. QAT) for improved performance in TensorRT. I have created a repository that adds this quantization support specifically for TensorRT.
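For context, "quantizing (fine-tuning)" here means a QAT-style flow: insert fake-quant (Q/DQ) nodes, calibrate on a few hundred images, fine-tune briefly so the weights adapt to INT8 noise, and export to ONNX so TensorRT can build a true INT8 engine. The repository automates this; the sketch below only illustrates the general idea with NVIDIA's pytorch-quantization toolkit, and `load_yolov9`, `calib_loader`, and the commented fine-tune step are placeholders rather than the repo's actual API.

```python
# Rough QAT-style sketch with NVIDIA's pytorch-quantization toolkit.
# Placeholder names throughout (load_yolov9, calib_loader); the real training
# loop and model loading live in the YOLOv9 code base, not here.
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()                  # patch torch.nn so Conv/Linear get Q/DQ wrappers

model = load_yolov9("yolov9-c.pt").cuda()   # placeholder loader
model.eval()

# 1) Calibration: collect activation ranges (amax) on a small calibration set.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()

with torch.no_grad():
    for i, (images, _) in enumerate(calib_loader):   # placeholder dataloader
        model(images.cuda())
        if i >= 100:
            break

for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.load_calib_amax()
        m.enable_quant()
        m.disable_calib()

# 2) Fine-tune briefly at a low learning rate so the weights adapt to INT8 noise
#    (this is the "fine-tune" part; use the normal YOLOv9 training loop here).
# finetune_one_epoch(model, train_loader)

# 3) Export with QuantizeLinear/DequantizeLinear nodes so TensorRT can build a real INT8 engine.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 640, 640, device="cuda")
torch.onnx.export(model, dummy, "yolov9-c-qat.onnx",
                  opset_version=13, input_names=["images"], output_names=["outputs"])
```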
Below are performance and accuracy tables summarizing the benefits of this approach:
Performance / Accuracy
TensorRT version: 10.0.0
Accuracy Report

Model: YOLOv9-C

Evaluation Results
| Eval Model | AP | AP50 | Precision | Recall |
|---|---|---|---|---|
| Origin (PyTorch) | 0.529 | 0.699 | 0.743 | 0.634 |
| INT8 (TensorRT) | 0.527 | 0.695 | 0.746 | 0.627 |
Evaluation Comparison
| Eval Model | AP | AP50 | Precision | Recall |
|---|---|---|---|---|
| INT8 (TensorRT) vs Origin (PyTorch) | -0.002 | -0.004 | +0.003 | -0.007 |
Latency/Throughput Report using only TensorRT
Device
| GPU | |
|---|---|
| Device | NVIDIA GeForce RTX 4090 |
| Compute Capability | 8.9 |
| SMs | 128 |
| Device Global Memory | 24207 MiB |
| Application Compute Clock Rate | 2.58 GHz |
| Application Memory Clock Rate | 10.501 GHz |
Latency/Throughput
| Model Name | Batch Size | Latency (99%) | Throughput (qps) | Total Inferences (IPS) |
|---|---|---|---|---|
| YOLOv9-C (FP16) | 1 | 1.25 ms | 803 | 803 |
| | 4 | 3.37 ms | 300 | 1200 |
| | 8 | 6.6 ms | 153 | 1224 |
| | 12 | 10 ms | 99 | 1188 |
| YOLOv9-C (INT8) | 1 | 0.99 ms | 1006 | 1006 |
| | 4 | 2.12 ms | 473 | 1892 |
| | 8 | 3.84 ms | 261 | 2088 |
| | 12 | 5.59 ms | 178 | 2136 |
Latency/Throughput Comparison
| Model Name | Batch Size | Latency (99%) | Throughput (qps) | Total Inferences |
|---|---|---|---|---|
| INT8 vs FP16 | 1 | -20.8% | +25.2% | +25.2% |
| | 4 | -37.1% | +57.7% | +57.7% |
| | 8 | -41.1% | +70.6% | +70.6% |
| | 12 | -46.9% | +79.8% | +78.9% |
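For anyone reproducing numbers like these: once the Q/DQ ONNX is exported, the INT8 engine can be built with `trtexec --int8 --fp16 --onnx=...` or with the TensorRT Python API. Below is a minimal sketch of the latter; the file names, the `images` input name, and the 640x640 shapes are illustrative assumptions, not taken from the repository.

```python
# Minimal TensorRT 10 engine-build sketch (illustrative paths/shapes, not the repo's script).
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(0)        # explicit batch is the default in TensorRT 10
parser = trt.OnnxParser(network, logger)

with open("yolov9-c-qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)      # honor the Q/DQ nodes from QAT
config.set_flag(trt.BuilderFlag.FP16)      # let non-quantized layers run in FP16

# Optimization profile covering the batch sizes benchmarked above (1..12).
profile = builder.create_optimization_profile()
profile.set_shape("images", (1, 3, 640, 640), (4, 3, 640, 640), (12, 3, 640, 640))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("yolov9-c-qat-int8.engine", "wb") as f:
    f.write(engine_bytes)
```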