DeepStream/Triton Server - YOLOv9 QAT

Just Sharing!

For those planning to use DeepStream/Triton Server with YOLOv9, I highly recommend quantizing the model (with QAT fine-tuning) for improved TensorRT performance. I have created a repository that adds quantization support specifically for TensorRT.

Below, you’ll find a performance table summarizing the benefits of this approach:
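Since the post itself only shows results, here is a minimal sketch of what a QAT flow for TensorRT typically looks like with NVIDIA's pytorch-quantization toolkit. `build_yolov9()` and `calib_loader` are hypothetical placeholders, not the repository's actual code; see the repo for the real YOLOv9 integration.

```python
# Sketch of a QAT flow with NVIDIA's pytorch-quantization toolkit.
# build_yolov9() and calib_loader are placeholders for your own model
# constructor and calibration DataLoader.
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Monkey-patch torch.nn layers so fake-quantized versions are used
# when the model is constructed.
quant_modules.initialize()

model = build_yolov9()  # placeholder: construct/load YOLOv9 here
model.cuda().eval()

# 1) Calibration: run a few hundred images through the model while the
#    quantizers only collect activation statistics.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()
with torch.no_grad():
    for images, _ in calib_loader:  # placeholder DataLoader
        model(images.cuda())
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.disable_calib()
        module.enable_quant()

# 2) Fine-tune (QAT) for a few epochs at a small learning rate to
#    recover accuracy lost to quantization (training loop omitted).

# 3) Export with Q/DQ nodes so TensorRT can build a true INT8 engine.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 640, 640, device="cuda")
torch.onnx.export(model, dummy, "yolov9-qat.onnx", opset_version=13)
```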

Performance / Accuracy

TensorRT version: 10.0.0

Model: YOLOv9-C-converted

Accuracy Report (model: YOLOv9-C)

Evaluation Results

| Eval Model | AP | AP50 | Precision | Recall |
| --- | --- | --- | --- | --- |
| Origin (PyTorch) | 0.529 | 0.699 | 0.743 | 0.634 |
| INT8 (TensorRT) | 0.527 | 0.695 | 0.746 | 0.627 |

Evaluation Comparison

| Eval Model | AP | AP50 | Precision | Recall |
| --- | --- | --- | --- | --- |
| INT8 (TensorRT) vs Origin (PyTorch) | -0.002 | -0.004 | +0.003 | -0.007 |
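For anyone wanting to reproduce the AP/AP50 side of these tables, a minimal pycocotools sketch is below; the file paths are placeholders, and the precision/recall columns would come from the detector's own validation script rather than from COCOeval.

```python
# Minimal COCO mAP check for exported detections.
# File names are placeholders; detections.json must be in COCO
# detection-results format (image_id, category_id, bbox, score).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")  # ground truth
coco_dt = coco_gt.loadRes("detections.json")          # model outputs

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP (IoU=0.50:0.95) and AP50, among others
```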

Latency/Throughput Report using only TensorRT

Device

| Property | Value |
| --- | --- |
| Device | NVIDIA GeForce RTX 4090 |
| Compute Capability | 8.9 |
| SMs | 128 |
| Device Global Memory | 24207 MiB |
| Application Compute Clock Rate | 2.58 GHz |
| Application Memory Clock Rate | 10.501 GHz |

Latency/Throughput

| Model Name | Batch Size | Latency (99%) | Throughput (qps) | Total Inferences (IPS) |
| --- | --- | --- | --- | --- |
| YOLOv9-C-converted (FP16) | 1 | 1.25 ms | 803 | 803 |
| | 4 | 3.37 ms | 300 | 1200 |
| | 8 | 6.6 ms | 153 | 1224 |
| | 12 | 10 ms | 99 | 1188 |
| YOLOv9-C-converted (INT8) | 1 | 0.99 ms | 1006 | 1006 |
| | 4 | 2.12 ms | 473 | 1892 |
| | 8 | 3.84 ms | 261 | 2088 |
| | 12 | 5.59 ms | 178 | 2136 |

Latency/Throughput Comparison

| Model Name | Batch Size | Latency (99%) | Throughput (qps) | Total Inferences |
| --- | --- | --- | --- | --- |
| INT8 vs FP16 | 1 | -20.8% | +25.2% | +25.2% |
| | 4 | -37.1% | +57.7% | +57.7% |
| | 8 | -41.1% | +70.6% | +70.6% |
| | 12 | -46.9% | +79.8% | +78.9% |
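For reference, building the INT8 engine from the QAT ONNX is straightforward because the scales live in the Q/DQ nodes, so no calibrator is needed at build time. A minimal sketch with the TensorRT Python API follows; the file names are placeholders.

```python
# Build an INT8 engine from a QAT ONNX model (with Q/DQ nodes).
# File names are placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch (TensorRT 10 default)
parser = trt.OnnxParser(network, logger)

with open("yolov9-qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honor the Q/DQ scales
config.set_flag(trt.BuilderFlag.FP16)  # FP16 fallback for non-quantized layers

engine_bytes = builder.build_serialized_network(network, config)
with open("yolov9-qat-int8.engine", "wb") as f:
    f.write(engine_bytes)
```

The same build works from the command line with trtexec (e.g. `trtexec --onnx=yolov9-qat.onnx --int8 --fp16`), which also reports 99th-percentile latency and throughput figures of the kind shown above.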

Full Report


Thanks for sharing with the community!


Added new improvements.

Evaluation Results

Activation: SiLU

| Eval Model | AP | AP50 | Precision | Recall |
| --- | --- | --- | --- | --- |
| Origin (PyTorch) | 0.529 | 0.699 | 0.743 | 0.634 |
| INT8 (PyTorch) | 0.529 | 0.702 | 0.742 | 0.630 |
| INT8 (TensorRT) | 0.529 | 0.696 | 0.739 | 0.635 |

Activation: ReLU

| Eval Model | AP | AP50 | Precision | Recall |
| --- | --- | --- | --- | --- |
| Origin (PyTorch) | 0.519 | 0.690 | 0.719 | 0.629 |
| INT8 (PyTorch) | 0.518 | 0.690 | 0.726 | 0.625 |
| INT8 (TensorRT) | 0.517 | 0.685 | 0.723 | 0.626 |
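A note on the ReLU rows: swapping SiLU for ReLU before QAT generally helps INT8 throughput, since TensorRT can fuse ReLU directly into quantized convolutions. A generic, hypothetical swap (not the repository's exact code) could look like this:

```python
import torch.nn as nn

def replace_silu_with_relu(module: nn.Module) -> None:
    """Recursively replace every nn.SiLU with nn.ReLU, in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            setattr(module, name, nn.ReLU(inplace=True))
        else:
            replace_silu_with_relu(child)

# Usage (hypothetical): apply before calibration/fine-tuning, since the
# activation change itself costs some accuracy, as the tables above show.
# replace_silu_with_relu(model)
```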

Latency