Techniques to Improve TensorRT Model Inference Speed

Description

Ways to improve model inference speed

We aim to enhance the inference speed of our object detection model without compromising accuracy.

Current Setup:

Model Task: Object Detection (3 Classes)

Hardware: NVIDIA Jetson (likely AGX Orin)

Input: Single image/frame inference

Model Versions:

PyTorch Version:

Model Size: 88 MB

Inference Latency: ~420 ms per frame

TensorRT Version (Quantized):

Model Size: 90 MB (FP16)

Inference Latency: ~200 ms per frame
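
For reference, an FP16 engine like the one above can be built directly with the TensorRT 10 Python API along the lines of the sketch below; the ONNX filename, output path, and workspace handling are assumptions, not the exact build steps used here:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit-batch network (TensorRT 10)
parser = trt.OnnxParser(network, logger)

# Parse the exported ONNX model (filename is an assumption)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 kernels

# Build and save the serialized engine
engine_bytes = builder.build_serialized_network(network, config)
with open("model_fp16.engine", "wb") as f:
    f.write(engine_bytes)
```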

Objectives:

  • Further reduce latency to support higher throughput (preferably below 100 ms per frame)

  • Maintain accuracy, or impact it only minimally, when moving to an INT8 quantized version (see the sketch below)
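
Since the setup runs inside the Ultralytics Jetson container, one way to approach the INT8 objective is the Ultralytics export path, which performs TensorRT calibration from a small representative dataset. The snippet below is only a sketch; the weight file and dataset YAML names are assumptions:

```python
from ultralytics import YOLO

# Trained 3-class detection weights (path is an assumption)
model = YOLO("best.pt")

# Export an INT8 TensorRT engine; `data` points to a small representative
# dataset used for calibration (YAML name is hypothetical)
model.export(
    format="engine",
    int8=True,
    data="calibration_data.yaml",
    imgsz=640,
    device=0,
)
```

The resulting INT8 engine would then be compared against the FP16 version on a held-out validation set to check the accuracy objective.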

Environment

TensorRT Version: 10.3.0.30
GPU Type: Tegra
Nvidia Driver Version: 540.4.0
CUDA Version: 12.6.68
CUDNN Version: 9.3.0.75
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10.12
Baremetal or Container (if container which image + tag): ultralytics/ultralytics:latest-jetson-jetpack6
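
For anyone reproducing the latency measurements, a minimal timing loop through the Ultralytics API might look like the sketch below; the engine filename and input size are assumptions, and this is not necessarily how the numbers above were collected:

```python
import time
import numpy as np
from ultralytics import YOLO

# Load the exported TensorRT engine (filename is an assumption)
model = YOLO("model_fp16.engine", task="detect")

frame = np.zeros((640, 640, 3), dtype=np.uint8)  # dummy BGR frame

# Warm-up runs so CUDA context and engine setup are excluded from timing
for _ in range(10):
    model.predict(frame, verbose=False)

# Timed runs
n = 100
start = time.perf_counter()
for _ in range(n):
    model.predict(frame, verbose=False)
avg_ms = (time.perf_counter() - start) / n * 1000
print(f"Average latency: {avg_ms:.1f} ms/frame")
```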