Improving FP32 inference speed for YOLOv10x (Ultralytics) on a Jetson AGX Orin 64GB devkit

I have connected two RealSense D435 cameras to the Jetson AGX Orin 64GB devkit (each camera shows its own independent output and is only used for object detection).

I was under the assumption that building the TensorRT engine with INT8 optimization would yield faster inference.

However, that does not seem to be the case.

Currently, this is the speed I get for the FP32 engine; it also varies from run to run:

0: 640x640 1 person, 48.8ms
Speed: 10.1ms preprocess, 48.8ms inference, 20.4ms postprocess per image at shape (1, 3, 640, 640)
 
0: 640x640 1 person, 56.4ms
Speed: 15.4ms preprocess, 56.4ms inference, 9.2ms postprocess per image at shape (1, 3, 640, 640)
255

If I use INT8, it is not that much faster and the accuracy also drops (as expected): it sometimes fails to detect a person who is clearly in frame, despite a 0.5 confidence threshold. A sketch of how the INT8 engine was exported follows the logs below.

0: 640x640 (no detections), 33.9ms
Speed: 7.0ms preprocess, 33.9ms inference, 2.4ms postprocess per image at shape (1, 3, 640, 640)
 
0: 640x640 2 persons, 23.2ms
Speed: 10.3ms preprocess, 23.2ms inference, 9.5ms postprocess per image at shape (1, 3, 640, 640)
255
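
For reference, I exported the INT8 engine roughly along these lines (a minimal sketch; the calibration dataset YAML passed as data is only a placeholder for whatever images are actually used to calibrate):

from ultralytics import YOLO

# Export a TensorRT INT8 engine from the PyTorch checkpoint.
# INT8 requires calibration images; "coco8.yaml" is only a placeholder here.
model = YOLO("yolov10x.pt")
model.export(format="engine", int8=True, data="coco8.yaml", imgsz=640, device=0)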

For each camera, I am using a dedicated model, which is technically the same engine file copied twice:

model1 = YOLO("yolov10x_cam1_fp32.engine", task="detect")
model2 = YOLO("yolov10x_cam2_fp32.engine", task="detect")

Please note the RGB stream is real-time at 30 FPS. My goal is to have a (near) real-time feel for both cameras when they are connected. Currently, it feels a bit laggy.
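
For context, this is roughly how both cameras and engines are driven (a minimal sketch; the RealSense serial numbers are placeholders and error handling is omitted):

import cv2
import numpy as np
import pyrealsense2 as rs
from ultralytics import YOLO

# One engine per camera (the two files are identical copies).
model1 = YOLO("yolov10x_cam1_fp32.engine", task="detect")
model2 = YOLO("yolov10x_cam2_fp32.engine", task="detect")

def start_camera(serial):
    # Open one D435 RGB stream at 640x480 @ 30 FPS.
    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_device(serial)
    config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
    pipeline.start(config)
    return pipeline

# Placeholder serial numbers for the two D435 cameras.
pipe1 = start_camera("000000000001")
pipe2 = start_camera("000000000002")

while True:
    for name, pipe, model in (("cam1", pipe1, model1), ("cam2", pipe2, model2)):
        color = pipe.wait_for_frames().get_color_frame()
        frame = np.asanyarray(color.get_data())
        results = model(frame, conf=0.5, verbose=False)
        cv2.imshow(name, results[0].plot())
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break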

Some further info:

$ pip show ultralytics
Name: ultralytics
Version: 8.2.90
Summary: Ultralytics YOLOv8 for SOTA object detection, multi-object tracking, instance segmentation, pose estimation and image classification.
Home-page: 
Author: Glenn Jocher, Ayush Chaurasia, Jing Qiu
Author-email: 
License: AGPL-3.0
Location: /home/mona/.local/lib/python3.10/site-packages
Requires: matplotlib, numpy, opencv-python, pandas, pillow, psutil, py-cpuinfo, pyyaml, requests, scipy, seaborn, torch, torchvision, tqdm, ultralytics-thop
Required-by: 

$ python
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux

$ uname -a
Linux ubuntu 5.15.136-tegra #1 SMP PREEMPT Mon May 6 09:56:39 PDT 2024 aarch64 aarch64 aarch64 GNU/Linux
mona@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy
mona@ubuntu:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:08:11_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


(screenshot attached)

Hi,

Have you tried this with TensorRT?
Also, could you run tegrastats at the same time and share the output with us?

Thanks.

Yes, I converted the model to a TensorRT engine using FP32 (if you don't request INT8, the default is FP32) via the Ultralytics API in Python.
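
The conversion was done along these lines (a minimal sketch; with neither half nor int8 set, the engine is built in FP32):

from ultralytics import YOLO

# Export the PyTorch checkpoint to a TensorRT engine.
# Without half=True or int8=True, the engine defaults to FP32.
model = YOLO("yolov10x.pt")
model.export(format="engine", imgsz=640, device=0)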

I also used the jetson_clocks command.

Can you explain this further?

Also, when I run jetson_clocks, I keep getting overcurrent and throttling warnings, even though the temperature is not too high.

Hi,

Please run the following command and share the output with us.

$ sudo tegrastats

The command reports CPU/GPU utilization, so we can tell whether the hardware resources are saturated.

Thanks.