I can share YOLOv8-small-P2 at 1024x1024 on a Jetson Orin 64GB. The P2 variant (which can also be enabled on YOLOv5) adds an extra detection head on the high-resolution P2 feature-pyramid level; it helps detect small objects, but at a computational expense. The performance logs below were captured on DeepStream 6.4, which uses TensorRT 8.6.2.3. Apparently there is a big performance boost when using DeepStream 7.1, which uses TensorRT 10.3.0.31.
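For context, this is roughly how I go from the P2 model to an engine. A minimal sketch using the Ultralytics CLI; the dataset name and output paths are placeholders for my setup:

```
# Train the P2 variant (Ultralytics ships a yolov8-p2.yaml model config;
# the "s" in the filename selects the small scale)
yolo detect train model=yolov8s-p2.yaml data=my_dataset.yaml imgsz=1024

# Export the trained weights to ONNX at the 1024x1024 input size
yolo export model=runs/detect/train/weights/best.pt format=onnx imgsz=1024
```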
FP32 (50.2 FPS):
[04/10/2025-19:06:13] [I] === Performance summary ===
[04/10/2025-19:06:13] [I] Throughput: 52.432 qps
[04/10/2025-19:06:13] [I] Latency: min = 19.7522 ms, max = 20.2943 ms, mean = 19.9232 ms, median = 19.9209 ms, percentile(90%) = 19.9353 ms, percentile(95%) = 19.9408 ms, percentile(99%) = 19.9521 ms
[04/10/2025-19:06:13] [I] Enqueue Time: min = 1.24695 ms, max = 1.31693 ms, mean = 1.27 ms, median = 1.26599 ms, percentile(90%) = 1.28882 ms, percentile(95%) = 1.29651 ms, percentile(99%) = 1.31677 ms
[04/10/2025-19:06:13] [I] H2D Latency: min = 0.840942 ms, max = 0.866699 ms, mean = 0.852472 ms, median = 0.852844 ms, percentile(90%) = 0.858582 ms, percentile(95%) = 0.86084 ms, percentile(99%) = 0.864502 ms
[04/10/2025-19:06:13] [I] GPU Compute Time: min = 18.8267 ms, max = 19.3242 ms, mean = 18.9532 ms, median = 18.9506 ms, percentile(90%) = 18.9625 ms, percentile(95%) = 18.9667 ms, percentile(99%) = 18.985 ms
[04/10/2025-19:06:13] [I] D2H Latency: min = 0.067627 ms, max = 0.124146 ms, mean = 0.117593 ms, median = 0.116943 ms, percentile(90%) = 0.121582 ms, percentile(95%) = 0.12207 ms, percentile(99%) = 0.123474 ms
[04/10/2025-19:06:13] [I] Total Host Walltime: 3.05157 s
[04/10/2025-19:06:13] [I] Total GPU Compute Time: 3.03251 s
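For reference, these summaries come from trtexec; invocations along these lines should reproduce them (the ONNX filename is an assumption from my export step above):

```
# FP32 baseline: no precision flag, trtexec defaults to FP32
trtexec --onnx=yolov8s-p2-1024.onnx --saveEngine=model_fp32.engine

# FP16: same model, just add --fp16
trtexec --onnx=yolov8s-p2-1024.onnx --fp16 --saveEngine=model_fp16.engine
```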
FP16 (98.2 FPS):
[04/10/2025-19:05:03] [I] === Performance summary ===
[04/10/2025-19:05:03] [I] Throughput: 105.899 qps
[04/10/2025-19:05:03] [I] Latency: min = 10.0845 ms, max = 10.4376 ms, mean = 10.1814 ms, median = 10.1794 ms, percentile(90%) = 10.193 ms, percentile(95%) = 10.1981 ms, percentile(99%) = 10.2063 ms
[04/10/2025-19:05:03] [I] Enqueue Time: min = 1.18372 ms, max = 1.2373 ms, mean = 1.2027 ms, median = 1.19922 ms, percentile(90%) = 1.22137 ms, percentile(95%) = 1.22437 ms, percentile(99%) = 1.22864 ms
[04/10/2025-19:05:03] [I] H2D Latency: min = 0.671265 ms, max = 0.764832 ms, mean = 0.684166 ms, median = 0.682861 ms, percentile(90%) = 0.689697 ms, percentile(95%) = 0.693848 ms, percentile(99%) = 0.704102 ms
[04/10/2025-19:05:03] [I] GPU Compute Time: min = 9.33789 ms, max = 9.66742 ms, mean = 9.41251 ms, median = 9.41162 ms, percentile(90%) = 9.422 ms, percentile(95%) = 9.42456 ms, percentile(99%) = 9.4353 ms
[04/10/2025-19:05:03] [I] D2H Latency: min = 0.0671387 ms, max = 0.097229 ms, mean = 0.0847006 ms, median = 0.0845337 ms, percentile(90%) = 0.0860596 ms, percentile(95%) = 0.0865479 ms, percentile(99%) = 0.0872803 ms
[04/10/2025-19:05:03] [I] Total Host Walltime: 3.03118 s
I can't seem to find the INT8 calibration run I did, but in my experience it roughly halves the FP16 mean latency at the cost of a 6-7% hit to accuracy.
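If you want to try INT8 yourself, a minimal trtexec sketch, assuming you already have a calibration cache (both filenames here are illustrative):

```
# --int8 enables INT8 kernels; --calib points trtexec at an existing
# calibration cache so it doesn't calibrate from scratch
trtexec --onnx=yolov8s-p2-1024.onnx --int8 \
        --calib=calibration.cache --saveEngine=model_int8.engine
```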
For a Jetson Orin Nano, if you are running a lot of feeds you will probably want a nano-sized model. Otherwise, if you use the standard YOLOv5-small, you definitely need to stick to FP16 or INT8; that precision is selected in the nvinfer config, as in the sketch below.
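A minimal sketch of the relevant [property] keys in the gst-nvinfer config; the engine and calibration file names are placeholders:

```
[property]
# network-mode: 0=FP32, 1=INT8, 2=FP16
network-mode=2
model-engine-file=model_b1_gpu0_fp16.engine
# for INT8 (network-mode=1) you also need a calibration table:
# int8-calib-file=calib.table
```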
I pulled the latest DeepStream 7.1 Docker image (nvcr.io/nvidia/deepstream:7.1-triton-multiarch) to run a quick test, launched it roughly as shown below, and re-ran the FP16 benchmark.
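Roughly how I bring the container up on the Orin; the volume mount is just my local model directory, adjust to your setup:

```
docker pull nvcr.io/nvidia/deepstream:7.1-triton-multiarch
# --runtime nvidia exposes the GPU to the container on Jetson
docker run -it --rm --runtime nvidia --network host \
    -v /path/to/models:/models \
    nvcr.io/nvidia/deepstream:7.1-triton-multiarch
```

The FP16 results inside the 7.1 container: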
[04/10/2025-19:08:33] [I] === Performance summary ===
[04/10/2025-19:08:33] [I] Throughput: 111.74 qps
[04/10/2025-19:08:33] [I] Latency: min = 9.53931 ms, max = 9.92157 ms, mean = 9.67231 ms, median = 9.67126 ms, percentile(90%) = 9.68427 ms, percentile(95%) = 9.6875 ms, percentile(99%) = 9.70447 ms
[04/10/2025-19:08:33] [I] Enqueue Time: min = 1.50317 ms, max = 1.5874 ms, mean = 1.53303 ms, median = 1.52954 ms, percentile(90%) = 1.55237 ms, percentile(95%) = 1.55933 ms, percentile(99%) = 1.57642 ms
[04/10/2025-19:08:33] [I] H2D Latency: min = 0.588745 ms, max = 0.685242 ms, mean = 0.604583 ms, median = 0.604248 ms, percentile(90%) = 0.610352 ms, percentile(95%) = 0.61377 ms, percentile(99%) = 0.620605 ms
[04/10/2025-19:08:33] [I] GPU Compute Time: min = 8.86865 ms, max = 9.16568 ms, mean = 8.92204 ms, median = 8.9209 ms, percentile(90%) = 8.92993 ms, percentile(95%) = 8.93347 ms, percentile(99%) = 8.9425 ms
[04/10/2025-19:08:33] [I] D2H Latency: min = 0.0664062 ms, max = 0.159454 ms, mean = 0.145687 ms, median = 0.145874 ms, percentile(90%) = 0.14917 ms, percentile(95%) = 0.150146 ms, percentile(99%) = 0.151611 ms
[04/10/2025-19:08:33] [I] Total Host Walltime: 3.03382 s
[04/10/2025-19:08:33] [I] Total GPU Compute Time: 3.02457 s
Only slightly more performant: about 5-6% higher throughput (111.7 vs 105.9 qps) and roughly 0.5 ms less mean GPU compute time than on DeepStream 6.4.