I recently performed an inference time estimation for running YOLOv8s on the NVIDIA Jetson Orin NX 16GB, and I noticed significant discrepancies between the theoretical and benchmarked results.
Summary from datasheets
Model (YOLOv8s) Complexity: 29.7 GFLOPs per inference
Jetson Orin NX 16GB:
INT8 Performance: 50 TOPS
FP16 Performance: ~25 TFLOPS
Calculated Inference Time (Theoretical):
Using the formula: Inference Time = Model FLOPs per inference / Peak device FLOPS
Result: 29.7 GFLOPs / 25 TFLOPS (FP16) ≈ 1.19 ms
Actual Benchmark
Inference Time: 7.94 ms
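For reference, here is a quick sketch that redoes the arithmetic above and reports what fraction of the FP16 peak the measured time corresponds to; it uses only the figures quoted in this post and nothing Jetson-specific:

```python
# Back-of-the-envelope check using only the figures quoted above.
model_gflops = 29.7           # YOLOv8s, GFLOPs per inference
peak_tflops_fp16 = 25.0       # Orin NX 16GB, FP16 peak from the datasheet
measured_ms = 7.94            # benchmarked inference time

theoretical_ms = model_gflops / peak_tflops_fp16   # GFLOPs / TFLOPS comes out in ms
fraction_of_peak = theoretical_ms / measured_ms

print(f"Theoretical lower bound: {theoretical_ms:.2f} ms")                  # ~1.19 ms
print(f"Fraction of FP16 peak actually achieved: {fraction_of_peak:.1%}")   # ~15%
```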
I understand that the theoretical calculation tends to be overly optimistic, but I’m hoping to understand more precisely why such discrepancies occur. Could you provide some insights into where the additional latency originates in real-world usage?
Additionally, are there more appropriate tools for estimating these performance metrics, or perhaps a white paper that explains how to approach this type of estimation more accurately? Guidance or resources on how to correctly factor real-world variables into inference-time calculations would be very helpful.
It would be great to get a clearer picture of the factors affecting this latency gap and how to potentially improve it.
Thanks for your help in understanding these real-world performance challenges!
A small slice of the answer usually comes down to cache. If you are going through the CPU, the data set is small, and not many other programs are running, then you are likely to get a lot of fast responses from memory due to cache hits. Once you start running more programs on that core, or the data set gets larger, you will suffer more cache misses. This is not hard real-time hardware, so you will never be able to completely control cache hits and misses.
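If you want to see that effect in isolation, here is a minimal sketch (plain Python/NumPy, nothing Jetson-specific, timings machine-dependent) that times random reads from a working set that stays cache-resident versus one far larger than any cache:

```python
# Minimal sketch of the working-set effect described above: random access
# into a small array (cache-resident) vs. a large one (cache misses dominate).
import time
import numpy as np

rng = np.random.default_rng(0)
n_lookups = 2_000_000

def time_gather(array_len):
    data = np.arange(array_len, dtype=np.int64)
    idx = rng.integers(0, array_len, size=n_lookups)
    t0 = time.perf_counter()
    data[idx].sum()                        # random gather: cache behavior dominates
    return time.perf_counter() - t0

small = time_gather(16 * 1024)             # ~128 KB working set, cache-resident
large = time_gather(64 * 1024 * 1024)      # ~512 MB working set, far beyond cache

print(f"small working set: {small * 1e3:.1f} ms")
print(f"large working set: {large * 1e3:.1f} ms   (cache misses dominate)")
```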
Additionally, software that raises a software interrupt (soft IRQ) can be serviced on any CPU core, whereas a hardware interrupt (hard IRQ) can only be serviced on a core the IRQ is physically wired to. On these boards, much of the hardware can only be serviced on CPU0 because the other cores lack that wiring. The more data and code you have running through CPU0, e.g., from Ethernet activity or disk access, the more cache misses you will get there, and anything that puts a high enough load on CPU0 will eventually start impacting every device that depends on a hardware IRQ.
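To see how interrupt servicing is distributed across cores on your own board, a small sketch like the following can summarize /proc/interrupts (standard on Linux/L4T; the exact row layout varies by kernel, so treat this as illustrative):

```python
# Minimal sketch: total hardware interrupts serviced per CPU core,
# parsed from /proc/interrupts. Illustrates how much lands on CPU0.
def irq_counts_per_cpu(path="/proc/interrupts"):
    with open(path) as f:
        cpus = f.readline().split()                # header row: CPU0 CPU1 ...
        totals = [0] * len(cpus)
        for line in f:
            fields = line.split()
            # per-IRQ rows look like: "38:  123  0  0  0  <chip>  eth0"
            counts = fields[1:1 + len(cpus)]
            if len(counts) == len(cpus) and all(c.isdigit() for c in counts):
                for i, c in enumerate(counts):
                    totals[i] += int(c)
    return dict(zip(cpus, totals))

if __name__ == "__main__":
    for cpu, total in irq_counts_per_cpu().items():
        print(f"{cpu}: {total} interrupts serviced")
```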
You might find that in some cases the data does not depend much on CPU0 because it has been transferred to the GPU; although the time to transfer in and out of the GPU matters, the GPU itself can work on a large set of data (once it has that data) somewhat independently of the CPU. Also, tricks like DMA transfers can remove the need for a CPU core for some of that traffic.
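As an illustration, assuming a PyTorch-based pipeline (Ultralytics YOLOv8 runs on PyTorch) and an available CUDA device, pinned host memory plus an asynchronous copy lets the GPU's DMA engine move the data while the CPU keeps working:

```python
# Minimal sketch: pinned (page-locked) host memory lets the GPU copy engine
# pull data via DMA, and non_blocking=True overlaps that copy with CPU work.
import torch

device = torch.device("cuda")

# Dummy 640x640 input batch standing in for a preprocessed image.
host_batch = torch.empty(1, 3, 640, 640).pin_memory()

# Asynchronous host-to-device copy: returns immediately, the DMA engine
# moves the data while the CPU continues with other work.
gpu_batch = host_batch.to(device, non_blocking=True)

# ... do unrelated CPU work here (e.g. preprocess the next frame) ...

# Synchronize before actually relying on the copied data.
torch.cuda.synchronize()
```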
I have no advice on practical ways to improve this other than making sure CPU0 is doing as little as possible during your test (e.g., making sure programs that don't need to run are shut down for the moment).
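One low-effort version of that idea, sketched below under the assumption of a multi-core Linux system such as L4T on the Orin NX, is to restrict your own benchmark process to cores other than CPU0; this does not move the IRQs themselves, it only keeps your workload from competing with them:

```python
# Minimal sketch: keep the benchmark process off CPU0 so it competes less
# with hard-IRQ servicing there (Linux-only APIs).
import os

all_cpus = os.sched_getaffinity(0)          # CPUs this process may run on
non_cpu0 = all_cpus - {0} or all_cpus       # fall back if only CPU0 exists
os.sched_setaffinity(0, non_cpu0)

print(f"Benchmark process now restricted to CPUs: {sorted(non_cpu0)}")
# ... run the inference benchmark from here ...
```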