Jetson Orin Nano FP16/INT8 performance

Hi admin,
The Jetson Orin Nano uses the same GPU architecture as the 3060 Ti. Whether in CUDA core count or clock speed, the Nano falls short of the 3060 Ti, yet its FP16 compute is rated at about 17 TFLOPS, comparable to the 3060 Ti.
In my actual testing with YOLOv8s, the Nano does not match the 3060 Ti; its frame rate is roughly half that of the 3060 Ti.

1. What causes this gap?
2. How is the Nano's 17 TFLOPS figure calculated, and is it fully available to me when I run a YOLOv8s model?
3. The Nano is rated at 17 TFLOPS FP16 and 33 TOPS INT8, but with an INT8 model the frame rate is only about 1.5 times that of FP16. Is this normal?


I look forward to getting your answers. Thank you.
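
For reference, the rated numbers quoted above imply a theoretical INT8/FP16 speedup of about 1.9x, so the observed 1.5x already suggests the network is not purely compute-bound. A quick sanity check (illustrative arithmetic only, using the figures from this post):

```python
# Rated throughput figures quoted above (Jetson Orin Nano)
fp16_tflops = 17.0  # rated FP16 TFLOPS
int8_tops = 33.0    # rated INT8 TOPS

theoretical_speedup = int8_tops / fp16_tflops
observed_speedup = 1.5  # measured frame-rate ratio

print(f"theoretical INT8/FP16 speedup: {theoretical_speedup:.2f}x")  # ~1.94x
print(f"observed speedup:              {observed_speedup:.2f}x")
```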

Hi,

Could you share how you run the benchmark?

We expect the benchmark to be run with TensorRT or the CUTLASS library.
Please note that you can maximize the Orin Nano's performance with the following commands (super mode):

$ sudo nvpmodel -m 2
$ sudo jetson_clocks

Thanks.

I have already set these two:

$ sudo nvpmodel -m 2
$ sudo jetson_clocks

Inference uses the TensorRT C++ API.

Hi,

Do you use trtexec to benchmark the inference part only?
Could you share your benchmark code with us or try it with trtexec?
Thanks.

Here are the trtexec benchmark results on a Jetson Orin Nano 8GB:

Nano INT8: 157 qps

Nano FP16: 103 qps

3060 Ti FP16: 464 qps

Looking at the test data, the Nano does not seem to deliver the compute throughput the specifications state?

Hi,

How do you convert the qps into FP16 TFLOPS?

Since TensorRT has optimized the model and changed its architecture, it is hard to know the exact operations used for inference.
But CUTLASS can output TFLOPS directly. We recommend giving it a try.

Thanks.

For the same model, shouldn't the QPS be proportional to the GPU's TFLOPS?

GPU            Precision  qps  TFLOPS
Orin Nano 8GB  FP16       103  17 (ref)
3060 Ti        FP16       464  16.2 (ref)

I'll try CUTLASS later and let you know the results.
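
As a quick check of that proportionality question, here is the throughput per rated TFLOP from the table above (illustrative arithmetic only):

```python
# Measured trtexec throughput and rated FP16 compute from the table above
nano_qps, nano_tflops = 103.0, 17.0
ti_qps, ti_tflops = 464.0, 16.2

print(f"Orin Nano: {nano_qps / nano_tflops:.1f} qps per TFLOPS")  # ~6.1
print(f"3060 Ti:   {ti_qps / ti_tflops:.1f} qps per TFLOPS")      # ~28.6
# If QPS were strictly proportional to TFLOPS, these two numbers would match.
# The large gap shows other factors (memory bandwidth, clocks, CPU) dominate.
```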

Hi,

Usually, we recommend testing this with a GEMM directly (CUTLASS).

TFLOPS measures the GPU's peak computational performance.
A model usually contains layers that may be limited by memory read/write, memory bandwidth, or CPU performance.
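
As a rough illustration, a roofline-style check shows how much arithmetic per byte a kernel needs before peak TFLOPS matters at all. The memory bandwidth value below is an assumption for illustration, not an official specification:

```python
# Roofline ridge point: the arithmetic intensity at which a kernel shifts from
# memory-bound to compute-bound. Hardware numbers are illustrative assumptions.
peak_flops = 17.0e12   # rated FP16 compute (FLOP/s), figure from this thread
mem_bw = 68.0e9        # assumed memory bandwidth in bytes/s (illustrative)

ridge = peak_flops / mem_bw
print(f"ridge point: {ridge:.0f} FLOPs per byte")  # 250
# Layers whose arithmetic intensity falls below this value are limited by
# memory bandwidth and cannot reach the rated TFLOPS.
```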

Below is some discussion of CUTLASS for your reference:

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.