Hi admin,
The Jetson Orin Nano is the same GPU architecture as the 3060 Ti. Whether in CUDA core count or clock speed, the Nano is not as good as the 3060 Ti, and its FP16 compute is only about 17 TFLOPS, well below the 3060 Ti's.
In my actual testing with YOLOv8s, the Orin Nano's frame rate is only about half that of the 3060 Ti.
1. What is the cause of this?
2. How is the Nano's 17 TFLOPS figure calculated, and is it fully available to me when I run a YOLOv8s model?
3. The Nano is rated at 17 TFLOPS for FP16 and 33 TOPS for INT8, yet when I use an INT8 model the frame rate only reaches about 1.5x that of FP16. Is this normal?
We would suggest testing this with TensorRT or the CUTLASS library.
Please note that you can maximize the Orin Nano's performance with the following command (super mode):
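A minimal sketch of the usual way to enable the maximum power mode, assuming JetPack 6.2 or later where the Orin Nano exposes the MAXN SUPER mode (the mode index below is an assumption; query your board first):

```shell
# List the power modes available on this module and the current one.
sudo nvpmodel -q

# Assumption: MAXN SUPER is mode index 2 on this JetPack release;
# use the index reported by the query above for your board.
sudo nvpmodel -m 2

# Lock clocks to their maximum for stable benchmarking.
sudo jetson_clocks
```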
Since TensorRT optimizes the model and changes its architecture, it is hard to know the exact operations used for inference.
CUTLASS, on the other hand, can report TFLOPS directly; we usually recommend measuring raw throughput with a plain GEMM (CUTLASS), so it is worth a try.
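As an illustration, a GEMM throughput run might look like the following, assuming CUTLASS has been built from source with the profiler target (the path and problem sizes are illustrative, not prescriptive):

```shell
# Build the profiler (from the CUTLASS repo root).
mkdir -p build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=87   # Orin's GPU is SM 8.7 (Ampere)
make cutlass_profiler -j

# Run FP16 GEMM kernels on a large square problem; the profiler
# prints the achieved GFLOPs for each kernel it runs.
./tools/profiler/cutlass_profiler \
    --operation=Gemm \
    --m=4096 --n=4096 --k=4096 \
    --A=f16 --B=f16 --C=f16
```

Comparing the best achieved GFLOPs against the datasheet peak shows how much of the theoretical number is reachable on your module.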
TFLOPS measures the GPU's raw computational throughput.
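To answer question 2, a peak figure like this is usually derived as tensor cores x FLOPs per core per cycle x clock rate. The numbers below are assumptions for an Orin Nano in super mode (32 Ampere tensor cores, ~512 dense FP16 FLOPs per core per cycle, ~1.02 GHz); check the module datasheet for the exact values:

```python
# Back-of-the-envelope peak-TFLOPS estimate (assumed, not measured, specs).
tensor_cores = 32                     # Ampere tensor cores on Orin Nano
fp16_flops_per_core_per_cycle = 512   # dense FP16 FMA throughput per core
clock_hz = 1.02e9                     # assumed super-mode GPU clock

peak_tflops = tensor_cores * fp16_flops_per_core_per_cycle * clock_hz / 1e12
print(f"{peak_tflops:.1f} TFLOPS")    # ~16.7, i.e. the quoted ~17 TFLOPS
```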
A model, however, usually contains layers that are limited by memory read/write, memory bandwidth, or CPU performance rather than by compute, so the theoretical peak is not fully available end to end.
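This also explains the 1.5x observation in question 3: only the compute-bound portion of inference benefits from INT8's doubled math rate, so the end-to-end gain is capped by Amdahl's law. A hedged back-of-the-envelope sketch with illustrative (not measured) numbers:

```python
def end_to_end_speedup(compute_fraction, compute_speedup):
    """Amdahl's law: only `compute_fraction` of the runtime accelerates."""
    return 1.0 / ((1.0 - compute_fraction) + compute_fraction / compute_speedup)

# Example: if roughly 2/3 of FP16 inference time is tensor-core math and
# INT8 doubles that math rate, the overall frame-rate gain is only ~1.5x,
# matching the behavior you observed.
print(round(end_to_end_speedup(compute_fraction=2 / 3, compute_speedup=2.0), 2))
```

So a 1.5x frame-rate gain from INT8 is normal whenever a meaningful share of the runtime is memory- or CPU-bound.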
Below is some discussion of CUTLASS for your reference: