Jetson Orin Nano Super Dev Kit Performance

To test the performance of my Orin Nano Dev Kit with the JetPack 6.1.1 update, I tried running the benchmarks listed here: Benchmarks - NVIDIA Jetson AI Lab.
But my benchmark results were close to the Jetson Orin Nano (original) results rather than the Super, even though I have L4T 36.4.2 and JetPack 6.1.1 and am running in the MAXN power mode.
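
(For anyone double-checking the same thing, the release and power mode can be confirmed with the stock L4T tools below; the comments describe the expected output rather than anything copied verbatim from my board:)

cat /etc/nv_tegra_release                      # L4T release string (should show R36, REVISION: 4.2)
apt-cache show nvidia-jetpack | grep Version   # version of the JetPack meta-package
sudo nvpmodel -q                               # active power mode name and ID (should report MAXN)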

My benchmark results were (prefill/decode times in seconds, rates in tokens/sec):

┌───────────────────────────────────────────────────┬──────────────┬───────────────┬──────────────────────┬────────────────────┬────────────────────┬────────────────────┬───────────────┐
│                       model                       │ input_tokens │ output_tokens │     prefill_time     │    prefill_rate    │    decode_time     │    decode_rate     │    memory     │
├───────────────────────────────────────────────────┼──────────────┼───────────────┼──────────────────────┼────────────────────┼────────────────────┼────────────────────┼───────────────┤
│  HF://dusty-nv/Qwen2.5-7B-Instruct-q4f16_ft-MLC   │           19 │           128 │  0.09362168966666667 │ 203.19562173464928 │  8.312818675569554 │ 15.415974259472037 │    1137.40625 │
│  HF://mlc-ai/gemma-2-2b-it-q4f16_1-MLC            │           13 │           107 │  0.10108033366666669 │ 110.55510910207386 │ 4.2831975709152434 │ 25.143537873069416 │ 1350.90234375 │
│  HF://mlc-ai/gemma-2-9b-it-q4f16_1-MLC            │           19 │           112 │   0.3674690300000001 │ 52.607414458243184 │ 11.361786206959943 │   9.90668822595116 │  1765.7421875 │
│  HF://dusty-nv/Phi-3.5-mini-instruct-q4f16_ft-MLC │           17 │           128 │  0.06081533433333333 │ 280.88185846653334 │  5.146252640755906 │ 24.872552661269584 │  995.42578125 │
│  HF://dusty-nv/SmolLM2-135M-Instruct-q4f16_ft-MLC │            7 │           128 │ 0.014524838333333335 │  396.9837310105139 │  0.920403588031496 │  139.0801310427276 │     1107.5625 │
│  HF://dusty-nv/SmolLM2-360M-Instruct-q4f16_ft-MLC │            8 │           108 │ 0.012918514333333334 │  477.5950234813998 │ 0.9676859710808632 │ 111.86486025638118 │ 1141.50390625 │
│  HF://dusty-nv/SmolLM2-1.7B-Instruct-q4f16_ft-MLC │           14 │           128 │ 0.030061954000000002 │  439.2040223981106 │ 2.9376747241154857 │  43.57207748997136 │  1043.3671875 │
│  HF://dusty-nv/Llama-3.1-8B-Instruct-q4f16_ft-MLC │           18 │           128 │  0.09009451900000001 │  203.6137725906588 │  7.849457652913387 │  16.41234380741825 │  1295.6171875 │
│  HF://dusty-nv/Llama-3.2-3B-Instruct-q4f16_ft-MLC │           18 │           128 │ 0.054136466666666674 │  338.6110429267449 │  4.672917605123359 │ 27.394388706154878 │  1196.9140625 │
└───────────────────────────────────────────────────┴──────────────┴───────────────┴──────────────────────┴────────────────────┴────────────────────┴────────────────────┴───────────────┘

The decode rate values are mostly lower than those reported on the benchmark page for the Super version.

My system details are listed below:

  • docker image used by the benchmark: dustynv/mlc:0.1.4-r36.4.2 (a rough launch sketch follows this list)

  • jtop info:

System specs (from jtop):

  Platform
    Machine: aarch64
    System: Linux
    Distribution: Ubuntu 22.04 Jammy Jellyfish
    Release: 5.15.148-tegra
    Python: 3.10.12
    Hostname: ubuntu
    Serial Number: [redacted]

  Hardware
    Model: NVIDIA Jetson Orin Nano Developer Kit
    Module: NVIDIA Jetson Orin Nano (Developer kit)
    699-level Part Number: 699-13767-0005-300 R.1
    P-Number: p3767-0005
    SoC: tegra234
    CUDA Arch BIN: 8.7
    L4T: 36.4.2
    Jetpack: 6.1 (rev1)

  Libraries
    CUDA: 12.6.68
    cuDNN: 9.3.0.75
    TensorRT: 10.3.0.30
    VPI: 3.2.4
    Vulkan: 1.3.204
    OpenCV: 4.5.4 (with CUDA: NO)

  Interfaces
    wlP1p1s0: 10.20.1.46
    docker0: 172.17.0.1
    br-2067a88f117f: 172.18.0.1

  • nvpmodel -q output:
NV Power Mode: MAXN
2
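
For completeness, this is roughly how the container gets launched; the mount path and flags below are an approximation rather than the exact command (the authoritative one is on the Benchmarks page):

# rough sketch of starting the MLC benchmark container; /mnt/nvme/models is a placeholder data dir
docker run --runtime nvidia -it --rm --network host \
    -v /mnt/nvme/models:/data/models \
    dustynv/mlc:0.1.4-r36.4.2
# the benchmark script itself is then run inside the container per the Jetson AI Lab instructions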

Hi @mohamed.alsalti, while some of your results, like gemma-2-9b, are closer to the Super performance, there seems to be a lot of variance - can you check that your clocks look like this with sudo tegrastats?

01-05-2025 05:08:11 RAM 2414/7620MB (lfb 21x4MB) SWAP 629/16384MB (cached 73MB) 
CPU [1%@1728,2%@1728,0%@1728,1%@1728,0%@1728,0%@1728] 
EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1019] 
NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 
cpu@50.5C soc2@49.718C soc0@50.656C gpu@49.937C tj@51.031C soc1@51.031C 
VDD_IN 7183mW/7183mW VDD_CPU_GPU_CV 1598mW/1598mW VDD_SOC 2650mW/2650mW

You may want to try running the sudo jetson_clocks script to disable DVFS (it will lock the clocks to the max for the given power profile, in this case MAXN).
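
Something like this, from memory (check jetson_clocks --help for the exact options on your release):

sudo jetson_clocks --show     # print the current CPU/GPU/EMC clocks and governor state
sudo jetson_clocks --store    # optionally save the current settings so they can be restored later
sudo jetson_clocks            # lock clocks to the max for the active nvpmodel profile
sudo tegrastats               # GR3D_FREQ / EMC_FREQ should now stay pinned at their max frequencies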

Also, we typically use NVMe, with the containers and models stored on NVMe (like here).
In the benchmarks, the first prompt of each model is discarded as a warm-up, which should avoid disk I/O effects after that, but just checking.
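
If you want to double-check where the containers and models actually live, something like this will show it (adjust to your layout):

lsblk                                              # is an NVMe drive present and mounted?
df -h /var/lib/docker                              # which disk backs the default docker data dir
docker info 2>/dev/null | grep 'Docker Root Dir'   # the data-root docker is actually configured with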

Also, you might want to check the kernel logs (sudo dmesg) to see if any errors are being reported that might be limiting the board from reaching full performance.
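
For example, filtering for the usual suspects (the pattern is just a starting point):

sudo dmesg | grep -iE 'throttl|thermal|oom|fail|error'   # look for throttling, thermal, OOM or driver errors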

This is a pure guess, as I don't work with the packages you are testing. I did notice that CUDA is not active in OpenCV. If those packages rely on CUDA and it is not configured properly, they might be falling back to CPU-only when they need CUDA and cannot find it.
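
A quick way to check is to ask OpenCV directly (if I remember the Python API right):

python3 -c "import cv2; print(cv2.getBuildInformation())" | grep -i cuda   # CUDA lines in the build info say YES/NO
python3 -c "import cv2; print(cv2.cuda.getCudaEnabledDeviceCount())"       # 0 means no CUDA-capable build/device found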

Hello @dusty_nv, thank you for your answer. Indeed, I noticed a difference in my tegrastats output compared to yours, so I ran jetson_clocks and saw a considerable improvement in the performance of some models.
The new benchmark results (decode rate in tokens/sec) are:

┌────────────────────────────────────┬────────────────────┐
│             model_name             │    decode_rate     │
├────────────────────────────────────┼────────────────────┤
│ Llama-3.1-8B-Instruct-q4f16_ft-MLC │ 18.424824685772936 │
│ Llama-3.2-3B-Instruct-q4f16_ft-MLC │ 39.943680612648755 │
│ Qwen2.5-7B-Instruct-q4f16_ft-MLC   │ 16.259322873645942 │
│ gemma-2-2b-it-q4f16_1-MLC          │  34.91145609786777 │
│ gemma-2-9b-it-q4f16_1-MLC          │   9.47551773113478 │
│ Phi-3.5-mini-instruct-q4f16_ft-MLC │ 28.962488445126468 │
│ SmolLM2-1.7B-Instruct-q4f16_ft-MLC │  64.27114472726157 │
└────────────────────────────────────┴────────────────────┘

The most noticeable changes were in the Llama and SmolLM models, bringing them more in line with the Super results. There wasn't much improvement in Qwen and Phi, but I could get a small bump by increasing MAX_NUM_PROMPTS in the benchmark script from 4 to 10.
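
(If anyone wants to do the same, something like the line below would change it, assuming your local copy of the script is named benchmark.py and the assignment literally reads MAX_NUM_PROMPTS = 4; both are assumptions on my part:)

sed -i 's/MAX_NUM_PROMPTS = 4/MAX_NUM_PROMPTS = 10/' benchmark.py   # hypothetical filename, adjust to your copy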

Thank you for your reply @foxsquirrel1. The script I'm using runs LLMs to benchmark performance, so OpenCV should be irrelevant here. I'll make sure to try building it with CUDA support for my computer vision applications, though.
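
(Roughly the configure step I have in mind for that; the flags are standard OpenCV CMake options, CUDA_ARCH_BIN=8.7 matches the SM version jtop reports, and the source/contrib paths are placeholders:)

# sketch of an OpenCV CUDA build configure, run from a separate build directory
cmake -D CMAKE_BUILD_TYPE=Release \
      -D WITH_CUDA=ON \
      -D CUDA_ARCH_BIN=8.7 \
      -D OPENCV_EXTRA_MODULES_PATH=../opencv_contrib/modules \
      ../opencv
make -j$(nproc)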

Thanks for letting me know, @mohamed.alsalti, that using jetson_clocks improved the performance - I will note it in the directions. I had not been needing it, while some others had been, and given what you said about increasing the number of prompts, I think that helps with it too.

Currently I am going back over the benchmarks to include other APIs as measured at the endpoint level (i.e. including the request over the OpenAI protocol, etc.). That won't be raw tokens/sec, but it will stand in for real-world usage if you are using these as microservices.
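
As a rough illustration of what I mean by endpoint level, the measurement wraps a normal OpenAI-style request like this one (the host, port and model name here are placeholders, not the actual benchmark harness):

curl http://localhost:9000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Write a haiku about GPUs"}], "max_tokens": 128}'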
