Jetson Orin Nano Super Dev Kit Performance

To test the performance of my Orin Nano Dev Kit with the JetPack 6.1.1 update, I tried running the benchmarks listed here: Benchmarks - NVIDIA Jetson AI Lab.
But my benchmark results were close to the Jetson Orin Nano (original) results rather than the Super, even though I have L4T 36.4.2 and JetPack 6.1.1 and am running in the MAXN power mode.
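
(For anyone double-checking the same thing, the release and power mode can be confirmed with the stock L4T tools below; the comments describe the expected output rather than anything copied verbatim from my board:)

cat /etc/nv_tegra_release                      # L4T release string (should show R36, REVISION: 4.2)
apt-cache show nvidia-jetpack | grep Version   # version of the JetPack meta-package
sudo nvpmodel -q                               # active power mode name and ID (should report MAXN)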

My benchmark results were (prefill/decode times in seconds, rates in tokens/sec):

┌───────────────────────────────────────────────────┬──────────────┬───────────────┬──────────────────────┬────────────────────┬────────────────────┬────────────────────┬───────────────┐
│                       model                       │ input_tokens │ output_tokens │     prefill_time     │    prefill_rate    │    decode_time     │    decode_rate     │    memory     │
├───────────────────────────────────────────────────┼──────────────┼───────────────┼──────────────────────┼────────────────────┼────────────────────┼────────────────────┼───────────────┤
│  HF://dusty-nv/Qwen2.5-7B-Instruct-q4f16_ft-MLC   │           19 │           128 │  0.09362168966666667 │ 203.19562173464928 │  8.312818675569554 │ 15.415974259472037 │    1137.40625 │
│  HF://mlc-ai/gemma-2-2b-it-q4f16_1-MLC            │           13 │           107 │  0.10108033366666669 │ 110.55510910207386 │ 4.2831975709152434 │ 25.143537873069416 │ 1350.90234375 │
│  HF://mlc-ai/gemma-2-9b-it-q4f16_1-MLC            │           19 │           112 │   0.3674690300000001 │ 52.607414458243184 │ 11.361786206959943 │   9.90668822595116 │  1765.7421875 │
│  HF://dusty-nv/Phi-3.5-mini-instruct-q4f16_ft-MLC │           17 │           128 │  0.06081533433333333 │ 280.88185846653334 │  5.146252640755906 │ 24.872552661269584 │  995.42578125 │
│  HF://dusty-nv/SmolLM2-135M-Instruct-q4f16_ft-MLC │            7 │           128 │ 0.014524838333333335 │  396.9837310105139 │  0.920403588031496 │  139.0801310427276 │     1107.5625 │
│  HF://dusty-nv/SmolLM2-360M-Instruct-q4f16_ft-MLC │            8 │           108 │ 0.012918514333333334 │  477.5950234813998 │ 0.9676859710808632 │ 111.86486025638118 │ 1141.50390625 │
│  HF://dusty-nv/SmolLM2-1.7B-Instruct-q4f16_ft-MLC │           14 │           128 │ 0.030061954000000002 │  439.2040223981106 │ 2.9376747241154857 │  43.57207748997136 │  1043.3671875 │
│  HF://dusty-nv/Llama-3.1-8B-Instruct-q4f16_ft-MLC │           18 │           128 │  0.09009451900000001 │  203.6137725906588 │  7.849457652913387 │  16.41234380741825 │  1295.6171875 │
│  HF://dusty-nv/Llama-3.2-3B-Instruct-q4f16_ft-MLC │           18 │           128 │ 0.054136466666666674 │  338.6110429267449 │  4.672917605123359 │ 27.394388706154878 │  1196.9140625 │
└───────────────────────────────────────────────────┴──────────────┴───────────────┴──────────────────────┴────────────────────┴────────────────────┴────────────────────┴───────────────┘

The decode rate values are mostly lower than those reported on the benchmark page for the Super version.

My system details are listed below:

  • docker image used by the benchmark: dustynv/mlc:0.1.4-r36.4.2 (a rough launch sketch follows this list)

  • jtop info:

System specs (from jtop):

  Platform
    Machine: aarch64
    System: Linux
    Distribution: Ubuntu 22.04 Jammy Jellyfish
    Release: 5.15.148-tegra
    Python: 3.10.12
    Hostname: ubuntu
    Serial Number: [redacted]

  Hardware
    Model: NVIDIA Jetson Orin Nano Developer Kit
    Module: NVIDIA Jetson Orin Nano (Developer kit)
    699-level Part Number: 699-13767-0005-300 R.1
    P-Number: p3767-0005
    SoC: tegra234
    CUDA Arch BIN: 8.7
    L4T: 36.4.2
    Jetpack: 6.1 (rev1)

  Libraries
    CUDA: 12.6.68
    cuDNN: 9.3.0.75
    TensorRT: 10.3.0.30
    VPI: 3.2.4
    Vulkan: 1.3.204
    OpenCV: 4.5.4 (with CUDA: NO)

  Interfaces
    wlP1p1s0: 10.20.1.46
    docker0: 172.17.0.1
    br-2067a88f117f: 172.18.0.1

  • nvpmodel -q output:
NV Power Mode: MAXN
2
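
For completeness, this is roughly how the container gets launched; the mount path and flags below are an approximation rather than the exact command (the authoritative one is on the Benchmarks page):

# rough sketch of starting the MLC benchmark container; /mnt/nvme/models is a placeholder data dir
docker run --runtime nvidia -it --rm --network host \
    -v /mnt/nvme/models:/data/models \
    dustynv/mlc:0.1.4-r36.4.2
# the benchmark script itself is then run inside the container per the Jetson AI Lab instructions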

Hi @mohamed.alsalti, while some of your results, like gemma-2-9b, are closer to the Super performance, there seems to be a lot of variance - can you check that your clocks look like this with sudo tegrastats?

01-05-2025 05:08:11 RAM 2414/7620MB (lfb 21x4MB) SWAP 629/16384MB (cached 73MB) 
CPU [1%@1728,2%@1728,0%@1728,1%@1728,0%@1728,0%@1728] 
EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1019] 
NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 
cpu@50.5C soc2@49.718C soc0@50.656C gpu@49.937C tj@51.031C soc1@51.031C 
VDD_IN 7183mW/7183mW VDD_CPU_GPU_CV 1598mW/1598mW VDD_SOC 2650mW/2650mW

You may want to try running the sudo jetson_clocks script to disable DVFS (it will lock the clocks to the max for the given power profile, in this case MAXN).
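
Something like this, from memory (check jetson_clocks --help for the exact options on your release):

sudo jetson_clocks --show     # print the current CPU/GPU/EMC clocks and governor state
sudo jetson_clocks --store    # optionally save the current settings so they can be restored later
sudo jetson_clocks            # lock clocks to the max for the active nvpmodel profile
sudo tegrastats               # GR3D_FREQ / EMC_FREQ should now stay pinned at their max frequencies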

Also, we typically use NVMe, with the containers and models stored on NVMe (like here).
In the benchmarks, the first prompt of each model is discarded as a warm-up, which should avoid disk I/O effects after that, but just checking.
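
If you want to double-check where the containers and models actually live, something like this will show it (adjust to your layout):

lsblk                                              # is an NVMe drive present and mounted?
df -h /var/lib/docker                              # which disk backs the default docker data dir
docker info 2>/dev/null | grep 'Docker Root Dir'   # the data-root docker is actually configured with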

Also, you might want to check the kernel logs (sudo dmesg) to see if any errors are being reported that might be limiting the board from reaching full performance.
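
For example, filtering for the usual suspects (the pattern is just a starting point):

sudo dmesg | grep -iE 'throttl|thermal|oom|fail|error'   # look for throttling, thermal, OOM or driver errors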

This is a pure guess, as I don't work with the packages you are testing. I did notice that CUDA is not active in OpenCV. If those packages rely on CUDA and it is not configured properly, they might be falling back to CPU-only when they need CUDA and cannot find it.
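
A quick way to check is to ask OpenCV directly (if I remember the Python API right):

python3 -c "import cv2; print(cv2.getBuildInformation())" | grep -i cuda   # CUDA lines in the build info say YES/NO
python3 -c "import cv2; print(cv2.cuda.getCudaEnabledDeviceCount())"       # 0 means no CUDA-capable build/device found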

Hello @dusty_nv, thank you for your answer. Indeed, I noticed a difference in my tegrastats output compared to yours, so I ran jetson_clocks and saw a considerable improvement in the performance of some models.
The new benchmark results (decode rate in tokens/sec) are:

┌────────────────────────────────────┬────────────────────┐
│             model_name             │    decode_rate     │
├────────────────────────────────────┼────────────────────┤
│ Llama-3.1-8B-Instruct-q4f16_ft-MLC │ 18.424824685772936 │
│ Llama-3.2-3B-Instruct-q4f16_ft-MLC │ 39.943680612648755 │
│ Qwen2.5-7B-Instruct-q4f16_ft-MLC   │ 16.259322873645942 │
│ gemma-2-2b-it-q4f16_1-MLC          │  34.91145609786777 │
│ gemma-2-9b-it-q4f16_1-MLC          │   9.47551773113478 │
│ Phi-3.5-mini-instruct-q4f16_ft-MLC │ 28.962488445126468 │
│ SmolLM2-1.7B-Instruct-q4f16_ft-MLC │  64.27114472726157 │
└────────────────────────────────────┴────────────────────┘

The most noticeable changes were in the Llama and SmolLM models, bringing them more in line with the Super results. There wasn't much improvement in Qwen and Phi, but I could get a small bump by increasing MAX_NUM_PROMPTS in the benchmark script from 4 to 10.
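
(If anyone wants to do the same, something like the line below would change it, assuming your local copy of the script is named benchmark.py and the assignment literally reads MAX_NUM_PROMPTS = 4; both are assumptions on my part:)

sed -i 's/MAX_NUM_PROMPTS = 4/MAX_NUM_PROMPTS = 10/' benchmark.py   # hypothetical filename, adjust to your copy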

Thank you for your reply @foxsquirrel1. The script I'm using runs LLMs to benchmark performance, so OpenCV should be irrelevant here. I'll make sure to try building it with CUDA support for my computer vision applications, though.
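
(Roughly the configure step I have in mind for that; the flags are standard OpenCV CMake options, CUDA_ARCH_BIN=8.7 matches the SM version jtop reports, and the source/contrib paths are placeholders:)

# sketch of an OpenCV CUDA build configure, run from a separate build directory
cmake -D CMAKE_BUILD_TYPE=Release \
      -D WITH_CUDA=ON \
      -D CUDA_ARCH_BIN=8.7 \
      -D OPENCV_EXTRA_MODULES_PATH=../opencv_contrib/modules \
      ../opencv
make -j$(nproc)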

Thanks for letting me know, @mohamed.alsalti, that using jetson_clocks improved the performance - I will note it in the directions. I had not been needing it, while some others had been, and given what you said about increasing the number of prompts, I think that helps with it too.

Currently I am going back over the benchmarks to include other APIs as measured at the endpoint level (i.e. including the request over the OpenAI protocol, etc.). That won't be raw tokens/sec, but it will stand in for real-world usage if you are using these as microservices.
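
As a rough illustration of what I mean by endpoint level, the measurement wraps a normal OpenAI-style request like this one (the host, port and model name here are placeholders, not the actual benchmark harness):

curl http://localhost:9000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Write a haiku about GPUs"}], "max_tokens": 128}'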
