AI Models That Run on Jetson Orin Nano Super (8GB) — A Practical Guide

AI Models That Run on Jetson Orin Nano Super (8GB) — A Practical Guide

Are you looking to run AI models on your NVIDIA Jetson Orin Nano Super (8GB)? This guide covers tested models across LLMs, VLMs, Speech, and Agent frameworks — all fitting within the 8GB memory budget.


Hardware & Software Setup

  • Board: NVIDIA Jetson Orin Nano (8GB)
  • OS: JetPack 6.x / Ubuntu 22.04 (ships with Python 3.10)

Installing GPU-Accelerated Packages

NVIDIA provides pre-built Python wheels optimized for Jetson via a dedicated PyPI index:

pip install <package> --extra-index-url https://pypi.jetson-ai-lab.io/jp6/cu126

Example — install ONNX Runtime with GPU support:

pip install onnxruntime-gpu --extra-index-url https://pypi.jetson-ai-lab.io/jp6/cu126

Docker Containers

Ready-to-use Docker images (llama.cpp, etc.) are available from NVIDIA-AI-IOT packages. Look for the latest-jetson-orin tag.


Inference Engines

Two open-source inference engines work well on the Orin Nano:

Engine Description Links
llama.cpp Lightweight C++ inference with OpenAI-compatible API. Runs GGUF models. GitHub · Server Docs
TensorRT-Edge-LLM NVIDIA’s high-performance C++ runtime optimized for Jetson and DRIVE. [GitHub]( GitHub - NVIDIA/TensorRT-Edge-LLM: High-performance, light-weight C++ LLM and VLM Inference Software for Physical AI · GitHub) · [Developer Guide] TensorRT Edge-LLM Documentation — TensorRT Edge-LLM

Pre-built llama.cpp Docker container for Orin Nano:

ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin

llama.cpp Model Format & Sizing

By using the llama.cpp with Q4_K GGUF quantization format, you can fit within the Orin Nano’s 8GB memory:

  • LLMs up to ~10B parameters
  • VLMs up to ~4B parameters

For the full list of tested models, see the Jetson AI Lab and the Jetson AI Lab Models page.


Vision-Language Models (VLMs)

These models process both images and text — useful for camera-based applications.

Model Params HuggingFace GGUF
LFM2-VL 1.6B 1.6B LiquidAI/LFM2-VL-1.6B-GGUF
Cosmos Reason 2 2B 2B Kbenkhaled/Cosmos-Reason2-2B-GGUF
Qwen 3 VL 2B 2B ggml-org/Qwen3-VL-2B-Instruct-GGUF
SmolVLM2 2.2B 2.2B ggml-org/SmolVLM2-2.2B-Instruct-GGUF
Granite Vision 3.2 2B 2B bartowski/ibm-granite_granite-vision-3.2-2b-GGUF
Gemma 3 4B VLM 4B bartowski/google_gemma-3-4b-it-GGUF
Qwen 3.5 VL 2B 2B bartowski/Qwen_Qwen3-VL-2B-Instruct-GGUF

Language Models (LLMs)

Model Params HuggingFace GGUF
Gemma 3 1B 1B ggml-org/gemma-3-1b-it-GGUF
Qwen 3 1.7B 1.7B bartowski/Qwen_Qwen3-1.7B-GGUF
Gemma 3 4B 4B ggml-org/gemma-3-4b-it-GGUF
Nemotron-3-Nano-4B 4B nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
Qwen 3 4B 4B bartowski/Qwen_Qwen3-4B-GGUF
Qwen 3 8B 8B bartowski/Qwen_Qwen3-8B-GGUF

:bulb: Many more LLMs in the 1–10B range are available on HuggingFace in GGUF format and will work with llama.cpp on Orin Nano.


Speech Models

Model Type Description Link
Faster Whisper ASR GPU-accelerated speech-to-text (CTranslate2). Models: tiny.en, base.en, small.en GitHub
Moonshine ASR Fast Edge ASR. Models: tiny (27M), base (61M) GitHub
Kokoro TTS TTS Natural-sounding, GPU-accelerated (ONNX). ~82M params GitHub
Piper TTS TTS Lightweight TTS engine GitHub

Agent Frameworks

OpenClaw is an open-source personal AI agent framework. It can use up to ~1GB RAM, and people have run it successfully on Orin Nano.

For lighter alternatives, consider these edge-optimized options:

Agent RAM Language Link
OpenClaw ~1 GB TypeScript openclaw/openclaw
Nanobot ~100 MB Python HKUDS/nanobot
PicoClaw <10 MB Go sipeed/picoclaw

Memory Fit Guide (Orin Nano 8GB)

Model Size Quantization Approx. RAM Fits alongside STT + TTS?
1–2B Q8_0 ~2–3 GB :white_check_mark: Yes, plenty of room
3–4B Q4_K_M ~3–4 GB :white_check_mark: Yes
7–8B Q4_K_M ~5–6 GB :warning: Yes, with NVMe swap

Tip: Start with a smaller model to get your pipeline working, then scale up if needed.


Have questions or want to share your experience? Drop a comment below!

First of all thanks a lot for the guide.

However, I am confused about the compatibility of TensorRT-Edge-LLM on the Orin Nano 8GB. It was listed in the original post as “work well on the Orin Nano”, but

  1. both the GitHub and the Developer Guide links are for TensorRT-LLM, not TensorRT-Edge-LLM,
  2. TensorRT-LLM is not officially supported on Jetson (as evident here and here), and TensorRT-Edge-LLM apparently only officially supports Jetson Thor with JetPack 7.1 - the support on Jetson Orin with JetPack 6.2.x is only experimental (source).

Thus, I’m not sure where “work well” came from, considering it’s not even officially supported, and there are forum posts that say the build process is extremely painful and inference is slower than expected.

Are there any tutorials for using TensorRT-(Edge-)LLM on Jeston Orin Nano Super 8GB or case studies on its performance? I’m trying to maximize the inference speed on my Orin Nano, so any tips would be greatly appreciated.

Hi @zexyg ,

Thanks for the info. I’ve corrected the links in the post to point to the right project. Correct, TensorRT-LLM (the datacenter variant) does not support Jetson. That’s exactly why NVIDIA built TensorRT-Edge-LLM — a separate, purpose-built inference engine specifically for edge devices like Jetson and DRIVE.

JetPack 6.2 compatibility was officially added in TensorRT-Edge-LLM v0.5.0 (roadmap), so it does run on Orin.

That said, the Orin experience is still maturing. All the models listed are more for llama.cpp, maximizing inference speed on Orin Nano.

Regarding the build issues people have reported, TensorRT-Edge-LLM has been improving rapidly with monthly releases (now at v0.6.0, March 2026). It supports a growing list of models including Qwen3/3.5 (Dense + VL + ASR +
TTS), Llama 3.x, Nemotron-3-Nano-4B, Phi-4-multimodal, and InternVL3. The project is becoming much more feasible for edge deployment than it was even a few months ago.

Hope this helps, please fell free to share if any other questions. Apricate your feedback!

Regarding LLMs on this system, the main problem right now is that, by default, the default ubuntu distro that nvidia provides (sdcard.img) eats an absurd amount of memory.

Even disabling zram (systemctl disable nvzramconfig.service; swapoff -a -v), setting up multi-user target (init 3; systemctl set-default multi-user.target) and docker (systemctl stop docker.socket; systemctl stop docker.service; systemctl stop containerd.service), you will have at least ~512MB of memory usage (a barebones debian distro eats less than a third of that)

Docker takes around ~60MB+ of sysram as well (it would be better to run code baremetal instead of containerized)

To be honest, I never understood why NV didn’t used armbian instead of this behemoth; on embedded devices resources ARE limited.

My suggestion for these kind of taks is to build a lightweight minimalist image instead, ie:

// BTW, GNOME -the default desktop- eats over 600MB of sysram.
// If anyone needs wayland/x, just use another desktop environment.

Hi @VSeras , Thanks for your feedback. This project looks super cool.

We have written a Tech blog which explains optimization strategies to help developers maximize performance, efficiency, and capability on Jetson. It has guide to memory optimization at different stacks :
Maximizing Memory Efficiency to Run Bigger Models on NVIDIA Jetson | NVIDIA Technical Blog

The main problem that I see is that the system should had been a pure DEBIAN from the start, all of these things are just workarounds.

Choosing Ubuntu was the worst decision Nvidia could have made, why?

  • Has advertisements even on apt (“hey, do you want to update these packages? pay us/give us telemetry!”), adding three unneeded services just for that:
    • ubuntu-advantage
    • apt-news
    • esm-cache
  • Tries to force you to use snap to install software (¿?)
    • Another memory hog, snapd
  • By default, even thunderbird is installed.
  • “man” has been hijacked by a script (“unminimize”) that not only installs man, but a ton of unneeded extra junk.
  • Uses network manager by default. Nothing more to say about this part 🤣

I was able to decrease memory usage/base footprint to around 256MB from the base install, but it was something harder than it should if the system were something logical from the beginning;

The problem is that all these modifications aren’t a good thing for rookies, and at the same time, basically mandatory to be able to use the system for literally anything.

Sometimes, “less is more”, as Mies van der Rohe said.

// You know, a normal barebones Debian install takes around ~100MB of memory usage (without nv modules) and, the most important thing, boots in just a few seconds. Which is the important part.