AI Models That Run on Jetson Orin Nano Super (8GB) — A Practical Guide
Are you looking to run AI models on your NVIDIA Jetson Orin Nano Super (8GB)? This guide covers tested models across LLMs, VLMs, Speech, and Agent frameworks — all fitting within the 8GB memory budget.
Hardware & Software Setup
- Board: NVIDIA Jetson Orin Nano (8GB)
- OS: JetPack 6.x / Ubuntu 22.04 (ships with Python 3.10)
Installing GPU-Accelerated Packages
NVIDIA provides pre-built Python wheels optimized for Jetson via a dedicated PyPI index:
pip install <package> --extra-index-url https://pypi.jetson-ai-lab.io/jp6/cu126
Example — install ONNX Runtime with GPU support:
pip install onnxruntime-gpu --extra-index-url https://pypi.jetson-ai-lab.io/jp6/cu126
Docker Containers
Ready-to-use Docker images (llama.cpp, etc.) are available from NVIDIA-AI-IOT packages. Look for the latest-jetson-orin tag.
Inference Engines
Two open-source inference engines work well on the Orin Nano:
| Engine | Description | Links |
|---|---|---|
| llama.cpp | Lightweight C++ inference with OpenAI-compatible API. Runs GGUF models. | GitHub · Server Docs |
| TensorRT-Edge-LLM | NVIDIA’s high-performance C++ runtime optimized for Jetson and DRIVE. | [GitHub]( GitHub - NVIDIA/TensorRT-Edge-LLM: High-performance, light-weight C++ LLM and VLM Inference Software for Physical AI · GitHub) · [Developer Guide] TensorRT Edge-LLM Documentation — TensorRT Edge-LLM |
Pre-built llama.cpp Docker container for Orin Nano:
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin
llama.cpp Model Format & Sizing
By using the llama.cpp with Q4_K GGUF quantization format, you can fit within the Orin Nano’s 8GB memory:
- LLMs up to ~10B parameters
- VLMs up to ~4B parameters
For the full list of tested models, see the Jetson AI Lab and the Jetson AI Lab Models page.
Vision-Language Models (VLMs)
These models process both images and text — useful for camera-based applications.
| Model | Params | HuggingFace GGUF |
|---|---|---|
| LFM2-VL 1.6B | 1.6B | LiquidAI/LFM2-VL-1.6B-GGUF |
| Cosmos Reason 2 2B | 2B | Kbenkhaled/Cosmos-Reason2-2B-GGUF |
| Qwen 3 VL 2B | 2B | ggml-org/Qwen3-VL-2B-Instruct-GGUF |
| SmolVLM2 2.2B | 2.2B | ggml-org/SmolVLM2-2.2B-Instruct-GGUF |
| Granite Vision 3.2 2B | 2B | bartowski/ibm-granite_granite-vision-3.2-2b-GGUF |
| Gemma 3 4B VLM | 4B | bartowski/google_gemma-3-4b-it-GGUF |
| Qwen 3.5 VL 2B | 2B | bartowski/Qwen_Qwen3-VL-2B-Instruct-GGUF |
Language Models (LLMs)
| Model | Params | HuggingFace GGUF |
|---|---|---|
| Gemma 3 1B | 1B | ggml-org/gemma-3-1b-it-GGUF |
| Qwen 3 1.7B | 1.7B | bartowski/Qwen_Qwen3-1.7B-GGUF |
| Gemma 3 4B | 4B | ggml-org/gemma-3-4b-it-GGUF |
| Nemotron-3-Nano-4B | 4B | nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF |
| Qwen 3 4B | 4B | bartowski/Qwen_Qwen3-4B-GGUF |
| Qwen 3 8B | 8B | bartowski/Qwen_Qwen3-8B-GGUF |
:bulb: Many more LLMs in the 1–10B range are available on HuggingFace in GGUF format and will work with llama.cpp on Orin Nano.
Speech Models
| Model | Type | Description | Link |
|---|---|---|---|
| Faster Whisper | ASR | GPU-accelerated speech-to-text (CTranslate2). Models: tiny.en, base.en, small.en |
GitHub |
| Moonshine | ASR | Fast Edge ASR. Models: tiny (27M), base (61M) |
GitHub |
| Kokoro TTS | TTS | Natural-sounding, GPU-accelerated (ONNX). ~82M params | GitHub |
| Piper TTS | TTS | Lightweight TTS engine | GitHub |
Agent Frameworks
OpenClaw is an open-source personal AI agent framework. It can use up to ~1GB RAM, and people have run it successfully on Orin Nano.
For lighter alternatives, consider these edge-optimized options:
| Agent | RAM | Language | Link |
|---|---|---|---|
| OpenClaw | ~1 GB | TypeScript | openclaw/openclaw |
| Nanobot | ~100 MB | Python | HKUDS/nanobot |
| PicoClaw | <10 MB | Go | sipeed/picoclaw |
Memory Fit Guide (Orin Nano 8GB)
| Model Size | Quantization | Approx. RAM | Fits alongside STT + TTS? |
|---|---|---|---|
| 1–2B | Q8_0 | ~2–3 GB | :white_check_mark: Yes, plenty of room |
| 3–4B | Q4_K_M | ~3–4 GB | :white_check_mark: Yes |
| 7–8B | Q4_K_M | ~5–6 GB | :warning: Yes, with NVMe swap |
Tip: Start with a smaller model to get your pipeline working, then scale up if needed.
Have questions or want to share your experience? Drop a comment below!