I’m currently running two Jetson Orin NX 16GB devices for LLM inference workloads (no training). Until now I’ve been running Ollama on bare metal, but based on discussions in the DGX Spark forum, I’m reconsidering my runtime choices — particularly for better performance and lower overhead.
For the Orin NX specifically, I have the following questions:
1. Is llama.cpp currently the recommended inference engine on Orin NX?
My workload is:
- Inference only
- API-based usage (no interactive UI)
- Mostly quantized models (8B range)
- No concurrency
If llama.cpp is the preferred direction, are there specific CUDA build flags or optimizations that are considered best practice for Orin NX?
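For context, this is roughly the build I’d attempt based on the current llama.cpp CMake layout — please correct me if the flags or the SM architecture value are off (87 is my understanding of Orin’s compute capability):

```shell
# Sketch of a CUDA build of llama.cpp targeting Orin NX (SM 8.7).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# GGML_CUDA enables the CUDA backend; CMAKE_CUDA_ARCHITECTURES=87
# restricts codegen to Orin's architecture to keep the binary lean.
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=87 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```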
2. Bare metal vs NVIDIA Docker container?
I see NVIDIA provides containerized environments, but in my experience:
- Container releases can lag behind upstream versions
- Containers often include additional components (interactive tools, dev extras) that increase memory usage
Since I only need a lightweight API server and want to minimize RAM footprint, I’m trying to understand:
- Is bare-metal compilation of llama.cpp generally preferred on Orin NX?
- Or is the NVIDIA container approach better from a CUDA / driver compatibility standpoint?
- Are there measurable performance differences between the two?
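If the container route is the answer, the invocation I have in mind looks something like the following — the image name is a placeholder, and I’m assuming `--runtime nvidia` is still the right way to expose the GPU on JetPack:

```shell
# Hypothetical container run on Jetson (image name is a placeholder).
# --runtime nvidia selects the NVIDIA container runtime so the GPU is visible;
# host networking keeps the API server reachable without port mapping.
docker run --rm \
  --runtime nvidia \
  --network host \
  -v /data/models:/models \
  some-l4t-llama-image \
  llama-server -m /models/model.gguf --host 0.0.0.0 --port 8080
```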
3. Production API pattern
For those running llama.cpp in production-like scenarios on Orin:
- Are you using the built-in HTTP server?
- Wrapping it behind another service?
- Using CUDA graphs or batching features?
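For reference, the usage pattern I’m after is just hitting the OpenAI-compatible endpoint that llama-server exposes — something like this (port and payload are illustrative):

```shell
# Illustrative API call against a locally running llama-server instance.
# The /v1/chat/completions route is llama-server's OpenAI-compatible endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```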
My goal is to keep the runtime minimal and efficient — no UI, no unnecessary services — just an API endpoint with predictable latency and good GPU utilization.
Any guidance from people running inference long-term on Orin NX would be greatly appreciated.
Thanks!