I’m currently running two Jetson Orin NX 16GB devices for LLM inference workloads (no training). Until now I’ve been running Ollama on bare metal, but based on discussions in the DGX Spark forum, I’m reconsidering my runtime choices — particularly for better performance and lower overhead.
For the Orin NX specifically, I have the following questions:
1. Is llama.cpp currently the recommended inference engine on Orin NX?
My workload is:
- Inference only
- API-based usage (no interactive UI)
- Mostly quantized models (8B range)
- No concurrency
If llama.cpp is the preferred direction, are there specific CUDA build flags or optimizations that are considered best practice for Orin NX?
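For context, this is roughly the build I’d attempt based on the current llama.cpp CMake layout — please correct me if the flags or the SM architecture value are off (87 is my understanding of Orin’s compute capability):

```shell
# Sketch of a CUDA build of llama.cpp targeting Orin NX (SM 8.7).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# GGML_CUDA enables the CUDA backend; CMAKE_CUDA_ARCHITECTURES=87
# restricts codegen to Orin's architecture to keep the binary lean.
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=87 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```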
2. Bare metal vs NVIDIA Docker container?
I see NVIDIA provides containerized environments, but in my experience:
- Container releases can lag behind upstream versions
- Containers often include additional components (interactive tools, dev extras) that increase memory usage
Since I only need a lightweight API server and want to minimize RAM footprint, I’m trying to understand:
- Is bare-metal compilation of llama.cpp generally preferred on Orin NX?
- Or is the NVIDIA container approach better from a CUDA / driver compatibility standpoint?
- Are there measurable performance differences between the two?
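If the container route is the answer, the invocation I have in mind looks something like the following — the image name is a placeholder, and I’m assuming `--runtime nvidia` is still the right way to expose the GPU on JetPack:

```shell
# Hypothetical container run on Jetson (image name is a placeholder).
# --runtime nvidia selects the NVIDIA container runtime so the GPU is visible;
# host networking keeps the API server reachable without port mapping.
docker run --rm \
  --runtime nvidia \
  --network host \
  -v /data/models:/models \
  some-l4t-llama-image \
  llama-server -m /models/model.gguf --host 0.0.0.0 --port 8080
```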
3. Production API pattern
For those running llama.cpp in production-like scenarios on Orin:
- Are you using the built-in HTTP server?
- Wrapping it behind another service?
- Using CUDA graphs or batching features?
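For reference, the usage pattern I’m after is just hitting the OpenAI-compatible endpoint that llama-server exposes — something like this (port and payload are illustrative):

```shell
# Illustrative API call against a locally running llama-server instance.
# The /v1/chat/completions route is llama-server's OpenAI-compatible endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```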
My goal is to keep the runtime minimal and efficient — no UI, no unnecessary services — just an API endpoint with predictable latency and good GPU utilization.
Any guidance from people running inference long-term on Orin NX would be greatly appreciated.
Thanks!