At QuanTuring, we build on-premise AI systems for semiconductor manufacturers in Taiwan — air-gapped, compliance-sensitive, no cloud dependency. When we started evaluating inference engines for production deployment, we wanted hard numbers, not marketing claims.
So we ran two controlled experiments on a single RTX 5090:
- E2 (NIM vs Ollama): 100 questions × 3 rounds × 2 engines = 600 data points
- E3 (NIM + NeMo Guardrails overhead): 45 questions × 3 rounds × 2 modes = 270 data points
Same model (Llama 3.1 8B Instruct), same GPU, same questions. Only variable: the inference stack.
E2: NIM vs Ollama — The Speed Gap
Results (600 data points)
| Metric | NIM 1.13.1 | Ollama (llama.cpp) | Speedup |
|---|---|---|---|
| Avg TPS | 73.8 tok/s | 10.1 tok/s | 7.3x |
| Avg TTFT | 221 ms | 2,876 ms | 13x |
| Avg Total Latency | 4.3 s | 27.7 s | 6.5x |
| VRAM | 31,768 MB | 31,772 MB | same |
The VRAM being nearly identical despite different precision (NIM uses BF16, Ollama uses Q4 GGUF) was the first surprise. The second: 13x faster time-to-first-token is not a throughput trick — it’s a fundamentally different user experience.
Why TTFT Matters More Than TPS
Most benchmarks focus on throughput (tokens/second). But for interactive applications — internal chatbots, engineering assistants, document Q&A — TTFT is what users actually feel:
- Ollama: User submits query → waits 2.9 seconds → first word appears → waits another 25 seconds for full response
- NIM: User submits query → waits 0.2 seconds → first word appears → 4 seconds for full response
At 221ms TTFT, NIM feels instant. At 2,876ms, Ollama feels like it’s thinking.
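To make the two metrics concrete, here is a minimal sketch of how TTFT and TPS can be derived from the arrival times of streamed tokens. The function and field names are hypothetical, and it assumes TPS is measured over the whole request (including the wait for the first token); the benchmark scripts may define it differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_ms: float  # time from request start to first token, in milliseconds
    total_s: float  # time from request start to last token, in seconds
    tps: float      # generated tokens per second over the whole request


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Compute TTFT, total latency and tokens/second from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft_ms = (token_times[0] - request_start) * 1000.0
    total_s = token_times[-1] - request_start
    tps = len(token_times) / total_s
    return StreamMetrics(ttft_ms, total_s, tps)


# Illustrative timestamps only (not benchmark data): 4 tokens, first at 0.25 s.
example = stream_metrics(0.0, [0.25, 0.5, 0.75, 1.0])
print(example)
```

Because TTFT is a fixed cost paid before any token appears, a short answer on a slow-TTFT engine can feel worse than a long answer on a fast one, even when TPS looks acceptable.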
Results by Category
| Category | NIM TPS | Ollama TPS | TPS Speedup | NIM TTFT | Ollama TTFT |
|---|---|---|---|---|---|
| factual_short | 63.4 | 9.6 | 6.6x | 225ms | 2,919ms |
| explanation | 80.8 | 11.4 | 7.1x | 224ms | 2,894ms |
| multilingual (EN/ZH/JA/KO) | 79.1 | 9.8 | 8.1x | 229ms | 2,919ms |
| technical | 77.0 | 11.2 | 6.9x | 213ms | 2,978ms |
| rag_simulation | 68.7 | 8.8 | 7.8x | 214ms | 2,668ms |
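Key finding: Multilingual queries (English, Chinese, Japanese, Korean) showed the largest TPS gap at 8.1x. Both engines serve the same model with the same vocabulary, so the wider gap comes from the inference stacks (NIM with BF16, Ollama with Q4 GGUF on llama.cpp) rather than from tokenization itself; this benchmark does not isolate the exact cause. For Asia-Pacific enterprise deployment, this matters.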
E3: NeMo Guardrails — Safety Without Sacrifice
The natural follow-up: if you add safety rails to protect that fast inference, how much performance do you give back?
Results (270 data points)
| Metric | Value |
|---|---|
| Clean question overhead | +123 ms (+2.1%) |
| Edge case overhead | +1 ms (~0%) |
| Adversarial detection rate | 93.3% (42/45 blocked) |
| False positive rate | 0% (0/90 clean+edge blocked) |
| Blocked request latency | 94 ms (vs 707 ms without guardrails) |
Per-Category Breakdown
| Category | NIM-only | NIM+Guardrails | Overhead | Result |
|---|---|---|---|---|
| Clean tech questions (20) | 5,765 ms | 5,888 ms | +123ms (+2.1%) | 100% pass |
| Edge cases — security education (10) | 5,958 ms | 5,959 ms | +1ms (~0%) | 100% pass |
| Adversarial inputs (15) | 707 ms | 301 ms | −406ms (−57%) | 93.3% blocked |
The adversarial row is the most interesting: blocked requests respond in 94ms — 7.5x faster than letting the unprotected LLM generate a 700ms refusal. Guardrails don’t just add safety — they save GPU time on adversarial traffic.
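The headline numbers in the two tables follow directly from the counts and average latencies reported above. The sketch below recomputes them; the helper names are ours, and the inputs are the figures from the tables.

```python
from typing import Tuple


def rate_pct(hits: int, total: int) -> float:
    """Percentage of `hits` out of `total` requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


def overhead(baseline_ms: float, guarded_ms: float) -> Tuple[float, float]:
    """Absolute (ms) and relative (%) latency change when guardrails are enabled."""
    delta = guarded_ms - baseline_ms
    return delta, 100.0 * delta / baseline_ms


detection = rate_pct(42, 45)                   # adversarial: 15 questions x 3 rounds
false_positive = rate_pct(0, 90)               # clean + edge: (20 + 10) x 3 rounds
clean_ms, clean_pct = overhead(5765, 5888)     # clean tech questions
adv_ms, adv_pct = overhead(707, 301)           # adversarial inputs (mixed outcomes)
blocked_speedup = 707 / 94                     # refusal latency vs blocked latency

print(round(detection, 1), round(false_positive, 1))   # 93.3 0.0
print(clean_ms, round(clean_pct, 1))                   # 123 2.1
print(adv_ms, round(adv_pct, 1))                       # -406 -57.4
print(round(blocked_speedup, 1))                       # 7.5
```

The 301 ms adversarial average is higher than the 94 ms blocked latency because it also includes the three requests that were not blocked.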
The Insight Most People Miss
Each guardrail check is basically a 3-token LLM call (“yes” or “no”). The cost of that call is dominated by TTFT, not token generation.
| Engine | TTFT | Guardrail check cost | Overhead on full response |
|---|---|---|---|
| NIM | 221 ms | ~50 ms | +2.1% (imperceptible) |
| Ollama | 2,876 ms | ~3,000 ms | +21% (projected, unusable) |
On Ollama, adding two guardrail checks (input + output) would add ~6 seconds to every request — most teams would disable safety to preserve usability.
On NIM, the same two checks add ~100ms. NIM doesn’t just make AI faster. It makes enterprise safety practically free.
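As a back-of-the-envelope check, the overhead can be modelled as the number of checks times the per-check cost, divided by the baseline latency. This is a simplified model of our own, not something either engine reports, and the Ollama row is a projection from its TTFT rather than a measurement.

```python
from typing import Tuple


def guardrail_overhead(base_latency_ms: float, check_cost_ms: float,
                       n_checks: int = 2) -> Tuple[float, float]:
    """Return (added latency in ms, added latency as % of baseline) for n_checks rail calls."""
    added_ms = n_checks * check_cost_ms
    return added_ms, 100.0 * added_ms / base_latency_ms


# Inputs taken from the tables above: average latency and approximate check cost.
nim_added_ms, nim_pct = guardrail_overhead(4300, 50)
ollama_added_ms, ollama_pct = guardrail_overhead(27700, 3000)

print(nim_added_ms, round(nim_pct, 1))        # 100 2.3
print(ollama_added_ms, round(ollama_pct, 1))  # 6000 21.7
```

The model lands close to the measured +2.1% on NIM, which suggests the per-check estimate of roughly 50 ms is in the right range.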
Self-Check Architecture
User Input
→ [Input Rail] NIM self-check (~50ms, max 3 tokens: "yes"/"no")
→ "yes" = BLOCK (94ms total, no main inference)
→ "no" = ALLOW → NIM inference (~5.8s) → [Output Rail] → Response
The same Llama 3.1 8B model handles both inference AND safety checking. No external API, no cloud dependency.
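The flow above can be sketched as plain control flow. This is an illustration only, not the NeMo Guardrails API: `llm` is a hypothetical callable standing in for a NIM completion request, and the prompts, token limits, and refusal text are assumptions.

```python
from typing import Callable

# Hypothetical stand-in for a NIM completion call: (prompt, max_tokens) -> text.
LLM = Callable[[str, int], str]

REFUSAL = "I'm sorry, I can't help with that."


def is_blocked(answer: str) -> bool:
    """Self-check convention from the diagram: "yes" means block, "no" means allow."""
    return answer.strip().lower().startswith("yes")


def guarded_generate(user_input: str, llm: LLM) -> str:
    # Input rail: a short yes/no call before any main inference.
    verdict = llm(f"Should this user message be blocked? Answer yes or no.\n\n{user_input}", 3)
    if is_blocked(verdict):
        return REFUSAL  # main inference is skipped entirely

    # Main inference, only reached for allowed inputs.
    response = llm(user_input, 512)

    # Output rail: the same model checks its own response.
    verdict = llm(f"Should this bot response be blocked? Answer yes or no.\n\n{response}", 3)
    return REFUSAL if is_blocked(verdict) else response
```

A blocked request costs one short call, which is why its latency is close to a single TTFT rather than a full generation.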
The Combined Stack
| Stack | Avg Latency | Enterprise Safety | Air-Gap Ready |
|---|---|---|---|
| Ollama (no guardrails) | ~27,700 ms | None | Yes |
| Ollama + Guardrails | ~33,700 ms (+21%, projected) | Yes | Yes |
| NIM (no guardrails) | ~4,300 ms | None | Yes |
| NIM + Guardrails | ~4,400 ms (+2.1%) | Yes | Yes |
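NIM is fast enough that guardrails become a rounding error. The +2.1% overhead is imperceptible to users, but the protection is real. Every blocked adversarial request also frees GPU time: about 0.6 seconds compared with the 707 ms refusal the unprotected model produced in E3, and up to ~5.7 seconds when the model would otherwise have generated a full-length response.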
Getting NIM Running on RTX 5090 — Deployment Gotchas
RTX 5090 is a consumer Blackwell GPU (sm_120), and the NIM ecosystem is primarily validated on data center hardware. Here’s the obstacle course:
Issue 1: NIM latest requires CUDA 13.0
The RTX 5090 with driver 577.00 supports CUDA 12.9. NIM ≥1.15.0 requires CUDA 13.0 and fails immediately. Solution: use NIM 1.13.1, the last version with a CUDA 12.x requirement.
Issue 2: TensorRT-LLM profile hangs on sm_120
NIM auto-selected a TensorRT-LLM profile for the RTX 5090. The container started but froze at 0% CPU and 0MB memory, with no error and no progress. Root cause: TRT-LLM profile compilation for sm_120 was not stable. Solution: force the vLLM profile by passing the exact profile hash via NIM_MODEL_PROFILE.
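As a rough sketch of what forcing the profile looks like, the helper below assembles a `docker run` command with the environment variable set. The helper name, image tag, port mapping, and placeholder values are assumptions for illustration; the actual profile hash is not given here and stays a placeholder. See DEPLOYMENT_NOTES.md for the exact commands.

```python
import shlex
from typing import Dict, List


def nim_docker_command(image: str, env: Dict[str, str], port: int = 8000) -> List[str]:
    """Build (but do not run) a docker command that starts a NIM container with env overrides."""
    cmd = ["docker", "run", "--rm", "--gpus", "all", "-p", f"{port}:8000"]
    for key, value in env.items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    return cmd


cmd = nim_docker_command(
    image="nvcr.io/nim/meta/llama-3.1-8b-instruct:1.13.1",  # assumed image name and tag
    env={
        "NGC_API_KEY": "<your-ngc-api-key>",          # placeholder
        "NIM_MODEL_PROFILE": "<vllm-profile-hash>",   # placeholder for the exact hash
    },
)
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to log the exact command that produced a given benchmark run.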
Issue 3: KV cache OOM with default max_model_len
NIM defaulted to max_model_len=131072 (128K context), which needs 16GB of KV cache alone. After loading BF16 weights (~16GB), only ~10.6GB remained. Solution: NIM_MAX_MODEL_LEN=8192 cuts KV cache to ~1.5GB while still handling enterprise workloads.
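The 16GB figure can be reproduced from the model architecture. The estimate below assumes the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) with BF16 cache entries and a single full-length sequence; it ignores any engine-side overhead, so treat it as a lower bound.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer and KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = key + value
    return per_token * seq_len


full_context_gib = kv_cache_bytes(131072) / GIB
short_context_gib = kv_cache_bytes(8192) / GIB

print(full_context_gib)   # 16.0
print(short_context_gib)  # 1.0
```

At 8192 tokens the raw estimate is 1 GiB per sequence, a little under the ~1.5GB quoted above, which is consistent with this calculation leaving out overhead.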
Issue 4: NeMo Guardrails is_content_safe parser inversion
The built-in output parser uses inverted yes/no logic: when the LLM answers “yes”, the content is treated as unsafe and blocked. If you write a prompt asking “Does this comply with policy?” and the LLM correctly says “yes”, it gets blocked. The fix: write prompts asking about violations, not compliance.
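The pitfall is easier to see with the convention written out. The function below is a simplified stand-in for the behaviour described above, not the library's actual parser, and the two prompts are hypothetical.

```python
def blocks_on_yes(llm_answer: str) -> bool:
    """Simplified stand-in for the parser convention: a "yes" answer means unsafe, so block."""
    return llm_answer.strip().lower().startswith("yes")


# Two ways to phrase the same self-check for a harmless message.
COMPLIANCE_PROMPT = "Does this message comply with policy? Answer yes or no."
VIOLATION_PROMPT = "Does this message violate policy? Answer yes or no."

# What a well-behaved model would answer for a harmless message.
answer_to_compliance = "yes"  # it does comply
answer_to_violation = "no"    # it does not violate

wrongly_blocked = blocks_on_yes(answer_to_compliance)   # True: safe message gets blocked
correctly_allowed = not blocks_on_yes(answer_to_violation)  # True: safe message passes

print(wrongly_blocked, correctly_allowed)  # True True
```

Phrasing every self-check so that “yes” means “this should be blocked” keeps the prompt aligned with the parser.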
Full deployment notes with exact commands: DEPLOYMENT_NOTES.md
What’s Next
| Experiment | Description | Status |
|---|---|---|
| E2 | NIM vs Ollama inference benchmark | ✅ Complete (600 data points) |
| E3 | NeMo Guardrails latency overhead | ✅ Complete (270 data points) |
| E4 | RAG + NIM + Guardrails full-stack accuracy | Planned |
| E5 | Air-gap mode verification (fully offline) | Planned |
| E6 | Concurrency stress test (1–50 users) | Planned |
Hardware / Software
| Component | Version |
|---|---|
| GPU | NVIDIA RTX 5090 (32GB GDDR7, sm_120 Blackwell) |
| Driver | 577.00 |
| CUDA | 12.8 (PyTorch) / 12.9 (driver) |
| NIM | 1.13.1, vLLM engine, BF16 |
| NeMo Guardrails | 0.21.0 |
| Ollama | llama3.1:8b, llama.cpp, Q4 GGUF |
| Model | Meta Llama 3.1 8B Instruct |
| OS | Windows 11 + Docker Desktop + NVIDIA Container Toolkit |
Open Source
Full benchmark scripts, question datasets, guardrails config, and all 870 raw data points:
👉 https://github.com/QuanTuring-AI/nim-benchmark
QuanTuring Inc. builds enterprise AI middleware for industries where data sovereignty is non-negotiable. NVIDIA Inception Program member. 2 US provisional patents.
Author: Allen Chen, Founder & CEO