NIM vs Ollama on RTX 5090: 7.3x Faster Inference + NeMo Guardrails at 2.1% Overhead — 870 Data Points

At QuanTuring, we build on-premise AI systems for semiconductor manufacturers in Taiwan — air-gapped, compliance-sensitive, no cloud dependency. When we started evaluating inference engines for production deployment, we wanted hard numbers, not marketing claims.

So we ran two controlled experiments on a single RTX 5090:

  • E2: NIM vs Ollama — 100 questions × 3 rounds × 2 engines = 600 data points

  • E3: NIM + NeMo Guardrails overhead — 45 questions × 3 rounds × 2 modes = 270 data points

Same model (Llama 3.1 8B Instruct), same GPU, same questions. Only variable: the inference stack.


E2: NIM vs Ollama — The Speed Gap

Results (600 data points)

| Metric | NIM 1.13.1 | Ollama (llama.cpp) | Speedup |
|---|---|---|---|
| Avg TPS | 73.8 tok/s | 10.1 tok/s | 7.3x |
| Avg TTFT | 221 ms | 2,876 ms | 13x |
| Avg Total Latency | 4.3 s | 27.7 s | 6.5x |
| VRAM | 31,768 MB | 31,772 MB | same |

The VRAM being nearly identical despite different precision (NIM uses BF16, Ollama uses Q4 GGUF) was the first surprise. The second: 13x faster time-to-first-token is not a throughput trick — it’s a fundamentally different user experience.

Why TTFT Matters More Than TPS

Most benchmarks focus on throughput (tokens/second). But for interactive applications — internal chatbots, engineering assistants, document Q&A — TTFT is what users actually feel:

  • Ollama: User submits query → waits 2.9 seconds → first word appears → waits another 25 seconds for full response

  • NIM: User submits query → waits 0.2 seconds → first word appears → 4 seconds for full response

At 221ms TTFT, NIM feels instant. At 2,876ms, Ollama feels like it’s thinking.
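To make the two metrics concrete, here is a minimal sketch of how TTFT and TPS can be derived from a streamed response. It is not the benchmark script from the repository. The names `measure_stream` and `StreamStats` are made up for illustration, and the TPS definition (streamed tokens divided by total wall-clock time) is an assumption; the actual scripts may define it differently.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamStats(NamedTuple):
    ttft_s: float   # seconds from request start to first token
    total_s: float  # seconds from request start to end of stream
    tokens: int     # number of streamed chunks counted as tokens
    tps: float      # tokens / total_s (assumed definition, see lead-in)


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamStats:
    """Consume a lazy token stream and time it.

    `chunks` should be a generator that only sends the request when it is
    first iterated, so the clock starts before the request does.
    """
    start = clock()
    first_token_at = None
    count = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = clock()  # first token arrived
        count += 1
    end = clock()

    total = end - start
    ttft = (first_token_at - start) if first_token_at is not None else total
    tps = count / total if total > 0 else 0.0
    return StreamStats(ttft, total, count, tps)


def fake_stream():
    """Stand-in for a streaming client; sleeps instead of calling a server."""
    for word in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield word


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

In a real run, `fake_stream` would be replaced by a generator that yields text deltas from the engine's streaming endpoint. The injectable `clock` is there only so the timing logic can be checked deterministically.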

Results by Category

| Category | NIM TPS | Ollama TPS | TPS Speedup | NIM TTFT | Ollama TTFT |
|---|---|---|---|---|---|
| factual_short | 63.4 | 9.6 | 6.6x | 225 ms | 2,919 ms |
| explanation | 80.8 | 11.4 | 7.1x | 224 ms | 2,894 ms |
| multilingual (EN/ZH/JA/KO) | 79.1 | 9.8 | 8.1x | 229 ms | 2,919 ms |
| technical | 77.0 | 11.2 | 6.9x | 213 ms | 2,978 ms |
| rag_simulation | 68.7 | 8.8 | 7.8x | 214 ms | 2,668 ms |

Key finding: Multilingual queries (English, Chinese, Japanese, Korean) showed the largest TPS gap at 8.1x. Both engines run the same model with the same vocabulary, so the wider gap comes from how the two stacks (engine plus precision) behave on this workload rather than from tokenization itself; we did not isolate the cause. For Asia-Pacific enterprise deployment, this matters.


E3: NeMo Guardrails — Safety Without Sacrifice

The natural follow-up: if you add safety rails to protect that fast inference, how much performance do you give back?

Results (270 data points)

| Metric | Value |
|---|---|
| Clean question overhead | +123 ms (+2.1%) |
| Edge case overhead | +1 ms (~0%) |
| Adversarial detection rate | 93.3% (42/45 blocked) |
| False positive rate | 0% (0/90 clean+edge blocked) |
| Blocked request latency | 94 ms (vs 707 ms without guardrails) |

Per-Category Breakdown

| Category | NIM-only | NIM+Guardrails | Overhead | Result |
|---|---|---|---|---|
| Clean tech questions (20) | 5,765 ms | 5,888 ms | +123 ms (+2.1%) | 100% pass |
| Edge cases: security education (10) | 5,958 ms | 5,959 ms | +1 ms (~0%) | 100% pass |
| Adversarial inputs (15) | 707 ms | 301 ms | −406 ms (−57%) | 93.3% blocked |

The adversarial row is the most interesting: blocked requests respond in 94ms — 7.5x faster than letting the unprotected LLM generate a 700ms refusal. Guardrails don’t just add safety — they save GPU time on adversarial traffic.
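For readers who want to check the arithmetic, the headline figures can be recomputed from the numbers reported above. This is only a small sketch; `pct_change` is a helper name introduced here, not something from the benchmark repository.

```python
def pct_change(baseline_ms: float, with_rails_ms: float) -> float:
    """Relative change in latency, in percent, when guardrails are enabled."""
    return (with_rails_ms - baseline_ms) / baseline_ms * 100


# Latencies taken from the per-category results (milliseconds).
clean = pct_change(5765, 5888)        # clean tech questions
adversarial = pct_change(707, 301)    # adversarial inputs, averaged

# 15 adversarial prompts x 3 rounds = 45 attempts, 42 of them blocked.
detection_rate = 42 / 45 * 100

# A blocked request (94 ms) versus an unprotected refusal (707 ms).
blocked_speedup = 707 / 94

print(f"clean overhead:      {clean:+.1f}%")
print(f"adversarial change:  {adversarial:+.1f}%")
print(f"detection rate:      {detection_rate:.1f}%")
print(f"blocked vs refusal:  {blocked_speedup:.1f}x faster")
```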

The Insight Most People Miss

Each guardrail check is basically a 3-token LLM call (“yes” or “no”). The cost of that call is dominated by TTFT, not token generation.

| Engine | TTFT | Guardrail check cost | Overhead on full response |
|---|---|---|---|
| NIM | 221 ms | ~50 ms | +2.1% (imperceptible) |
| Ollama | 2,876 ms | ~3,000 ms (projected) | +21% (projected, unusable) |

On Ollama, adding two guardrail checks (input + output) would add ~6 seconds to every request — most teams would disable safety to preserve usability.

On NIM, the same two checks add ~100ms. NIM doesn’t just make AI faster. It makes enterprise safety practically free.
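The projection boils down to one line of arithmetic: the number of checks times the cost per check, divided by the baseline latency. The sketch below uses the average latencies from E2 as the baseline. That choice is an assumption made for illustration, so the NIM result lands near, but not exactly on, the +2.1% measured in E3 against a different question set.

```python
from typing import Tuple


def guardrail_overhead(
    base_latency_ms: float,
    check_cost_ms: float,
    n_checks: int = 2,  # one input rail plus one output rail
) -> Tuple[float, float]:
    """Return (added latency in ms, overhead as a percentage of the baseline)."""
    added_ms = n_checks * check_cost_ms
    return added_ms, added_ms / base_latency_ms * 100


for engine, base_ms, check_ms in [
    ("NIM", 4_300, 50),         # avg total latency and check cost from the text
    ("Ollama", 27_700, 3_000),  # check cost here is a projection, not a measurement
]:
    added, pct = guardrail_overhead(base_ms, check_ms)
    print(f"{engine}: +{added:.0f} ms per request ({pct:.1f}% overhead)")
```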

Self-Check Architecture

User Input
  → [Input Rail] NIM self-check (~50ms, max 3 tokens: "yes"/"no")
    → "yes" = BLOCK (94ms total, no main inference)
    → "no"  = ALLOW → NIM inference (~5.8s) → [Output Rail] → Response

The same Llama 3.1 8B model handles both inference AND safety checking. No external API, no cloud dependency.
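The flow above can be sketched in a few lines of plain Python. This is not the NeMo Guardrails API and not our production config. `guarded_generate`, `should_block`, the prompt wording, and the refusal text are all hypothetical, and `llm` stands for any callable that sends a prompt to the model and returns its completion.

```python
from typing import Callable

# Hypothetical prompt wording, phrased so that "yes" means "block".
INPUT_CHECK = "Should the following user message be blocked? Answer yes or no.\n\n{text}"
OUTPUT_CHECK = "Should the following bot response be blocked? Answer yes or no.\n\n{text}"
REFUSAL = "I'm sorry, I can't respond to that."

# (prompt, max_tokens) -> completion text
LLM = Callable[[str, int], str]


def should_block(llm: LLM, template: str, text: str) -> bool:
    """Run one self-check: a tiny completion capped at 3 tokens."""
    answer = llm(template.format(text=text), 3)
    return answer.strip().lower().startswith("yes")


def guarded_generate(llm: LLM, user_input: str, max_tokens: int = 512) -> str:
    # Input rail: a blocked request never reaches main inference.
    if should_block(llm, INPUT_CHECK, user_input):
        return REFUSAL
    response = llm(user_input, max_tokens)
    # Output rail: check the generated text before returning it.
    if should_block(llm, OUTPUT_CHECK, response):
        return REFUSAL
    return response
```

The point of the structure is that a blocked input costs a single 3-token call, while an allowed request costs two short checks plus the main generation.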


The Combined Stack

| Stack | Avg Latency | Enterprise Safety | Air-Gap Ready |
|---|---|---|---|
| Ollama (no guardrails) | ~27,700 ms | None | Yes |
| Ollama + Guardrails | ~33,700 ms (+21%, projected) | Yes | Yes |
| NIM (no guardrails) | ~4,300 ms | None | Yes |
| NIM + Guardrails | ~4,400 ms (+2.1%) | Yes | Yes |

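NIM is fast enough that guardrails become a rounding error. The +2.1% overhead is imperceptible to users, but the protection is real. Each blocked adversarial request also returns in 94 ms instead of the 707 ms the unprotected model spent generating a refusal, and it avoids a full generation of roughly 5.7 seconds in cases where the model would otherwise have complied.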


Getting NIM Running on RTX 5090 — Deployment Gotchas

RTX 5090 is a consumer Blackwell GPU (sm_120), and the NIM ecosystem is primarily validated on data center hardware. Here’s the obstacle course:

Issue 1: Latest NIM requires CUDA 13.0

The RTX 5090 with driver 577.00 supports CUDA 12.9. NIM ≥1.15.0 requires CUDA 13.0 and fails immediately. Solution: use NIM 1.13.1, the last version with a CUDA 12.x requirement.

Issue 2: TensorRT-LLM profile hangs on sm_120

NIM auto-selected a TensorRT-LLM profile for RTX 5090. The container started but froze at 0% CPU and 0 MB memory, with no error and no progress. Root cause: TRT-LLM profile compilation for sm_120 was not stable. Solution: force the vLLM profile by passing the exact profile hash via NIM_MODEL_PROFILE.
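Forcing a profile comes down to passing one extra environment variable to the container. The sketch below only assembles a `docker run` argument list, and it rests on several assumptions: the helper `nim_docker_command` is hypothetical, the image reference and tag are our guess at the naming pattern, the profile hash is left as a placeholder, and the command omits the credentials and cache mounts a real launch needs. The exact commands are in the deployment notes.

```python
import shlex
from typing import Dict, List


def nim_docker_command(image: str, env: Dict[str, str], port: int = 8000) -> List[str]:
    """Build a `docker run` argv that passes `env` into the container."""
    cmd = ["docker", "run", "--rm", "--gpus", "all", "-p", f"{port}:8000"]
    for key, value in env.items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    return cmd


# Assumed image reference; verify it against your registry before use.
IMAGE = "nvcr.io/nim/meta/llama-3.1-8b-instruct:1.13.1"

cmd = nim_docker_command(
    IMAGE,
    {"NIM_MODEL_PROFILE": "<vllm-profile-hash>"},  # placeholder, not a real hash
)
print(shlex.join(cmd))
```

Building the argv as a list rather than a single string avoids shell-quoting surprises when it is later handed to `subprocess.run`.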

Issue 3: KV cache OOM with default max_model_len

NIM defaulted to max_model_len=131072 (128K context), which needs 16GB of KV cache alone. After loading BF16 weights (~16GB), only ~10.6GB remained. Solution: NIM_MAX_MODEL_LEN=8192 cuts KV cache to ~1.5GB while still handling enterprise workloads.
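The 16GB figure follows from the model's shape. A back-of-the-envelope sketch, assuming the published Llama 3.1 8B architecture (32 layers, 8 KV heads, head dimension 128) and 2-byte BF16 cache entries, for a single full-length sequence:

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 32,      # Llama 3.1 8B transformer layers
    n_kv_heads: int = 8,     # grouped-query attention: 8 KV heads
    head_dim: int = 128,     # 4096 hidden size / 32 attention heads
    bytes_per_elem: int = 2, # BF16
) -> int:
    """KV cache size for one sequence: keys plus values across all layers."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


GIB = 2**30
for context in (131_072, 8_192):
    size = kv_cache_bytes(context)
    print(f"max_model_len={context}: {size / GIB:.2f} GiB per sequence")
```

This reproduces 16 GiB for the 128K default. For 8192 tokens the same formula gives about 1 GiB per sequence, a little under the ~1.5GB quoted above; the engine's actual allocation presumably includes buffers this simple model leaves out.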

Issue 4: NeMo Guardrails is_content_safe parser inversion

The built-in output parser uses inverted yes/no logic. LLM answers “yes” → content is unsafe → BLOCK. If you write a prompt asking “Does this comply with policy?” and the LLM correctly says “yes”, it gets blocked. The fix: write prompts asking about violations, not compliance.
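A toy reproduction makes the trap easier to see. The function below is a stand-in for the behavior described here ("yes" means block), not the library's actual parser. `blocks` and the two example answers are hypothetical.

```python
def blocks(llm_answer: str) -> bool:
    """Stand-in for the described parser behavior: a 'yes' answer means BLOCK."""
    return llm_answer.strip().lower().startswith("yes")


# The same perfectly safe response, checked with two prompt phrasings.
# "Does this comply with policy?"  -> the model correctly answers "yes"
compliance_answer = "yes"
# "Does this violate policy?"      -> the model correctly answers "no"
violation_answer = "no"

print("compliance phrasing blocked:", blocks(compliance_answer))
print("violation phrasing blocked: ", blocks(violation_answer))
```

With the compliance phrasing, a correct answer gets safe content blocked. With the violation phrasing, the same content passes.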

Full deployment notes with exact commands: DEPLOYMENT_NOTES.md


What’s Next

| Experiment | Description | Status |
|---|---|---|
| E2 | NIM vs Ollama inference benchmark | ✅ Complete (600 data points) |
| E3 | NeMo Guardrails latency overhead | ✅ Complete (270 data points) |
| E4 | RAG + NIM + Guardrails full-stack accuracy | Planned |
| E5 | Air-gap mode verification (fully offline) | Planned |
| E6 | Concurrency stress test (1–50 users) | Planned |

Hardware / Software

| Component | Version |
|---|---|
| GPU | NVIDIA RTX 5090 (32GB GDDR7, sm_120 Blackwell) |
| Driver | 577.00 |
| CUDA | 12.8 (PyTorch) / 12.9 (driver) |
| NIM | 1.13.1, vLLM engine, BF16 |
| NeMo Guardrails | 0.21.0 |
| Ollama | llama3.1:8b, llama.cpp, Q4 GGUF |
| Model | Meta Llama 3.1 8B Instruct |
| OS | Windows 11 + Docker Desktop + NVIDIA Container Toolkit |

Open Source

Full benchmark scripts, question datasets, guardrails config, and all 870 raw data points:

👉 https://github.com/QuanTuring-AI/nim-benchmark


QuanTuring Inc. builds enterprise AI middleware for industries where data sovereignty is non-negotiable. NVIDIA Inception Program member. 2 US provisional patents.

Author: Allen Chen, Founder & CEO