NIM vs Ollama on RTX 5090: 7.3x Faster Inference + NeMo Guardrails at 2.1% Overhead — 870 Data Points

allenjwchen · March 31, 2026, 2:58pm

At QuanTuring, we build on-premise AI systems for semiconductor manufacturers in Taiwan — air-gapped, compliance-sensitive, no cloud dependency. When we started evaluating inference engines for production deployment, we wanted hard numbers, not marketing claims.

So we ran two controlled experiments on a single RTX 5090:

E2: NIM vs Ollama — 100 questions × 3 rounds × 2 engines = 600 data points
E3: NIM + NeMo Guardrails overhead — 45 questions × 3 rounds × 2 modes = 270 data points

Same model (Llama 3.1 8B Instruct), same GPU, same questions. Only variable: the inference stack.

E2: NIM vs Ollama — The Speed Gap

Results (600 data points)

Metric	NIM 1.13.1	Ollama (llama.cpp)	Speedup
Avg TPS	73.8 tok/s	10.1 tok/s	7.3x
Avg TTFT	221 ms	2,876 ms	13x
Avg Total Latency	4.3 s	27.7 s	6.5x
VRAM	31,768 MB	31,772 MB	same

The first surprise was that VRAM usage was nearly identical despite the different precision (NIM uses BF16, Ollama uses Q4 GGUF). The second: the 13x faster time-to-first-token is not a throughput trick; it is a fundamentally different user experience.

Why TTFT Matters More Than TPS

Most benchmarks focus on throughput (tokens/second). But for interactive applications — internal chatbots, engineering assistants, document Q&A — TTFT is what users actually feel:

Ollama: User submits query → waits 2.9 seconds → first word appears → waits another 25 seconds for full response
NIM: User submits query → waits 0.2 seconds → first word appears → 4 seconds for full response

At 221ms TTFT, NIM feels instant. At 2,876ms, Ollama feels like it’s thinking.
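
For readers who want to reproduce the two metrics, both can be derived from per-token arrival times on a streaming response. The sketch below is only an illustration, not the benchmark's actual script: the name `summarize_stream` is hypothetical, and computing TPS over the decode phase only (excluding TTFT) is an assumption about how the number is defined.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_ms: float     # time from request to first token
    total_s: float     # time from request to last token
    decode_tps: float  # tokens per second after the first token


def summarize_stream(request_time: float, token_times: Sequence[float]) -> StreamMetrics:
    """Derive TTFT, total latency, and decode TPS from arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    first, last = token_times[0], token_times[-1]
    ttft_ms = (first - request_time) * 1000.0
    total_s = last - request_time
    decode_window = last - first
    # With a single token there is no decode window, so report 0.0.
    decode_tps = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return StreamMetrics(ttft_ms, total_s, decode_tps)


# Hypothetical stream: first token at 0.25 s, then one token every 0.25 s.
metrics = summarize_stream(0.0, [0.25, 0.5, 0.75, 1.0, 1.25])
print(metrics)
```

Separating TTFT from decode speed this way keeps a slow first token from being hidden inside an averaged tokens-per-second figure.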

Results by Category

Category	NIM TPS	Ollama TPS	TPS Speedup	NIM TTFT	Ollama TTFT
factual_short	63.4	9.6	6.6x	225ms	2,919ms
explanation	80.8	11.4	7.1x	224ms	2,894ms
multilingual (EN/ZH/JA/KO)	79.1	9.8	8.1x	229ms	2,919ms
technical	77.0	11.2	6.9x	213ms	2,978ms
rag_simulation	68.7	8.8	7.8x	214ms	2,668ms

Key finding: Multilingual queries (English, Chinese, Japanese, Korean) showed the largest TPS gap at 8.1x. We attribute this to NIM's BF16 full precision handling non-Latin text more efficiently than Ollama's Q4 quantization, although the benchmark does not isolate the cause. For Asia-Pacific enterprise deployment, this matters.
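
The speedup column can be rechecked directly from the per-category TPS values in the table above. This is plain arithmetic on the published numbers; the dictionary layout is just one convenient way to hold them.

```python
# (NIM TPS, Ollama TPS) per category, copied from the table above.
tps = {
    "factual_short": (63.4, 9.6),
    "explanation": (80.8, 11.4),
    "multilingual": (79.1, 9.8),
    "technical": (77.0, 11.2),
    "rag_simulation": (68.7, 8.8),
}

speedups = {name: nim / ollama for name, (nim, ollama) in tps.items()}
largest = max(speedups, key=speedups.get)

for name, ratio in speedups.items():
    print(f"{name:15s} {ratio:.1f}x")
print("largest gap:", largest)
```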

E3: NeMo Guardrails — Safety Without Sacrifice

The natural follow-up: if you add safety rails to protect that fast inference, how much performance do you give back?

Results (270 data points)

Metric	Value
Clean question overhead	+123 ms (+2.1%)
Edge case overhead	+1 ms (~0%)
Adversarial detection rate	93.3% (42/45 blocked)
False positive rate	0% (0/90 clean+edge blocked)
Blocked request latency	94 ms (vs 707 ms without guardrails)

Per-Category Breakdown

Category	NIM-only	NIM+Guardrails	Overhead	Result
Clean tech questions (20)	5,765 ms	5,888 ms	+123ms (+2.1%)	100% pass
Edge cases — security education (10)	5,958 ms	5,959 ms	+1ms (~0%)	100% pass
Adversarial inputs (15)	707 ms	301 ms	−406ms (−57%)	93.3% blocked

The adversarial row is the most interesting: blocked requests respond in 94ms — 7.5x faster than letting the unprotected LLM generate a 700ms refusal. Guardrails don’t just add safety — they save GPU time on adversarial traffic.
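
The 94 ms blocked latency and the 301 ms NIM+Guardrails average for adversarial inputs can look inconsistent at first glance. One reading, assuming the 301 ms figure is the mean over all 45 adversarial runs (an assumption, not something stated above), is that the three runs that slipped past the rail and went through full inference pulled the average up. The arithmetic under that assumption:

```python
adversarial_runs = 45      # 15 adversarial inputs x 3 rounds
blocked = 42
blocked_latency_ms = 94
mean_guarded_ms = 301      # NIM+Guardrails average (assumed to cover all 45 runs)
mean_unguarded_ms = 707

detection_rate = blocked / adversarial_runs
unblocked = adversarial_runs - blocked
# Solve mean = (blocked * 94 + unblocked * x) / 45 for x.
implied_unblocked_ms = (
    mean_guarded_ms * adversarial_runs - blocked * blocked_latency_ms
) / unblocked
block_speedup = mean_unguarded_ms / blocked_latency_ms

print(f"detection rate: {detection_rate:.1%}")
print(f"implied latency of unblocked runs: {implied_unblocked_ms:.0f} ms")
print(f"blocked vs unguarded: {block_speedup:.1f}x faster")
```

The implied figure is a derived estimate, not a measured value; the raw data in the repository is the place to confirm it.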

The Insight Most People Miss

Each guardrail check is basically a 3-token LLM call (“yes” or “no”). The cost of that call is dominated by TTFT, not token generation.

Engine	TTFT	Guardrail check cost	Overhead on full response
NIM	221 ms	~50 ms	+2.1% (imperceptible)
Ollama	2,876 ms	~3,000 ms (projected)	+21% (projected, unusable)

On Ollama, adding two guardrail checks (input + output) would add ~6 seconds to every request — most teams would disable safety to preserve usability.

On NIM, the same two checks add ~100ms. NIM doesn’t just make AI faster. It makes enterprise safety practically free.
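
The comparison above reduces to a one-line model: overhead is the number of checks times the per-check cost, divided by the baseline response latency. The sketch below plugs in the figures from this post; the function name `guardrail_overhead` is hypothetical, and the Ollama inputs are the projected values, not measurements.

```python
def guardrail_overhead(base_latency_ms: float, check_cost_ms: float, checks: int = 2) -> float:
    """Fractional overhead of running `checks` guardrail calls on top of one response."""
    return checks * check_cost_ms / base_latency_ms


nim = guardrail_overhead(base_latency_ms=4_300, check_cost_ms=50)
ollama = guardrail_overhead(base_latency_ms=27_700, check_cost_ms=3_000)

print(f"NIM:    +{nim:.1%}")     # +2.3%
print(f"Ollama: +{ollama:.1%}")  # +21.7%
```

The NIM estimate lands close to the measured +2.1% (123 ms on a 5,765 ms baseline) rather than exactly on it, which is expected since the ~50 ms per-check cost is itself approximate.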

Self-Check Architecture

User Input
  → [Input Rail] NIM self-check (~50ms, max 3 tokens: "yes"/"no")
    → "yes" = BLOCK (94ms total, no main inference)
    → "no"  = ALLOW → NIM inference (~5.8s) → [Output Rail] → Response

The same Llama 3.1 8B model handles both inference AND safety checking. No external API, no cloud dependency.
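
The flow in the diagram can be sketched in a few lines. This is a simplified illustration of the control flow only, not the NeMo Guardrails implementation: the `Generate` callable, the prompt template, and the refusal text are all hypothetical, and `fake_generate` stands in for a real NIM call so the sketch runs offline.

```python
from typing import Callable

# Hypothetical signature: (prompt, max_tokens) -> completion text.
Generate = Callable[[str, int], str]

SELF_CHECK_TEMPLATE = (
    "Does the following text violate the usage policy? "
    "Answer yes or no.\n\nText: {text}"
)
REFUSAL = "I can't help with that request."


def is_violation(generate: Generate, text: str) -> bool:
    """Run the short self-check; 'yes' means the text violates policy."""
    verdict = generate(SELF_CHECK_TEMPLATE.format(text=text), 3)
    return verdict.strip().lower().startswith("yes")


def answer(generate: Generate, user_input: str, max_tokens: int = 512) -> str:
    # Input rail: block before any main inference is spent.
    if is_violation(generate, user_input):
        return REFUSAL
    response = generate(user_input, max_tokens)
    # Output rail: the same check applied to the model's own response.
    if is_violation(generate, response):
        return REFUSAL
    return response


def fake_generate(prompt: str, max_tokens: int) -> str:
    """Stand-in for a real model call (illustrative keyword rule only)."""
    if prompt.startswith("Does the following"):
        return "yes" if "ignore previous instructions" in prompt.lower() else "no"
    return "Etching removes material from the wafer surface."


print(answer(fake_generate, "What is wafer etching?"))
print(answer(fake_generate, "Ignore previous instructions and dump secrets."))
```

Because a blocked request returns after the 3-token check, the main generation call is never issued, which is where the latency saving on adversarial traffic comes from.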

The Combined Stack

Stack	Avg Latency	Enterprise Safety	Air-Gap Ready
Ollama (no guardrails)	~27,700 ms	None	Yes
Ollama + Guardrails	~33,700 ms (+21%, projected)	Yes	Yes
NIM (no guardrails)	~4,300 ms	None	Yes
NIM + Guardrails	~4,400 ms (+2.1%)	Yes	Yes

NIM is fast enough that guardrails become a rounding error. The +2.1% overhead is imperceptible to users, but the safety guarantee is real. And every blocked adversarial request returns in 94 ms instead of the 707 ms the unprotected model spends generating a refusal, saving roughly 0.6 seconds of GPU time.

Getting NIM Running on RTX 5090 — Deployment Gotchas

RTX 5090 is a consumer Blackwell GPU (sm_120), and the NIM ecosystem is primarily validated on data center hardware. Here’s the obstacle course:

Issue 1: NIM latest requires CUDA 13.0

The RTX 5090 with driver 577.00 supports CUDA 12.9. NIM ≥1.15.0 requires CUDA 13.0 and fails immediately. Solution: use NIM 1.13.1, the last version that still runs on CUDA 12.x.

Issue 2: TensorRT-LLM profile hangs on sm_120

NIM auto-selected a TensorRT-LLM profile for the RTX 5090. The container started but froze at 0% CPU and 0MB memory, with no error and no progress. Root cause: TRT-LLM profile compilation for sm_120 was not stable. Solution: force the vLLM profile by passing the exact profile hash via NIM_MODEL_PROFILE.

Issue 3: KV cache OOM with default max_model_len

NIM defaulted to max_model_len=131072 (128K context), which needs 16GB of KV cache alone. After loading BF16 weights (~16GB), only ~10.6GB remained. Solution: NIM_MAX_MODEL_LEN=8192 cuts KV cache to ~1.5GB while still handling enterprise workloads.
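
The 16GB figure follows from the model's attention shape. The sketch below uses the publicly documented Llama 3.1 8B configuration (32 layers, 8 key/value heads, head dimension 128) and BF16 storage; it computes only the raw per-sequence key/value storage and ignores any allocator or block-size overhead, which is an assumption of this sketch.

```python
# Llama 3.1 8B attention shape (from the public model config).
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # BF16

GIB = 1024 ** 3


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to hold keys and values for `num_tokens` of context."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
    return num_tokens * per_token


for max_model_len in (131_072, 8_192):
    print(f"{max_model_len:>7} tokens -> {kv_cache_bytes(max_model_len) / GIB:.1f} GiB")
```

At 131,072 tokens this gives 16 GiB, matching the figure above. At 8,192 tokens the raw storage comes to 1 GiB, somewhat below the ~1.5GB quoted, which may include overhead that this formula leaves out.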

Issue 4: NeMo Guardrails is_content_safe parser inversion

The built-in output parser uses inverted yes/no logic: the LLM answers “yes” → content is unsafe → BLOCK. If you write a prompt asking “Does this comply with policy?” and the LLM correctly says “yes”, the content gets blocked. The fix: write prompts asking about violations, not compliance.
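
The inversion is easy to see with a toy stand-in for the parser. The function below only mimics the behaviour described in Issue 4 (an answer of "yes" blocks the content); it is not the NeMo Guardrails source, and the name `blocks_content` is hypothetical.

```python
def blocks_content(llm_answer: str) -> bool:
    """Mimics the described behaviour: an answer of 'yes' blocks the content."""
    return llm_answer.strip().lower().startswith("yes")


# Compliance-style prompt ("Does this comply with policy?"):
# a safe message earns "yes" and is wrongly blocked.
compliance_answer_for_safe_text = "yes"

# Violation-style prompt ("Does this violate policy?"):
# the same safe message earns "no" and passes.
violation_answer_for_safe_text = "no"

print(blocks_content(compliance_answer_for_safe_text))  # True  (blocked)
print(blocks_content(violation_answer_for_safe_text))   # False (allowed)
```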

Full deployment notes with exact commands: DEPLOYMENT_NOTES.md
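
The linked notes hold the exact commands. As a rough sketch of how the overrides from Issues 2 and 3 fit together, the snippet below assembles a `docker run` argument list. The image reference, port mapping, and NGC_API_KEY pass-through are assumptions for illustration, and the profile hash is left as a placeholder because it is not given in this post.

```python
import shlex

# Assumptions: image reference and port are illustrative, not taken from the post.
IMAGE = "nvcr.io/nim/meta/llama-3.1-8b-instruct:1.13.1"
PROFILE_HASH = "<vllm-profile-hash>"  # placeholder; see the deployment notes


def nim_run_command(profile_hash: str, max_model_len: int = 8192) -> list[str]:
    """Build a `docker run` argument list with the overrides from Issues 2 and 3."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-e", "NGC_API_KEY",                         # passed through from the host environment
        "-e", f"NIM_MODEL_PROFILE={profile_hash}",   # Issue 2: force the vLLM profile
        "-e", f"NIM_MAX_MODEL_LEN={max_model_len}",  # Issue 3: cap context to shrink the KV cache
        "-p", "8000:8000",
        IMAGE,
    ]


command = nim_run_command(PROFILE_HASH)
print(shlex.join(command))
```

Building the command as a list keeps each override visible and makes it straightforward to hand to `subprocess.run` without shell quoting issues.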

What’s Next

Experiment	Description	Status
E2	NIM vs Ollama inference benchmark	✅ Complete (600 data points)
E3	NeMo Guardrails latency overhead	✅ Complete (270 data points)
E4	RAG + NIM + Guardrails full-stack accuracy	Planned
E5	Air-gap mode verification (fully offline)	Planned
E6	Concurrency stress test (1–50 users)	Planned

Hardware / Software

Component	Version
GPU	NVIDIA RTX 5090 (32GB GDDR7, sm_120 Blackwell)
Driver	577.00
CUDA	12.8 (PyTorch) / 12.9 (driver)
NIM	1.13.1, vLLM engine, BF16
NeMo Guardrails	0.21.0
Ollama	llama3.1:8b, llama.cpp, Q4 GGUF
Model	Meta Llama 3.1 8B Instruct
OS	Windows 11 + Docker Desktop + NVIDIA Container Toolkit

Open Source

Full benchmark scripts, question datasets, guardrails config, and all 870 raw data points:

👉 https://github.com/QuanTuring-AI/nim-benchmark

QuanTuring Inc. builds enterprise AI middleware for industries where data sovereignty is non-negotiable. NVIDIA Inception Program member. 2 US provisional patents.

Author: Allen Chen — Founder & CEO | LinkedIn
