Spark-inference: Run 3 specialized models simultaneously on your DGX Spark — cybersecurity + coding + orchestration, 30-min setup

victorgilasp · May 6, 2026, 10:33pm

Community release · GB10 / SM12.1 · vLLM + IronClaw + LiteLLM · May 2026

Tested on GB10 SM12.1 2 || 3 models simultaneous Telegram agent Open source ~120GB stack

TL;DR

I spent a week fighting SM12.1 quirks, memory bloat, and broken kernels on my DGX Spark. The result is a tested, reproducible stack that runs 3 specialized models simultaneously in ~120GB, includes a Telegram agent, and sets up in 30 minutes. Everything is open source.

Repo: spark-inference — recipes, scripts, IronClaw agent install, LiteLLM proxy, SM12.1 notes
Built on top of eugr/spark-vllm-docker (prebuilt SM12.1 wheels, ~10-min build)

Benchmark results

The stack supports two modes — you choose based on your needs:

Architecture

Telegram / CLI / OpenWebUI
│
▼
IronClaw agent → LiteLLM proxy (port 4000)
│
┌────────┼────────┐
▼ ▼ ▼
port 8000 port 8001 port 8002
Nemotron-Nano Qwen3.6 Primus/Sec
~32GB ~45GB ~35GB
orchestrator coding security

What’s in the repo

vLLM recipes (memory-optimized for GB10)

Model	Format	RAM	Role
Nemotron-3-Nano-30B NVFP4	NVFP4	~32GB	Orchestrator
Qwen3.6-35B-A3B	FP8	~45GB	Coding + Vision
Llama-Primus-Reasoning (Trend Micro)	BF16	~35GB	Pentest + Reasoning
Foundation-Sec-8B-Instruct (Cisco)	BF16	~35GB	CVE / MITRE ATT&CK / SOC
Nemotron-3-Super-120B	NVFP4	~87GB	Single powerful mode
Qwen3-235B-A22B	FP8	~115GB	Single powerful mode

IronClaw agent (Telegram + CLI)

Reproducible install script for IronClaw on aarch64 — includes all the fixes needed for GB10: PostgreSQL sslmode, DB settings override after onboard, systemd service with correct env vars, Telegram pairing flow. One script, no wizard required after first run.

LiteLLM proxy

Unifies all vLLM endpoints under a single OpenAI-compatible port (4000). IronClaw routes to the right model by name. Idempotent install — re-run to update or add models.

Startup manager + watchdog

start-all.sh handles two things: interactive model selection on demand, and automatic recovery after a reboot.

In interactive mode it shows available memory, lists all recipes with RAM estimates, and lets you pick which models to start — including switching between eager (3 models) and CUDA graphs (1-2 models) mode.

The watchdog runs as a systemd service that activates 5 minutes after boot, checks which models should be running, and starts them automatically. No manual intervention needed after a Spark reboot.

# Interactive — pick models manually
bash scripts/start-all.sh

# Install watchdog (auto-recovery 5 min after reboot)
bash scripts/start-all.sh --install-watchdog

# Uninstall watchdog
bash scripts/start-all.sh --uninstall-watchdog

# Force auto-recovery now (same as watchdog runs)
bash scripts/start-all.sh --auto

Key SM12.1 findings (learned the hard way)

Default vLLM uses 117GB for a 19GB model — 89GB goes to KV cache pre-allocation for concurrent requests you’ll never have. Fix: --gpu-memory-utilization 0.25 drops it to ~32GB. (Credit: sggin1’s memory post)

CUDA graphs need ~130GB+ per model — not feasible with 3 models. Disable with --enforce-eager when running multi-model stacks. Enable --compilation-config for single/dual model mode.

FlashAttn doesn’t support SM12.1 — use VLLM_ATTENTION_BACKEND=FLASHINFER.

CUTLASS FP4 kernels crash on SM12.1 — use Marlin: VLLM_NVFP4_GEMM_BACKEND=marlin.

DeepSeek-R1-0528 full model not usable — MLA attention crashes with all available backends on SM12.1. Use distilled versions (Qwen2 architecture) instead.

FlashInfer MoE throughput backend crashes — use VLLM_FLASHINFER_MOE_BACKEND=latency.

Quick start

Spark Inference Repo: GitHub - VictorGil-Ops/spark-inference: Personal inference stack for NVIDIA DGX Spark GB10 / SM12.1 Blackwell · GitHub

# 1. Build base container (~30 min, one time)
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker && ./build-and-copy.sh

# 2. Clone spark-inference
git clone https://github.com/VictorGil-Ops/spark-inference.git

# 3. Launch models interactively
bash spark-inference/spark.sh

# 4. Install IronClaw + Telegram (optional) and more
menu

Credits

eugr/spark-vllm-docker

The foundation. Prebuilt SM12.1 wheels, custom NCCL build for DGX Spark ring topology, ~10-min build vs 60+ min from source.

sggin1 — Memory Creep post

The definitive analysis of where your 128GB actually goes. The gpu_memory_utilization strategy saved ~85GB per model.

Sggin1/DGX-SPARK

SM12.1 research notes and TurboQuant KV compression research (PR #38479 — promising, not merged yet).

nearai/ironclaw

Rust-based agent framework. No Node.js event loop blocking, PostgreSQL + pgvector memory, WASM sandboxed channels, Telegram polling.

Tested on DGX Spark GB10 · CUDA 13.2 · vLLM main (eugr build) · May 2026
Feedback welcome — especially if you find a way to run CUDA graphs with 3 models simultaneously, or if DeepSeek-R1 MLA gets fixed on SM12.1.

Updates

2026-05-08

The gemma-4-nvfp4 model is added and new tests are carried out to fine-tune weights and speed

github.com/VictorGil-Ops/spark-inference

CHANGELOG.md

master

# Benchmarks

## update 2016-05-08
Prompt: 500 tokens, transformer explanation

| Date | Model | tok/s | RAM | Mode |
|------|-------|-------|-----|------|
| 2026-05-08 00:00 | nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 | 37.8 | ~32GB | Eager |
| 2026-05-08 02:56 | AEON-7/Nemotron-3-Nano-Omni-AEON-Ultimate-Uncensored-NVFP4 | 72.0 | ~32GB | CUDA graphs |
| 2026-05-08 10:17 | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | 62.2 | ~32GB | CUDA graphs |
| 2026-05-08 12:06 | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | 56.2 | ~32GB | Eager |
| 2026-05-08 13:02 | cybermotaz/nemotron3-nano-nvfp4-w4a16 | 43.4 | ~18GB | Eager |
| 2026-05-08 13:18 | Qwen/Qwen3.6-35B-A3B-FP8 | 53.7 | ~64GB | CUDA graphs |
| 2026-05-08 13:34 | Qwen/Qwen3.6-35B-A3B-FP8 | 33.5 | ~64GB | Eager |
| 2026-05-08 14:07 | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | 15.6 | ~91GB | CUDA graphs |
| 2026-05-08 16:19 | nvidia/Gemma-4-26B-A4B-NVFP4 | 29.9 | ~117GB | CUDA graphs |

azampatti · May 7, 2026, 5:06am

@whpthomas Sir, you were talking about potentially loading both Qwen3.6-27B and 35B-A3B… this might be it :)

In conjunction “might” work better than 122b-hybrid… I will have to give this a go sometime during the weekend!

norman.2 · May 7, 2026, 6:21am

Cutlass NVFP4 runs without problems on my spark via vLLM. The crashes seem odd. but nice setup :)

victorgilasp · May 11, 2026, 11:54pm

spark-inference stack update — Mode switching, new recipes & IronClaw

Sharing an update on my local inference stack for the DGX Spark GB10.

Mode switching

spark.sh now has a Switch Mode option to swap between two profiles:

IronClaw mode — models tuned for agentic use (tool calling, reasoning, Telegram agent)
OpenCode mode — models tuned for coding (CUDA graphs, prefix caching, high throughput)

Switching stops the running container (25s graceful timeout), starts the new model, and auto-generates opencode.json with the correct endpoint and context limits. Adding a new model is just dropping a .yaml in recipes/ironclaw/ or recipes/opencode/.

New recipes

IronClaw: Gemma-4 26B multimodal, Nemotron Super 120B (native reasoning), Nemotron Omni 30B (text-only ~32GB and full multimodal ~96GB)

OpenCode: Qwen3.6 35B FP8 with CUDA graphs (196K context), Intel AutoRound Qwen3.5 122B INT4 + bfloat16 KV (~24 tok/s, quality-first for long sessions), Nemotron Super 120B

Recipe improvements

All recipes now include --chat-template-content-format string, correct --tool-call-parser per model, tuned max_num_seqs (8 for single-user, 128 for multi-user), and --language-model-only on Omni models when vision isn’t needed.

Quick parser reference for the GB10: gemma4 for Gemma-4, qwen3_coder for Qwen opencode, qwen3_xml for long-context Qwen agent, hermes for Nemotron, nemotron_v3 for Nemotron reasoning.

IronClaw integration

Full setup via install.sh + setup.sh: LiteLLM proxy on port 4000, llama.cpp embeddings (nomic-embed-text), workspace identity files auto-imported into IronClaw memory, Telegram polling, PostgreSQL with pgvector.

Known issues

--compilation-config JSON syntax varies between launch methods — using flag form --compilation-config.cudagraph_capture_sizes as workaround
AEON-7 Nemotron Omni outputs extra tokens after EOS — workaround: --stop-token-ids 11
Atlas backend integration implemented but not fully tested

Suggestions for recipe optimizations are very welcome — especially around KV cache tuning, CUDA graph sizes, and attention backends. Still actively working on this and bugs are expected.

UPDATE - 2026-05-12

spark-inference update — Agent roles, onboarding, and mode/role integration

Another update on the local inference stack for the DGX Spark GB10. Today’s focus was entirely on the IronClaw agent layer — making it more structured, easier to configure, and smarter about which model to use depending on the task.

Role-based agent system

The agent now has a role system that adapts its identity, behaviour, and default model to different use cases. Five roles are available out of the box:

Role	Default Model	Use case
Personal Assistant	gemma-4	Reminders, notes, news, Telegram, system monitoring
Software Developer	qwen36	Coding, architecture, GitHub, shell
AI/ML Engineer	nemotron-super	Model selection, recipe tuning, inference infra
Security Researcher	foundation-sec	Threat modeling, CVE analysis, code audit
Researcher	nemotron-super	Papers, synthesis, long-context analysis

Each role ships with three files: SOUL.md (values and behaviour), AGENTS.md (model roster and tool access), and HEARTBEAT.md (background tasks). These are loaded into IronClaw’s memory at install or role switch time.

Roles live in ironclaw/roles/ — adding a new one is just creating a directory with those three files.

Onboarding questionnaire

The installer now asks a few questions before starting:

Agent name
Your name and location
Role selection
Preferred language

Answers are written into the workspace .md files and imported into memory automatically. No manual editing needed for a basic setup.

Mode and role integration

Previously, switching inference profiles (ironclaw/opencode) and switching agent roles were two separate operations. They’re now unified under [5] Switch Mode:

When you switch to an inference profile, the system suggests the associated agent role
You can also switch role independently without changing the running model
The menu shows the currently active role in the status bar

Default associations: ironclaw/ → personal-assistant, opencode/ → developer. These are suggestions, not constraints — you can mix any role with any model.

Important note: switching roles or models in the menu only changes what IronClaw sends requests to. It does not load or unload models from GB10 unified memory. Use [2] Models or [5] Switch Mode to manage what’s actually running.

Embedding infrastructure

The agent now knows explicitly about the local embedding server (nomic-embed-text-v1.5 on llama.cpp at port 8010). This is documented in every role’s AGENTS.md so the agent can reason about its own memory infrastructure when asked.

What still needs work

Recipe optimization per role is still largely manual. The current model-to-role mapping is based on general reasoning about each model’s strengths, but hasn’t been systematically validated against real workloads. Key open questions:

Does foundation-sec actually outperform nemotron-super for security tasks on GB10, or is the reasoning capability more important than domain-specific training?
Is qwen36 the right default for the developer role, or does nemotron-super’s native reasoning make it better for architecture-heavy sessions?
How well do the Omni models (with vision) work as personal assistants via Telegram when processing images?
What’s the right max_num_seqs and max_model_len balance for each role’s typical workload?

The goal is to make the role-to-model mapping as flexible as possible — ideally the agent should be able to suggest switching models mid-session based on the task at hand, not just at role switch time.

If you’ve experimented with specific models for specific tasks on GB10, I’d love to hear what’s worked. Still actively developing this and bugs are expected.

Topic		Replies	Views
HOW-TO: setup-dgx-spark docker inference - A "Sane" Inference Stack for GB10 (Need Contributors!) DGX Spark / GB10 Projects docker , llama , dgx	39	2852	June 21, 2026
DGX Spark: The Sovereign AI Stack — Dual-Model Architecture for Local Inference DGX Spark / GB10 Projects docker , spark , llm	9	2091	February 13, 2026
DGX Spark performance DGX Spark / GB10	49	6488	February 13, 2026
Can someone please just help me set the DGX Spark up for optimal LLM use? DGX Spark / GB10 llama	11	1421	June 20, 2026
Spark: one script CLI for setup, remote access, and LLM serving on DGX Spark DGX Spark / GB10 Projects cuda , docker , spark , llm , deepseek	3	475	May 21, 2026
Managing Local LLM Orchestration DGX Spark / GB10 Projects	12	3102	April 23, 2026
DGX Spark + Qwen3-Next-80B: Proven Performance, But Missing Clear Path to NIM, TensorRT-LLM & Web UIs DGX Spark / GB10 cuda , nim , llama	16	4956	March 6, 2026
Running a Full LLM Stack on DGX Spark GB10 (Your Application -> LiteLLM -> llama-swap -> vLLM / llama.cpp / Ollama) DGX Spark / GB10 Projects spark , jetson , llama , nemotron , openclaw	19	3981	May 28, 2026
Vibe Coding with NVIDIA DGX Spark DGX Spark / GB10	39	5826	May 10, 2026
How are you planning on using your DGX spark? DGX Spark / GB10 Projects	22	3340	February 24, 2026