Community release · GB10 / SM12.1 · vLLM + IronClaw + LiteLLM · May 2026
Tested on GB10 SM12.1 · 2 or 3 models simultaneous · Telegram agent · Open source · ~120GB stack
TL;DR
I spent a week fighting SM12.1 quirks, memory bloat, and broken kernels on my DGX Spark. The result is a tested, reproducible stack that runs 3 specialized models simultaneously in ~120GB, includes a Telegram agent, and sets up in 30 minutes. Everything is open source.
Repo: spark-inference — recipes, scripts, IronClaw agent install, LiteLLM proxy, SM12.1 notes
Built on top of eugr/spark-vllm-docker (prebuilt SM12.1 wheels, ~10-min build)
Benchmark results
The stack supports two modes; you choose based on your needs: eager mode (CUDA graphs disabled) for 3 models simultaneously, or CUDA graphs mode for 1-2 models.
Architecture
Telegram / CLI / OpenWebUI
      │
      ▼
IronClaw agent → LiteLLM proxy (port 4000)
                             │
                ┌────────────┼────────────┐
                ▼            ▼            ▼
            port 8000    port 8001    port 8002
          Nemotron-Nano   Qwen3.6    Primus/Sec
              ~32GB        ~45GB        ~35GB
          orchestrator    coding      security
What’s in the repo
vLLM recipes (memory-optimized for GB10)
| Model | Format | RAM | Role |
|---|---|---|---|
| Nemotron-3-Nano-30B NVFP4 | NVFP4 | ~32GB | Orchestrator |
| Qwen3.6-35B-A3B | FP8 | ~45GB | Coding + Vision |
| Llama-Primus-Reasoning (Trend Micro) | BF16 | ~35GB | Pentest + Reasoning |
| Foundation-Sec-8B-Instruct (Cisco) | BF16 | ~35GB | CVE / MITRE ATT&CK / SOC |
| Nemotron-3-Super-120B | NVFP4 | ~87GB | Single powerful mode |
| Qwen3-235B-A22B | FP8 | ~115GB | Single powerful mode |
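As a quick sanity check on the RAM column, the estimates can be summed against the Spark's 128GB of unified memory before picking a combination. This is a minimal sketch, not part of the repo: the short model keys are made up for illustration, and the numbers ignore OS and desktop overhead.

```bash
# RAM estimates in GB, copied from the table above.
# The short keys are hypothetical labels, not recipe names from the repo.
ram_gb() {
  case "$1" in
    nemotron-nano)  echo 32 ;;
    qwen3.6)        echo 45 ;;
    primus)         echo 35 ;;
    foundation-sec) echo 35 ;;
    nemotron-super) echo 87 ;;
    qwen3-235b)     echo 115 ;;
    *) echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# Sum the estimates for every model passed as an argument.
total_gb() {
  sum=0
  for m in "$@"; do
    gb=$(ram_gb "$m") || return 1
    sum=$((sum + gb))
  done
  echo "$sum"
}

# Succeed only if the selection fits in the budget given as first argument.
fits() {
  budget="$1"; shift
  total=$(total_gb "$@") || return 1
  [ "$total" -le "$budget" ]
}

total_gb nemotron-nano qwen3.6 primus                  # prints 112
fits 128 nemotron-nano qwen3.6 primus && echo "fits"
fits 128 nemotron-super qwen3.6 || echo "too large"    # 87 + 45 = 132
```

The three-model default lands at 112GB, which leaves some headroom on a 128GB machine, while the 120B model plus a second model does not fit.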
IronClaw agent (Telegram + CLI)
Reproducible install script for IronClaw on aarch64 — includes all the fixes needed for GB10: PostgreSQL sslmode, DB settings override after onboard, systemd service with correct env vars, Telegram pairing flow. One script, no wizard required after first run.
LiteLLM proxy
Unifies all vLLM endpoints under a single OpenAI-compatible port (4000). IronClaw routes to the right model by name. Idempotent install — re-run to update or add models.
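To make the routing concrete, here is a rough sketch of what a LiteLLM `config.yaml` for three vLLM endpoints can look like, generated from a small shell helper. The aliases and served model names are placeholders (the repo's install script may name them differently), and I am assuming LiteLLM's `hosted_vllm/` provider prefix with an `api_base` that ends in `/v1`.

```bash
# Print one model_list entry.
# $1 = alias clients ask for, $2 = model name served by vLLM, $3 = vLLM port
# All three values used below are illustrative placeholders.
litellm_entry() {
  cat <<EOF
  - model_name: $1
    litellm_params:
      model: hosted_vllm/$2
      api_base: http://localhost:$3/v1
EOF
}

# Print a full config mapping one alias to each vLLM port.
litellm_config() {
  echo "model_list:"
  litellm_entry orchestrator nemotron-nano 8000
  litellm_entry coding       qwen3.6       8001
  litellm_entry security     primus        8002
}

litellm_config
# Save the output as config.yaml, then start the proxy with:
#   litellm --config config.yaml --port 4000
```

A client then sends a normal OpenAI-style request to port 4000 with `"model": "coding"` (or whichever alias), and the proxy forwards it to the matching vLLM port.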
Startup manager + watchdog
start-all.sh handles two things: interactive model selection on demand, and automatic recovery after a reboot.
In interactive mode it shows available memory, lists all recipes with RAM estimates, and lets you pick which models to start — including switching between eager (3 models) and CUDA graphs (1-2 models) mode.
The watchdog runs as a systemd service that activates 5 minutes after boot, checks which models should be running, and starts them automatically. No manual intervention needed after a Spark reboot.
# Interactive — pick models manually
bash scripts/start-all.sh
# Install watchdog (auto-recovery 5 min after reboot)
bash scripts/start-all.sh --install-watchdog
# Uninstall watchdog
bash scripts/start-all.sh --uninstall-watchdog
# Force auto-recovery now (same as watchdog runs)
bash scripts/start-all.sh --auto
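For readers wondering how a "5 minutes after boot" trigger can be expressed in systemd, a timer with `OnBootSec=5min` driving a oneshot service is one common pattern. The sketch below is not copied from the repo: the unit name `spark-watchdog` and the install path are hypothetical, and the real `--install-watchdog` may do this differently.

```bash
# Print a oneshot service that runs the auto-recovery command once.
# $1 = path to the spark-inference checkout (hypothetical default below)
watchdog_service() {
  repo_dir="${1:-/opt/spark-inference}"
  cat <<EOF
[Unit]
Description=Restart vLLM models after boot (sketch)

[Service]
Type=oneshot
ExecStart=/bin/bash $repo_dir/scripts/start-all.sh --auto
EOF
}

# Print a timer that fires the service 5 minutes after boot.
watchdog_timer() {
  cat <<EOF
[Unit]
Description=Run the model watchdog 5 minutes after boot (sketch)

[Timer]
OnBootSec=5min
Unit=spark-watchdog.service

[Install]
WantedBy=timers.target
EOF
}

watchdog_service
watchdog_timer
```

Saved as `spark-watchdog.service` and `spark-watchdog.timer` under `/etc/systemd/system/`, the pair would be activated with `systemctl enable --now spark-watchdog.timer`.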
Key SM12.1 findings (learned the hard way)
Default vLLM uses 117GB for a 19GB model — 89GB goes to KV cache pre-allocation for concurrent requests you’ll never have. Fix: --gpu-memory-utilization 0.25 drops it to ~32GB. (Credit: sggin1’s memory post)
CUDA graphs need ~130GB+ per model — not feasible with 3 models. Disable with --enforce-eager when running multi-model stacks. Enable --compilation-config for single/dual model mode.
FlashAttn doesn’t support SM12.1 — use VLLM_ATTENTION_BACKEND=FLASHINFER.
CUTLASS FP4 kernels crash on SM12.1 — use Marlin: VLLM_NVFP4_GEMM_BACKEND=marlin.
DeepSeek-R1-0528 full model not usable — MLA attention crashes with all available backends on SM12.1. Use distilled versions (Qwen2 architecture) instead.
FlashInfer MoE throughput backend crashes — use VLLM_FLASHINFER_MOE_BACKEND=latency.
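Put together, the findings above translate into a launch line like the one sketched below. The environment variables and flags are the ones listed above; the model ID is a placeholder, and the 128GB total used for the budget arithmetic is an assumption about the Spark's unified memory.

```bash
# Convert a --gpu-memory-utilization fraction into a GB budget.
# $1 = fraction, $2 = total memory in GB (defaults to 128, an assumption)
budget_gb() {
  awk -v util="$1" -v total="${2:-128}" 'BEGIN { printf "%.0f\n", util * total }'
}

# Print (not run) a launch command for the multi-model eager mode.
# $1 = model ID (placeholder), $2 = port, $3 = memory fraction
# The Marlin variable only matters for NVFP4 models, the MoE one for MoE models.
eager_cmd() {
  env_vars="VLLM_ATTENTION_BACKEND=FLASHINFER VLLM_NVFP4_GEMM_BACKEND=marlin VLLM_FLASHINFER_MOE_BACKEND=latency"
  echo "$env_vars vllm serve $1 --port $2 --gpu-memory-utilization $3 --enforce-eager"
}

budget_gb 0.25    # prints 32
budget_gb 0.90    # prints 115 (0.90 is vLLM's default fraction)
eager_cmd your-org/your-model 8000 0.25
```

The 0.25 fraction works out to the ~32GB figure quoted above, and the default 0.90 lands close to the ~117GB I saw before tuning.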
Quick start
Spark Inference repo: https://github.com/VictorGil-Ops/spark-inference (personal inference stack for NVIDIA DGX Spark GB10 / SM12.1 Blackwell)
# 1. Build base container (~30 min, one time)
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker && ./build-and-copy.sh
# 2. Clone spark-inference
git clone https://github.com/VictorGil-Ops/spark-inference.git
# 3. Launch models interactively
bash spark-inference/spark.sh
# 4. Install IronClaw + Telegram (optional) and more
menu
Credits
eugr/spark-vllm-docker: The foundation. Prebuilt SM12.1 wheels, custom NCCL build for DGX Spark ring topology, ~10-min build vs 60+ min from source.
sggin1's memory post: The definitive analysis of where your 128GB actually goes. The gpu_memory_utilization strategy saved ~85GB per model.
SM12.1 research notes and TurboQuant KV compression research (PR #38479 — promising, not merged yet).
IronClaw: Rust-based agent framework. No Node.js event loop blocking, PostgreSQL + pgvector memory, WASM sandboxed channels, Telegram polling.
Tested on DGX Spark GB10 · CUDA 13.2 · vLLM main (eugr build) · May 2026
Feedback welcome — especially if you find a way to run CUDA graphs with 3 models simultaneously, or if DeepSeek-R1 MLA gets fixed on SM12.1.
Updates
- 2026-05-08
Added the gemma-4-nvfp4 model and ran new tests to fine-tune weights and speed.