DGX Spark + Qwen3-Next-80B: Proven Performance, But Missing Clear Path to NIM, TensorRT-LLM & Web UIs
Hi NVIDIA team and fellow DGX Spark owners,
After about six weeks with DGX Spark, coming from a macOS / non-Linux background, this box genuinely feels like an “iPhone moment” for local AI: a compact, headless powerhouse that runs 80B-class models like Qwen3-Next-80B-A3B-Thinking at usable speeds on a desk, no datacenter required.
The overall experience has been overwhelmingly positive — but moving from “tinkering” to reliable, production-grade deployment has been much harder than it should be. The hardware and core software (vLLM, CUDA, GB10) are excellent. What’s missing is clear, integrated documentation that connects the dots between models, inference stacks, and user interfaces — especially for Qwen3-Next-80B, which is already performing brilliantly but remains invisible or under-documented in official Spark guidance.
Here’s what I’m seeing — and what would make DGX Spark feel truly turnkey for serious users:
Qwen3-Next-80B on Spark
I am running Qwen3-Next-80B-A3B-Thinking (FP8) via vLLM on a single DGX Spark and seeing ~45 tokens/sec sustained on the ShareGPT_V3_unfiltered_cleaned_split.json workload. This is not a toy benchmark — this is production-ready performance for RAG, agents, and tools.[1]
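For transparency, the throughput numbers in this post come from vLLM's standard serving benchmark over that ShareGPT file. Below is a sketch of the kind of command I use; it assumes a vLLM build recent enough to ship the vllm bench CLI (older builds use python3 benchmarks/benchmark_serving.py with the same flags), and the model name must match the server's --served-model-name (SuperQwen in Appendix A). Exact flag names can vary slightly between releases.

# Sketch only: run the serving benchmark inside the vllm-qwen container from
# Appendix A, where the server listens on port 8000 and the model/dataset
# paths below are already mounted. Flag names may vary between vLLM releases.
docker exec vllm-qwen vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model SuperQwen \
  --tokenizer /models/qwen3-next-80b-fp8 \
  --dataset-name sharegpt \
  --dataset-path /data/sharegpt.json \
  --num-prompts 200 \
  --max-concurrency 1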
Yet, the official DGX Spark model compatibility table (linked from the docs and from DGX Spark) currently stops at Qwen3-32B and omits Qwen3-Next-80B entirely, even though DGX Spark is positioned for 70B–80B-class models.[2][1]
This creates an unfortunate perception: that Qwen3-Next-80B is “not yet supported” or “experimental.” In reality, it is one of the most capable models you can run locally on Spark today for many workloads.[1]
🔧 Request:
- Update the DGX Spark model/quantization table to explicitly include:
- Qwen3-Next-80B-A3B-Thinking-FP8
- Qwen3-Next-80B-A3B-Instruct-NVFP4[1]
- Add baseline benchmarks: tokens/sec, context length, container flags, and HF handles (for example: nvidia/Qwen3-Next-80B-A3B-Thinking-NVFP4).[1]
- Include both vLLM and TensorRT-LLM (where available) throughput/latency numbers.[1]
- Provide a playbook with an "NVIDIA-approved" benchmark for each model on DGX Spark, plus a performance table along the lines of what Hugging Face publishes.[1]
- For each implementation, link to details on how to run it and its performance metrics, as Hugging Face model cards do (for example: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct).[1]
vLLM Playbook and Web UIs
The “vLLM for Inference” playbook is an excellent starting point: pull the container, start the server, run a curl test.[1]
However, it does not mention that vLLM exposes an OpenAI-compatible API, which means it can plug straight into Open WebUI, LangChain, or LlamaIndex with no extra glue code, even though this is how most users expect to interact with a local LLM.[7][1] For example, pointing Open WebUI at the vLLM endpoint takes a single docker run:
docker run -d -p 3000:8080 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
-e OPENAI_API_BASE_URL=http://<spark-ip>:8000/v1 \
ghcr.io/open-webui/open-webui:main
This instantly gives you a full-featured, browser-based chat/RAG UI — with tool calling, memory, multi-model routing, and file uploads.[1]
🔧 Request:
- Add a short section to the vLLM playbook along the lines of:
“For a full web UI, connect Open WebUI to your vLLM endpoint. See Open WebUI’s ‘Starting with vLLM’ guide for details.”[1]
- Include a minimal `docker run` example with `host.docker.internal` guidance for Docker-on-Spark users (a rough sketch follows below).[1]
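For what it's worth, here is a rough sketch of what that guidance might look like when Open WebUI runs in Docker on the same Spark as the vLLM server. The `--add-host` mapping is standard Docker behaviour on Linux; the ports follow the earlier example and are otherwise my assumptions, not official guidance.

# Sketch only: Open WebUI in Docker on the same DGX Spark as the vLLM server.
# host.docker.internal is not defined by default on Linux, so map it to the
# host gateway explicitly; the container can then reach vLLM on host port 8000.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main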
NIM Containers and Spark Identity
The NGC NIM container for Qwen3-Next-80B-A3B-Thinking is well documented: hybrid MoE, ~3.9B active parameters, OpenAI API, and a clear commercial license.[7][1]
However, there is no DGX Spark identity on those pages: no “Tested on DGX Spark / GB10” badge, no link from the DGX Spark hub at https://build.nvidia.com/spark, and no docker run example tuned to Spark’s single-GPU, headless environment.[2][1]
This creates a split between Spark Playbooks (vLLM, NeMo, fine-tuning — Spark-aware) and NIM / NGC (Qwen3-Next, Llama, Mistral — generic, no Spark mapping), which leaves new users wondering if these containers are actually meant to run on Spark.[2][1]
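To make the ask concrete, the generic NIM launch pattern adapted to a headless, single-GPU Spark would presumably look something like the sketch below. The image tag, cache path, and resource flags here are my assumptions based on the standard NIM docs, not validated Spark guidance.

# Sketch only: generic NIM launch pattern adapted to a single-GPU, headless Spark.
# The image tag below is a placeholder; use the real Qwen3-Next NIM tag from NGC
# (and run `docker login nvcr.io` with your NGC key first).
export NGC_API_KEY=<your-ngc-api-key>
docker run -d --name nim-qwen3-next \
  --gpus all \
  --shm-size=16g \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/qwen/qwen3-next-80b-a3b-thinking:latest   # placeholder image tag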
🔧 Request:
- Add a “DGX Spark / GB10” box on relevant NIM container pages that includes:
- “This container is officially supported on DGX Spark.”
- A tested `docker run` or `docker compose` example.
- A link to a "Self-Host NIM on DGX Spark" playbook.[1]
- On the DGX Spark hub, add a "NIM on Spark" section linking directly to these validated NIM containers.[2][1]
- NIM should also support Qwen3-Next-80B-A3B-Instruct-NVFP4.
TensorRT-LLM, SGLang, and Upgrade Paths
The ~45 tok/s baseline with vLLM is excellent, but the next logical step is TensorRT-LLM using NVIDIA’s Qwen3-Next-80B-A3B-Thinking-NVFP4 variant, which is optimized for Blackwell kernels and FP8/NVFP4 inference.[2][1]
At present there is no Spark-specific TensorRT-LLM example for this model, no playbook showing how to convert NVFP4 weights into a TRT-LLM engine on DGX Spark, and no documentation comparing vLLM vs TensorRT-LLM throughput on Spark.[7][1] In my own testing, I have been unable to get Qwen3-Next-80B-A3B-Thinking-NVFP4 working.
On A100/H100, TensorRT-LLM often delivers 2–3× speedups; on GB10 a 1.5× boost would already be transformative, yet in practice this NVFP4 setup has been difficult to get running reliably on DGX Spark.[2][1]
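For context, the generic TensorRT-LLM flow on other GPUs looks roughly like the sketch below. Whether these exact scripts and flags apply to Qwen3-Next NVFP4 on GB10 is precisely what a Spark playbook would need to pin down; the paths, limits, and the per-model location of convert_checkpoint.py are assumptions on my part.

# Sketch of the generic TensorRT-LLM flow on other GPUs; NOT verified for
# Qwen3-Next NVFP4 on GB10. Paths and limits are assumptions.
# 1) Convert the HF checkpoint to a TRT-LLM checkpoint (the convert_checkpoint.py
#    script lives under the model's directory in the TensorRT-LLM examples tree).
python3 convert_checkpoint.py \
  --model_dir /models/Qwen3-Next-80B-A3B-Thinking-NVFP4 \
  --output_dir /models/trtllm-ckpt-qwen3-next

# 2) Build the engine with Spark-appropriate batch and context limits.
trtllm-build \
  --checkpoint_dir /models/trtllm-ckpt-qwen3-next \
  --output_dir /models/trtllm-engine-qwen3-next \
  --max_batch_size 16 \
  --max_input_len 8192

# 3) Serve behind an OpenAI-compatible endpoint (the exact entry point, and
#    whether it takes an engine dir or an HF checkpoint, depends on the release).
trtllm-serve /models/trtllm-engine-qwen3-next --host 0.0.0.0 --port 8000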
🔧 Request:
- Publish a “TensorRT-LLM on DGX Spark” playbook using Qwen3-Next-80B-NVFP4 as the reference model.[1]
- Include:
- Model conversion steps (HF → TRT-LLM engine).
- Engine build flags (
--use_fp8,--max_batch_size, etc.). - Throughput/latency comparison vs vLLM on Spark.[1]
- Optionally, add an SGLang example, since it is mentioned in recipes but never shown in a Spark-centric workflow.[7][1]
Frontends as First-Class Citizens
Open WebUI and Live VLM WebUI are not “nice extras” — they are essential for real-world use of DGX Spark.[1]
- Open WebUI turns vLLM into a complete chat/RAG UI.[1]
- Live VLM WebUI turns NIM + VLMs into a benchmarkable, prompt-tunable interface with cloud fallbacks.[1]
Both are documented today, but in separate playbooks and buried under “Try This” sections, while the main DGX Spark hub focuses on “Try NIM APIs” without clearly highlighting end-to-end UI stacks.[2][1]
🔧 Request:
- Add a “Recommended Stack” section on the main Spark hub, for example:[2][1]
Backend: vLLM / NIM / SGLang
Frontend: Open WebUI / Live VLM WebUI
- On relevant NIM model pages, explicitly state:
“This model’s OpenAI-compatible API works out-of-the-box with Open WebUI and similar UIs.”[1]
- Link the Open WebUI + vLLM flow directly from the vLLM playbook so new users see a complete path from "pull container" to "chat in browser"; the quick endpoint check sketched below is all the backend verification that path needs.[7][1]
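As a concrete example of that "complete path", a single curl against the vLLM endpoint is enough to confirm the OpenAI-compatible API before pointing any UI at it. The port and served model name below follow my Appendix A setup; adjust to your own.

# Quick sanity check of the OpenAI-compatible endpoint before wiring up a UI.
curl http://<spark-ip>:8020/v1/models
curl http://<spark-ip>:8020/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "SuperQwen", "messages": [{"role": "user", "content": "Say hello from DGX Spark."}], "max_tokens": 64}'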
Summary Table — Throughput & Latency Scaling (Local)
| Concurrency | Avg Output tok/s | Peak Output tok/s | Request throughput (req/s) | Mean TTFT (ms) | P99 TTFT (ms) | Benchmark Duration (s) | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 43.16 | 46.00 | 0.20 | 170.53 | 345.00 | 997.58 | Single-user baseline |
| 4 | 87.65 | 100.00 | 0.41 | 221.66 | 407.20 | 489.39 | Good early scaling |
| 8 | 110.14 | 129.00 | 0.51 | 281.27 | 542.87 | 389.35 | Strong linear gain |
| 16 | 136.13 | 162.00 | 0.63 | 398.93 | 1388.29 | 315.34 | Excellent batching efficiency |
| 32 | 136.50 | 177.00 | 0.64 | 21870.15 | 37746.05 | 313.92 | Throughput flat, TTFT spikes |
| 64 | 136.51 | 176.00 | 0.64 | 61352.88 | 85486.18 | 314.92 | Plateau reached, extreme TTFT |
These numbers show excellent scaling from 1 to 16 concurrent requests, with roughly 3.15× improvement in average tok/s, followed by a plateau where compute becomes the bottleneck and TTFT grows dramatically.[1]
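For reproducibility, the sweep behind this table can be scripted as a simple loop over the benchmark command sketched earlier. Again, this is a sketch: the --save-result/--result-filename flags follow vLLM's benchmark interface as I understand it and may differ between releases.

# Sketch: sweep the concurrency levels used in the table above.
for c in 1 4 8 16 32 64; do
  docker exec vllm-qwen vllm bench serve \
    --backend openai-chat \
    --base-url http://localhost:8000 \
    --endpoint /v1/chat/completions \
    --model SuperQwen \
    --tokenizer /models/qwen3-next-80b-fp8 \
    --dataset-name sharegpt \
    --dataset-path /data/sharegpt.json \
    --num-prompts 200 \
    --max-concurrency "$c" \
    --save-result \
    --result-filename "sweep_c${c}.json"
done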
Key Insights & Discussion (Local)
- Throughput Scaling Behavior
- Linear phase (1 → 16 concurrency): Very healthy scaling, from 43 tok/s at 1 concurrent to 136 tok/s at 16, which reflects strong dynamic batching and good utilization of GB10.[2][1]
- Plateau at 32–64: Output throughput stays around 136 tok/s on average, with peaks near 176–177 tok/s, indicating the classic saturation point where tensor core throughput is the limit rather than batching.[1]
- Latency Behavior (TTFT — Time to First Token)
- Up to concurrency 16, mean TTFT remains under ~400 ms, which feels “instant” for interactive chat and RAG applications.[1]
- At 32+, mean TTFT jumps into the 21–61 s range with P99 up to ~85 s, which is unusable for real-time chat but still acceptable for offline batch workloads with high throughput.[1]
- KV Cache & Resource Usage
- Logs show KV cache usage staying low (~3–4%) even at high concurrency, so the bottleneck is not cache capacity but generation compute (a quick way to watch this live is sketched after this list).[1]
- Prefix caching appears to work effectively, suggesting further gains must come from kernel/engine improvements rather than cache tuning.[1]
- Overall Verdict (Local)
- DGX Spark + vLLM with this FP8 MoE model delivers top-tier performance in the 1–16 concurrency range for a single-node Blackwell with 131k context enabled.[2][1]
- Beyond ~16–24 concurrency, the system shifts into a batch-processing mode (high throughput, very high latency), ideal for non-interactive bulk jobs but not chat.[1]
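The KV-cache observation above can be double-checked from vLLM's Prometheus metrics endpoint while a benchmark runs. The metric names below are those exposed by recent vLLM builds as far as I can tell, and may change between releases; the port follows Appendix A.

# Sketch: watch KV-cache usage and request queue depth during a benchmark run.
watch -n 2 "curl -s http://localhost:8020/metrics | grep -E 'vllm:(gpu_cache_usage_perc|num_requests_running|num_requests_waiting)'"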
Overview of Cloud vs Local Performance
To put DGX Spark performance in context, it helps to compare my local Qwen3-Next-80B-A3B-Thinking FP8 setup against cloud-hosted LLMs from xAI, Perplexity, OpenAI, Anthropic, and Google.[3][4][5][6][1]
The focus here is on output tokens per second (tok/s), time to first token (TTFT), and throughput scaling at different concurrencies, using recent 2025–2026 benchmark reports where available. Cloud speeds can fluctuate 20–50% based on load, while local DGX Spark is more consistent but limited to single-node capacity.[3][2][1]
Key assumptions and notes:
- Local baseline: my measurements above (43–136 tok/s, TTFT 170–400 ms up to concurrency 16, plateau near 136 tok/s).[1]
- Cloud tiers: Free/basic tiers tend to have lower rate limits and higher TTFT variance; pro/enterprise tiers offer higher RPM, priority queues, and in some cases dedicated “fast” or “reasoning” modes.[6][3][1]
- Model equivalence: Qwen3-Next-80B is a MoE with ~3.9B active parameters per token, roughly comparable to 70B–80B dense or MoE-class cloud models (e.g., Grok-4 Fast, Sonar Pro), though architectures differ.[4][3][1]
Cloud vs Local Performance Table
| Provider/Model | Tier/Type | Output Tok/s (Median) | TTFT (Median ms) | Max Throughput (Tok/s at Concurrency) | Notes/Scaling Behavior |
|---|---|---|---|---|---|
| Local DGX Spark (Qwen3-Next-80B FP8) | N/A (Owned) | 43–136 | 170–400 | 136 at 16 conc (plateau; TTFT 21s+ at 32+) | Excellent for interactive use (1–16 conc); compute-bound beyond 16. No external queuing; full control and stable performance for private workloads. [1] |
| xAI Grok‑4 Fast (Reasoning) | Basic / API | ~145–344 (streaming) | 2,550–15,000 | ~344 at high conc | Very fast streaming after TTFT; reasoning mode adds “thinking” time, increasing TTFT. Faster than local in raw throughput; slower for low-conc latency. [3] |
| xAI Grok‑4 Standard/Heavy | Pro / Heavy (~$300/mo) | ~44–80 | 13,580–16,110 | ~80–177 at 64 conc | Slower than Fast; strong reasoning but high TTFT. Pro tiers improve queuing and limits; comparable to local plateau but with cloud-level scaling. [8] |
| Perplexity Sonar (70B-based, Pro/Reasoning) | Basic (Tier 0–1) | ~80–120 (est.) | 358–763 | ~150–300 at 50–150 conc (RPM-limited) | Optimized for search with very low TTFT; basic tiers limited by RPM caps. Scales ~3× at max conc; good balance of speed and quality. [4][5][6] |
| Perplexity Sonar (Pro / Deep Research) | Pro / Enterprise (Tier 3–5) | ~100–150 (est.) | 358–604 | ~600+ at 150+ conc | Cerebras-backed Sonar Pro can reach ~1,200 tok/s, giving 4–5× local throughput at high conc; enterprise tiers raise RPM and reduce TTFT variance. [4][5][6] |
| OpenAI GPT‑5 Mini High (~80B equiv.) | Basic API | ~100–140 (est.) | 2,000–5,000 | ~13,000 at 1,000 conc (system-level) | High variability in basic tiers; strong scaling at the platform level but per-user limits around 50–100 RPM. Better for bursts than for ultra-low TTFT. [1] |
| OpenAI GPT‑5 (Full / Pro) | Enterprise | ~120–200 (est.) | 1,000–3,000 | ~39,000 at 1,000 conc (with optimizations) | Enterprise APIs offer priority queues and tuned endpoints, providing 3×+ DGX Spark throughput at scale but with higher TTFT than local. [1] |
| Anthropic Claude 4.5 Sonnet | Basic / Pro | ~60–90 | 1,500–4,000 | ~200–400 at 100 conc | Strong reasoning but slower streaming vs Grok/Sonar; pro tiers raise RPM and smooth out TTFT; good for complex reasoning workloads. [1] |
| Google Gemini 2.5 Pro / Flash | Basic / Enterprise | ~90–120 | 800–2,000 | ~300–500 at 200 conc | Flash modes emphasize speed; enterprise tiers reduce TTFT variability by ~20%. Often competitive with local in TTFT for simple queries. [1] |
Values for cloud models combine vendor claims and third-party reports; they fluctuate with global load and configuration.[5][4][6][3]
Cloud vs Local: Detailed Insights
- Single-User / Interactive Performance
- DGX Spark with Qwen3 FP8 shines for interactive workloads, with TTFT under 400 ms up to 16 concurrent users, making responses feel instantaneous for chat and RAG.[1]
- Cloud options can approach or beat this in some cases (for example, Sonar’s near-“instant” responses), but typically add 1–5 s TTFT in basic tiers and more in reasoning modes like Grok-4 Fast.[5][6][3]
- Scaling and Throughput
- Local DGX Spark plateaus around 136 tok/s on a single node, which is excellent for a small team but insufficient for large-scale public-facing services.[2][1]
- Cloud infrastructure can scale to thousands of concurrent requests, with Sonar on Cerebras reporting ~1,200 tok/s and major platforms delivering 10k+ tok/s at system level.[4][6][5]
- Tier Effects (Free vs Pro/Enterprise)
- Free/basic tiers often have strict rate limits (for example, ~50 RPM) and less predictable TTFT due to shared queues, which can add 2–5× latency under load.[6][3][1]
- Pro/enterprise tiers (e.g., xAI SuperGrok, Perplexity Tier 3+, OpenAI/Anthropic enterprise plans) unlock higher RPM, priority routing, and dedicated “fast” modes that significantly reduce variance.[3][6][1]
- Overall Verdict (Cloud vs Local)
- For low-latency, private workloads (1–16 users), local DGX Spark is extremely compelling, often outperforming cloud APIs in TTFT and consistency while keeping data on-prem.[2][1]
- For high-concurrency or bursty workloads, cloud services — especially in pro and enterprise tiers — win on raw throughput and horizontal scaling, at the cost of higher and more variable TTFT.[4][6][3][1]
Closing Thoughts
The hardware is world-class, the models (especially Qwen3-Next-80B) are cutting-edge, and the software stack (vLLM, NIM, TensorRT-LLM) is powerful.[2][1]
To move DGX Spark from “cool toy” to reliable workstation, the missing piece is cohesive documentation and Spark-aware guidance that makes the end-to-end path obvious:
| Gap | Solution |
|---|---|
| ❌ Qwen3-Next-80B missing from model tables | ✅ Add it with benchmarks (vLLM + TensorRT-LLM) |
| ❌ vLLM playbook stops at curl | ✅ Add Open WebUI integration guide |
| ❌ NIM containers have no Spark flag | ✅ Add “DGX Spark Ready” badge + Spark-specific docker run example |
| ❌ No TensorRT-LLM or SGLang path | ✅ Publish a TensorRT-LLM + SGLang playbook for DGX Spark |
| ❌ Frontends are hidden | ✅ Surface Open WebUI / Live VLM WebUI as first-class, recommended UIs on the main Spark hub |
The DGX Spark has the potential to redefine local AI for developers, and tightening the docs to match the hardware would go a long way toward making that a reality.[1]
Best regards,
Mark Griffith
DGX Spark Owner, Toronto
@MARKDGRIFFITH
Appendix A — docker-compose.yml (vLLM)
services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.12.post1-py3
    container_name: vllm-qwen
    restart: unless-stopped
    runtime: nvidia
    ports:
      # OpenAI-compatible API: host port 8020 -> container port 8000
      - "0.0.0.0:8020:8000"
    volumes:
      - /home/mark/models/Qwen3-Next-80B-A3B-Instruct-FP8:/models/qwen3-next-80b-fp8:ro
      - /home/mark/models/templates/qwen3_chat.jinja:/templates/qwen3_chat.jinja:ro
      - /home/mark/models/test_script/ShareGPT_V3_unfiltered_cleaned_split.json:/data/sharegpt.json:ro
      - /home/mark/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    shm_size: 32g
    command:
      - vllm
      - serve
      - /models/qwen3-next-80b-fp8
      - --dtype
      - auto
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.85"
      - --max-num-seqs
      - "16"
      - --enable-prefix-caching
      - --trust-remote-code
      - --enable-sleep-mode
      - --served-model-name
      - SuperQwen
Appendix B — Dockerfile (Customize Open WebUI)
FROM ghcr.io/open-webui/open-webui:main
# Copy custom assets into the built frontend
COPY ui/custom.css /app/build/static/custom.css
COPY ui/superqwen-modes.js /app/build/static/superqwen-modes.js
COPY ui/model-modes.js /app/build/static/model-modes.js
# Inject CSS (load after built-in styles so it wins)
RUN sed -i 's|</head>|<link rel="stylesheet" href="/static/custom.css">\n<link href="https://fonts.cdnfonts.com/css/argon-2" rel="stylesheet">\n</head>|' /app/build/index.html
# Inject JS (after app bundle, before closing body)
RUN sed -i 's|</body>|<script src="/static/superqwen-modes.js"></script>\n<script src="/static/model-modes.js"></script>\n</body>|' /app/build/index.html
To run the custom WebUI build:
# Stop and remove the old container...
docker stop superqwen-webui 2>/dev/null || true
docker rm superqwen-webui 2>/dev/null || true
# Build the image with --no-cache to force a fresh build
docker build --no-cache -t superqwen-webui .
# Run the new container
docker run -d \
-p 8050:8080 \
-e OPENAI_API_BASE_URL="http://192.168.5.100:8020/v1" \
-e OPENAI_API_KEY="dummy" \
-v superqwen-webui:/app/backend/data \
--name superqwen-webui \
--restart always \
superqwen-webui
VERY IMPORTANT: inject backend parameters via Docker environment variables; relying only on Open WebUI’s frontend configuration has not been sufficient to consistently bind to the vLLM backend in practice.[1]
Sources
[1] How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks | NVIDIA Technical Blog
[2] What is xAI’s Grok 4 Fast? | Artificial Intelligence Learning (Michael Spencer)
[3] Perplexity’s LLM: A Technical Deep Dive on Sonar & PPLX | RankStudio
[4] Perplexity: Sonar Free Chat Online | Skywork.ai
[5] Sonar by Perplexity: The Fastest AI Search Model for Accurate, Real-Time Answers | SaveMyLeads
[6] Qwen3-Next Usage Guide | vLLM Recipes
[7] Everything You Need to Know About Grok 4 | DEV Community
[8] From 20 to 35 TPS on Qwen3-Next-NVFP4 w/ FlashInfer 12.1f
[9] What token/s are you getting when running Qwen3-Next-80B-A3B … | Reddit
[10] Performance of llama.cpp on NVIDIA DGX Spark | ggml-org/llama.cpp Discussion #16578, GitHub
[11] [Bug]: Poor Performance: ~40 t/s for Qwen3-80B-AWQ on Single RTX 6000 | vllm-project/vllm Issue #28667, GitHub
[12] Speed Benchmark | Qwen
[13] Grok 4 Fast - Intelligence, Performance & Price Analysis
[14] [HOW-TO] Compiling vLLM from source on Strix Halo | Framework Community
[15] Grok-4 vs Grok-4 Fast Non-Reasoning


