DGX Spark + Qwen3-Next-80B: Proven Performance, But Missing Clear Path to NIM, TensorRT-LLM & Web UIs

Hi NVIDIA team and fellow DGX Spark owners,

After about six weeks with DGX Spark, coming from a macOS / non-Linux background, I can say this box genuinely feels like an “iPhone moment” for local AI: a compact, headless powerhouse that runs 80B-class models like Qwen3-Next-80B-A3B-Thinking at usable speeds on a desk, no datacenter required.

The overall experience has been overwhelmingly positive — but moving from “tinkering” to reliable, production-grade deployment has been much harder than it should be. The hardware and core software (vLLM, CUDA, GB10) are excellent. What’s missing is clear, integrated documentation that connects the dots between models, inference stacks, and user interfaces — especially for Qwen3-Next-80B, which is already performing brilliantly but remains invisible or under-documented in official Spark guidance.

Here’s what I’m seeing — and what would make DGX Spark feel truly turnkey for serious users:


Qwen3-Next-80B on Spark

I am running Qwen3-Next-80B-A3B-Thinking (FP8) via vLLM on a single DGX Spark and seeing ~45 tokens/sec sustained on the ShareGPT_V3_unfiltered_cleaned_split.json workload. This is not a toy benchmark — this is production-ready performance for RAG, agents, and tools.[1]

Yet, the official DGX Spark model compatibility table (linked from the docs and from DGX Spark) currently stops at Qwen3-32B and omits Qwen3-Next-80B entirely, even though DGX Spark is positioned for 70B–80B-class models.[2][1]

This creates an unfortunate perception: that Qwen3-Next-80B is “not yet supported” or “experimental.” In reality, it is one of the most capable models you can run locally on Spark today for many workloads.[1]

🔧 Request:

  • Update the DGX Spark model/quantization table to explicitly include:
    • Qwen3-Next-80B-A3B-Thinking-FP8
    • Qwen3-Next-80B-A3B-Instruct-NVFP4[1]
  • Add baseline benchmarks: tokens/sec, context length, container flags, and HF handles (for example: nvidia/Qwen3-Next-80B-A3B-Thinking-NVFP4).[1]
  • Include both vLLM and TensorRT-LLM (where available) throughput/latency numbers.[1]
  • Provide a playbook with an “NVIDIA-approved” benchmark for each model on DGX Spark, plus a performance table similar to what Hugging Face publishes.[1]
  • Provide links to more details about each implementation, how to run it, and performance metrics, similar to Hugging Face (for example: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct).[1]

vLLM Playbook and Web UIs

The “vLLM for Inference” playbook is an excellent starting point: pull the container, start the server, run a curl test.[1]

However, it does not mention that vLLM exposes an OpenAI-compatible API, which means it can plug straight into Open WebUI, LangChain, or LlamaIndex with no extra glue code, even though this is how most users expect to interact with a local LLM.[7][1]
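As a quick sanity check of that endpoint (a sketch, assuming the playbook’s default port 8000 and a served model name of SuperQwen as in Appendix A), you can hit the chat completions route directly:

curl http://<spark-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "SuperQwen", "messages": [{"role": "user", "content": "Hello from DGX Spark"}]}'

Once that returns a completion, pointing Open WebUI at the same endpoint is a single docker run: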

docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  -e OPENAI_API_BASE_URL=http://<spark-ip>:8000/v1 \
  ghcr.io/open-webui/open-webui:main

This instantly gives you a full-featured, browser-based chat/RAG UI — with tool calling, memory, multi-model routing, and file uploads.[1]

🔧 Request:

  • Add a short section to the vLLM playbook along the lines of:

“For a full web UI, connect Open WebUI to your vLLM endpoint. See Open WebUI’s ‘Starting with vLLM’ guide for details.”[1]

  • Include a minimal docker run example with host.docker.internal guidance for Docker-on-Spark users.[1]
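A minimal sketch of what that example could look like, assuming Open WebUI itself runs as a Docker container on the Spark and vLLM listens on host port 8000 (on Linux, host.docker.internal only resolves if you add the host-gateway mapping):

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main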

NIM Containers and Spark Identity

The NGC NIM container for Qwen3-Next-80B-A3B-Thinking is well documented: hybrid MoE, ~3.9B active parameters, OpenAI API, and a clear commercial license.[7][1]

However, there is no DGX Spark identity on those pages: no “Tested on DGX Spark / GB10” badge, no link from the DGX Spark hub at https://build.nvidia.com/spark, and no docker run example tuned to Spark’s single-GPU, headless environment.[2][1]

This creates a split between Spark Playbooks (vLLM, NeMo, fine-tuning — Spark-aware) and NIM / NGC (Qwen3-Next, Llama, Mistral — generic, no Spark mapping), which leaves new users wondering if these containers are actually meant to run on Spark.[2][1]

🔧 Request:

  • Add a “DGX Spark / GB10” box on relevant NIM container pages that includes:
    • “This container is officially supported on DGX Spark.”
    • A tested docker run or docker compose example.
    • A link to a “Self-Host NIM on DGX Spark” playbook.[1]
  • On the DGX Spark hub, add a “NIM on Spark” section linking directly to these validated NIM containers.[2][1]
  • NIM should also support Qwen3-Next-80B-A3B-Instruct-NVFP4.
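For illustration, a Spark-tuned example in that box could follow the standard NIM run pattern below; the image path and tag are hypothetical placeholders, and the exact flags are what the playbook would need to confirm for GB10:

export NGC_API_KEY=<your NGC key>
docker login nvcr.io          # username: $oauthtoken, password: your NGC_API_KEY

docker run -d --gpus all \
  -e NGC_API_KEY \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/qwen/qwen3-next-80b-a3b-thinking:latest   # hypothetical image path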

TensorRT-LLM, SGLang, and Upgrade Paths

The ~45 tok/s baseline with vLLM is excellent, but the next logical step is TensorRT-LLM using NVIDIA’s Qwen3-Next-80B-A3B-Thinking-NVFP4 variant, which is optimized for Blackwell kernels and FP8/NVFP4 inference.[2][1]

At present there is no Spark-specific TensorRT-LLM example for this model, no playbook showing how to convert NVFP4 weights into a TRT-LLM engine on DGX Spark, and no documentation comparing vLLM vs TensorRT-LLM throughput on Spark.[7][1] I have been unable to get Qwen3-Next-80B-A3B-Thinking-NVFP4 working.

On A100/H100, TensorRT-LLM often delivers 2–3× speedups; on GB10 a 1.5× boost would already be transformative, yet in practice this NVFP4 setup has been difficult to get running reliably on DGX Spark.[2][1]

🔧 Request:

  • Publish a “TensorRT-LLM on DGX Spark” playbook using Qwen3-Next-80B-NVFP4 as the reference model.[1]
  • Include:
    • Model conversion steps (HF → TRT-LLM engine).
    • Engine build flags (--use_fp8, --max_batch_size, etc.).
    • Throughput/latency comparison vs vLLM on Spark.[1]
  • Optionally, add a SGLang example, since it is mentioned in recipes but never shown in a Spark-centric workflow.[7][1]
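To make the conversion request concrete, here is the rough shape such a playbook could take, following the generic TensorRT-LLM convert-then-build workflow; the script location, flags, and NVFP4 handling for Qwen3-Next are assumptions that the playbook would need to validate on GB10:

# 1. Convert the Hugging Face checkpoint into TRT-LLM checkpoint format
#    (convert_checkpoint.py lives under the relevant model family in the
#    TensorRT-LLM examples tree; Qwen3-Next/NVFP4 coverage is the open question)
python convert_checkpoint.py \
  --model_dir /models/Qwen3-Next-80B-A3B-Thinking-NVFP4 \
  --output_dir /models/qwen3-next-trtllm-ckpt

# 2. Build the engine with Spark-appropriate limits
trtllm-build \
  --checkpoint_dir /models/qwen3-next-trtllm-ckpt \
  --output_dir /models/qwen3-next-trtllm-engine \
  --max_batch_size 16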

Frontends as First-Class Citizens

Open WebUI and Live VLM WebUI are not “nice extras” — they are essential for real-world use of DGX Spark.[1]

  • Open WebUI turns vLLM into a complete chat/RAG UI.[1]
  • Live VLM WebUI turns NIM + VLMs into a benchmarkable, prompt-tunable interface with cloud fallbacks.[1]

Both are documented today, but in separate playbooks and buried under “Try This” sections, while the main DGX Spark hub focuses on “Try NIM APIs” without clearly highlighting end-to-end UI stacks.[2][1]

🔧 Request:

  • Add a “Recommended Stack” section on the main Spark hub, for example:[2][1]

Backend: vLLM / NIM / SGLang
Frontend: Open WebUI / Live VLM WebUI

  • On relevant NIM model pages, explicitly state:

“This model’s OpenAI-compatible API works out-of-the-box with Open WebUI and similar UIs.”[1]

  • Link the Open WebUI + vLLM flow directly from the vLLM playbook so new users see a complete path from “pull container” to “chat in browser.”[7][1]

Summary Table — Throughput & Latency Scaling (Local)

| Concurrency | Avg Output tok/s | Peak Output tok/s | Request throughput (req/s) | Mean TTFT (ms) | P99 TTFT (ms) | Benchmark Duration (s) | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 43.16 | 46.00 | 0.20 | 170.53 | 345.00 | 997.58 | Single-user baseline |
| 4 | 87.65 | 100.00 | 0.41 | 221.66 | 407.20 | 489.39 | Good early scaling |
| 8 | 110.14 | 129.00 | 0.51 | 281.27 | 542.87 | 389.35 | Strong linear gain |
| 16 | 136.13 | 162.00 | 0.63 | 398.93 | 1388.29 | 315.34 | Excellent batching efficiency |
| 32 | 136.50 | 177.00 | 0.64 | 21870.15 | 37746.05 | 313.92 | Throughput flat, TTFT spikes |
| 64 | 136.51 | 176.00 | 0.64 | 61352.88 | 85486.18 | 314.92 | Plateau reached, extreme TTFT |

These numbers show excellent scaling from 1 to 16 concurrent requests, with roughly 3.15× improvement in average tok/s, followed by a plateau where compute becomes the bottleneck and TTFT grows dramatically.[1]
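A sweep like this can be reproduced with a loop around the same vllm bench serve workflow shown further down this thread (a sketch; ports and paths follow Appendix A, and the prompt count per run is illustrative):

# run inside the vLLM container (docker exec -it vllm-qwen bash); the server listens on 8000 internally
for conc in 1 4 8 16 32 64; do
  vllm bench serve \
    --backend openai-chat \
    --base-url http://localhost:8000 \
    --endpoint /v1/chat/completions \
    --model SuperQwen \
    --tokenizer /models/qwen3-next-80b-fp8 \
    --dataset-name sharegpt \
    --dataset-path /data/sharegpt.json \
    --num-prompts 200 \
    --max-concurrency "$conc" \
    --request-rate inf \
    --seed 42 \
    --save-result
done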


Key Insights & Discussion (Local)

  1. Throughput Scaling Behavior

    • Linear phase (1 → 16 concurrency): Very healthy scaling, from 43 tok/s at 1 concurrent to 136 tok/s at 16, which reflects strong dynamic batching and good utilization of GB10.[2][1]
    • Plateau at 32–64: Output throughput stays around 136 tok/s on average, with peaks near 176–177 tok/s, indicating the classic saturation point where tensor core throughput is the limit rather than batching.[1]
  2. Latency Behavior (TTFT — Time to First Token)

    • Up to concurrency 16, mean TTFT remains under ~400 ms, which feels “instant” for interactive chat and RAG applications.[1]
    • At 32+, mean TTFT jumps into the 21–61 s range with P99 up to ~85 s, which is unusable for real-time chat but still acceptable for offline batch workloads with high throughput.[1]
  3. KV Cache & Resource Usage

    • Logs show KV cache usage staying low (~3–4%) even at high concurrency, so the bottleneck is not cache capacity but generation compute.[1]
    • Prefix caching appears to work effectively, suggesting further gains must come from kernel/engine improvements rather than cache tuning.[1]
  4. Overall Verdict (Local)

    • DGX Spark + vLLM with this FP8 MoE model delivers top-tier performance in the 1–16 concurrency range for a single-node Blackwell with 131k context enabled.[2][1]
    • Beyond ~16–24 concurrency, the system shifts into a batch-processing mode (high throughput, very high latency), ideal for non-interactive bulk jobs but not chat.[1]

Overview of Cloud vs Local Performance

To put DGX Spark performance in context, it helps to compare your local Qwen3-Next-80B-A3B-Thinking FP8 setup against cloud-hosted LLMs from xAI, Perplexity, OpenAI, Anthropic, and Google.[3][4][5][6][1]

The focus here is on output tokens per second (tok/s), time to first token (TTFT), and throughput scaling at different concurrencies, using recent 2025–2026 benchmark reports where available. Cloud speeds can fluctuate 20–50% based on load, while local DGX Spark is more consistent but limited to single-node capacity.[3][2][1]

Key assumptions and notes:

  • Local baseline: Your measurements (43–136 tok/s, TTFT 170–400 ms up to concurrency 16, plateau near 136 tok/s).[1]
  • Cloud tiers: Free/basic tiers tend to have lower rate limits and higher TTFT variance; pro/enterprise tiers offer higher RPM, priority queues, and in some cases dedicated “fast” or “reasoning” modes.[6][3][1]
  • Model equivalence: Qwen3-Next-80B is a MoE with ~3.9B active parameters per token, roughly comparable to 70B–80B dense or MoE-class cloud models (e.g., Grok-4 Fast, Sonar Pro), though architectures differ.[4][3][1]

Cloud vs Local Performance Table

| Provider/Model | Tier/Type | Output Tok/s (Median) | TTFT (Median ms) | Max Throughput (Tok/s at Concurrency) | Notes/Scaling Behavior |
|---|---|---|---|---|---|
| Local DGX Spark (Qwen3-Next-80B FP8) | N/A (Owned) | 43–136 | 170–400 | 136 at 16 conc (plateau; TTFT 21s+ at 32+) | Excellent for interactive use (1–16 conc); compute-bound beyond 16. No external queuing; full control and stable performance for private workloads. [1] |
| xAI Grok‑4 Fast (Reasoning) | Basic / API | ~145–344 (streaming) | 2,550–15,000 | ~344 at high conc | Very fast streaming after TTFT; reasoning mode adds “thinking” time, increasing TTFT. Faster than local in raw throughput; slower for low-conc latency. [3] |
| xAI Grok‑4 Standard/Heavy | Pro / Heavy (~$300/mo) | ~44–80 | 13,580–16,110 | ~80–177 at 64 conc | Slower than Fast; strong reasoning but high TTFT. Pro tiers improve queuing and limits; comparable to local plateau but with cloud-level scaling. [8] |
| Perplexity Sonar (70B-based, Pro/Reasoning) | Basic (Tier 0–1) | ~80–120 (est.) | 358–763 | ~150–300 at 50–150 conc (RPM-limited) | Optimized for search with very low TTFT; basic tiers limited by RPM caps. Scales ~3× at max conc; good balance of speed and quality. [4][5][6] |
| Perplexity Sonar (Pro / Deep Research) | Pro / Enterprise (Tier 3–5) | ~100–150 (est.) | 358–604 | ~600+ at 150+ conc | Cerebras-backed Sonar Pro can reach ~1,200 tok/s, giving 4–5× local throughput at high conc; enterprise tiers raise RPM and reduce TTFT variance. [4][5][6] |
| OpenAI GPT‑5 Mini High (~80B equiv.) | Basic API | ~100–140 (est.) | 2,000–5,000 | ~13,000 at 1,000 conc (system-level) | High variability in basic tiers; strong scaling at the platform level but per-user limits around 50–100 RPM. Better for bursts than for ultra-low TTFT. [1] |
| OpenAI GPT‑5 (Full / Pro) | Enterprise | ~120–200 (est.) | 1,000–3,000 | ~39,000 at 1,000 conc (with optimizations) | Enterprise APIs offer priority queues and tuned endpoints, providing 3×+ DGX Spark throughput at scale but with higher TTFT than local. [1] |
| Anthropic Claude 4.5 Sonnet | Basic / Pro | ~60–90 | 1,500–4,000 | ~200–400 at 100 conc | Strong reasoning but slower streaming vs Grok/Sonar; pro tiers raise RPM and smooth out TTFT; good for complex reasoning workloads. [1] |
| Google Gemini 2.5 Pro / Flash | Basic / Enterprise | ~90–120 | 800–2,000 | ~300–500 at 200 conc | Flash modes emphasize speed; enterprise tiers reduce TTFT variability by ~20%. Often competitive with local in TTFT for simple queries. [1] |

Values for cloud models combine vendor claims and third-party reports; they fluctuate with global load and configuration.[5][4][6][3]


Cloud vs Local: Detailed Insights

  1. Single-User / Interactive Performance

    • DGX Spark with Qwen3 FP8 shines for interactive workloads, with TTFT under 400 ms up to 16 concurrent users, making responses feel instantaneous for chat and RAG.[1]
    • Cloud options can approach or beat this in some cases (for example, Sonar’s near-“instant” responses), but typically add 1–5 s TTFT in basic tiers and more in reasoning modes like Grok-4 Fast.[5][6][3]
  2. Scaling and Throughput

    • Local DGX Spark plateaus around 136 tok/s on a single node, which is excellent for a small team but insufficient for large-scale public-facing services.[2][1]
    • Cloud infrastructure can scale to thousands of concurrent requests, with Sonar on Cerebras reporting ~1,200 tok/s and major platforms delivering 10k+ tok/s at system level.[4][6][5]
  3. Tier Effects (Free vs Pro/Enterprise)

    • Free/basic tiers often have strict rate limits (for example, ~50 RPM) and less predictable TTFT due to shared queues, which can add 2–5× latency under load.[6][3][1]
    • Pro/enterprise tiers (e.g., xAI SuperGrok, Perplexity Tier 3+, OpenAI/Anthropic enterprise plans) unlock higher RPM, priority routing, and dedicated “fast” modes that significantly reduce variance.[3][6][1]
  4. Overall Verdict (Cloud vs Local)

    • For low-latency, private workloads (1–16 users), local DGX Spark is extremely compelling, often outperforming cloud APIs in TTFT and consistency while keeping data on-prem.[2][1]
    • For high-concurrency or bursty workloads, cloud services — especially in pro and enterprise tiers — win on raw throughput and horizontal scaling, at the cost of higher and more variable TTFT.[4][6][3][1]

Closing Thoughts

The hardware is world-class, the models (especially Qwen3-Next-80B) are cutting-edge, and the software stack (vLLM, NIM, TensorRT-LLM) is powerful.[2][1]

To move DGX Spark from “cool toy” to reliable workstation, the missing piece is cohesive documentation and Spark-aware guidance that makes the end-to-end path obvious:

| Gap | Solution |
|---|---|
| ❌ Qwen3-Next-80B missing from model tables | ✅ Add it with benchmarks (vLLM + TensorRT-LLM) |
| ❌ vLLM playbook stops at curl | ✅ Add Open WebUI integration guide |
| ❌ NIM containers have no Spark flag | ✅ Add “DGX Spark Ready” badge + Spark-specific docker run example |
| ❌ No TensorRT-LLM or SGLang path | ✅ Publish a TensorRT-LLM + SGLang playbook for DGX Spark |
| ❌ Frontends are hidden | ✅ Surface Open WebUI / Live VLM WebUI as first-class, recommended UIs on the main Spark hub |

The DGX Spark has the potential to redefine local AI for developers, and tightening the docs to match the hardware would go a long way toward making that a reality.[1]


Best regards,

Mark Griffith
DGX Spark Owner, Toronto
@MARKDGRIFFITH


Appendix A — docker-compose.yml (vLLM)

services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.12.post1-py3
    container_name: vllm-qwen
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "0.0.0.0:8020:8000"
    volumes:
      - /home/mark/models/Qwen3-Next-80B-A3B-Instruct-FP8:/models/qwen3-next-80b-fp8:ro
      - /home/mark/models/templates/qwen3_chat.jinja:/templates/qwen3_chat.jinja:ro
      - /home/mark/models/test_script/ShareGPT_V3_unfiltered_cleaned_split.json:/data/sharegpt.json:ro
      - /home/mark/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    shm_size: 32g
    command:
      - vllm
      - serve
      - /models/qwen3-next-80b-fp8
      - --dtype
      - auto
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.85"
      - --max-num-seqs
      - "16"
      - --enable-prefix-caching
      - --trust-remote-code
      - --enable-sleep-mode
      - --served-model-name
      - SuperQwen
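To bring the stack up and confirm the endpoint responds (the mapping above exposes vLLM on host port 8020), the usual flow is:

docker compose up -d
docker logs -f vllm-qwen
curl http://localhost:8020/v1/models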

Appendix B — Dockerfile (Customize Open WebUI)

FROM ghcr.io/open-webui/open-webui:main

# Copy custom assets into the built frontend
COPY ui/custom.css /app/build/static/custom.css
COPY ui/superqwen-modes.js /app/build/static/superqwen-modes.js
COPY ui/model-modes.js /app/build/static/model-modes.js

# Inject CSS (load after built-in styles so it wins)
RUN sed -i 's|</head>|<link rel="stylesheet" href="/static/custom.css">\n<link href="https://fonts.cdnfonts.com/css/argon-2" rel="stylesheet">\n</head>|' /app/build/index.html

# Inject JS (after app bundle, before closing body)
RUN sed -i 's|</body>|<script src="/static/superqwen-modes.js"></script>\n<script src="/static/model-modes.js"></script>\n</body>|' /app/build/index.html

To run the custom WebUI build:

# Stop and remove the old container...
docker stop superqwen-webui 2>/dev/null || true
docker rm superqwen-webui 2>/dev/null || true

# Build the image with --no-cache to force a fresh build
docker build --no-cache -t superqwen-webui .

# Run the new container
docker run -d \
  -p 8050:8080 \
  -e OPENAI_API_BASE_URL="http://192.168.5.100:8020/v1" \
  -e OPENAI_API_KEY="dummy" \
  -v superqwen-webui:/app/backend/data \
  --name superqwen-webui \
  --restart always \
  superqwen-webui

VERY IMPORTANT: inject backend parameters via Docker environment variables; relying only on Open WebUI’s frontend configuration has not been sufficient to consistently bind to the vLLM backend in practice.[1]

Sources

[1] How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks | NVIDIA Technical Blog
[2] What is xAI’s Grok 4 Fast? - Artificial Intelligence Learning (Michael Spencer)
[3] Perplexity’s LLM: A Technical Deep Dive on Sonar & PPLX | RankStudio
[4] Perplexity: Sonar Free Chat Online - Skywork.ai
[5] Sonar by Perplexity: The Fastest AI Search Model for Accurate, Real-Time Answers | SaveMyLeads
[6] Qwen3-Next Usage Guide - vLLM Recipes
[7] Everything You Need to Know About Grok 4 - DEV Community
[8] From 20 to 35 TPS on Qwen3-Next-NVFP4 w/ FlashInfer 12.1f
[9] What token/s are you getting when running Qwen3-Next-80B-A3B … - Reddit
[10] Performance of llama.cpp on NVIDIA DGX Spark · ggml-org/llama.cpp · Discussion #16578 · GitHub
[11] [Bug]: Poor Performance: ~40 t/s for Qwen3-80B-AWQ on Single RTX 6000 · Issue #28667 · vllm-project/vllm · GitHub
[12] Speed Benchmark - Qwen
[13] Grok 4 Fast - Intelligence, Performance & Price Analysis
[14] [HOW-TO] Compiling VLLM from source on Strix Halo - Framework Desktop - Framework Community
[15] Grok-4 vs Grok-4 Fast Non-Reasoning


Hi Mark,

Thank you for this incredibly detailed post. We’re ecstatic that you’ve been enjoying your experience with the DGX Spark, and at the same time, have been as excited to improve it. We hear your feedback and are communicating with engineering to integrate changes. Once again, thank you for contributing your feedback to the community!

Hi Mark,

I am having some problems with your config.
I am trying with this YAML:

vllm-qwen:
  image: nvcr.io/nvidia/vllm:25.12.post1-py3
  container_name: vllm-qwen
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
  environment:
    - VLLM_LOAD_FORMAT=safetensors
    - MAX_JOBS=4
    - VLLM_USE_V1=0
  volumes:
    - /home/gmarconi/models/Qwen3-Next-80B-A3B-Instruct-FP8:/models/qwen3-next-80b-fp8:ro
    - /home/gmarconi/.cache/huggingface:/root/.cache/huggingface
  ipc: host
  shm_size: 32g
  command: >
    vllm serve /models/qwen3-next-80b-fp8
    --dtype auto
    --gpu-memory-utilization 0.85   # tried 0.80-0.95
    --max-model-len 16384
    --kv-cache-dtype fp8
    --trust-remote-code
    --enforce-eager
    --max-num-seqs 16
    --enable-prefix-caching
    --trust-remote-code
    --enable-sleep-mode
    --served-model-name SuperQwen
    --swap-space 4
  networks:
    - ai-network

but the system hangs because it runs out of RAM. I tried to increase swap by adding a 64 GB swap file, but the OOM still happens.
Are you still able to run Qwen3-Next-80B on your 128 GB system?
Thanks

Giacomo

I am actively using this posted Qwen3‑Next‑80B‑A3B‑Instruct‑FP8 setup on DGX Spark; it has become the primary LLM in my stack and has been performing very well for my single‑user workloads. It is great to see others picking up that earlier post and experimenting with the configuration on their own.

Why this config blows up

On newer vLLM versions (v0.11+ and V1 engine variants), --kv-cache-dtype fp8 plus VLLM_USE_V1=0 is a fragile combination and can trigger crashes, illegal memory access, or bad outputs. The fixes are small: stop forcing FP8 KV cache and the legacy engine, and lean on the V1 engine and the Qwen3‑Next FP8/MoE flags instead.[2][3][4][5][6]

What to change (and why)

| Feature | Your Config | Stable Config | Why Fix? |
|---|---|---|---|
| --kv-cache-dtype fp8 | set | ❌ Remove | Known to cause wrong outputs or crashes with V1 and some backends; default (auto → BF16/FP16 KV) is safer and usually fast enough.[3][4] |
| VLLM_USE_V1=0 | set | ❌ Remove | Disables V1’s optimized engine; current images default to V1, which is where Qwen3‑Next FP8 tuning lives.[6][7] |
| --enable-chunked-prefill | not set | ✅ Add | Important for 80B FP8 models; prevents giant prefill allocations on long context.[5][8] |
| VLLM_USE_FLASHINFER_MOE_FP8=1 | not set | ✅ Add | Routes MoE layers through optimized FP8 kernels designed for these Qwen3‑Next FP8 variants.[5][9] |
| VLLM_ATTENTION_BACKEND=FLASH_ATTN | not set | ✅ Add | Uses FlashAttention‑style kernels for better VRAM efficiency and throughput.[5][8] |
| VLLM_FLASHINFER_MOE_BACKEND=latency | not set | ✅ Add | Picks the low‑latency MoE routing backend, which is also memory‑friendly for interactive loads.[5][7] |
| VLLM_USE_DEEP_GEMM=0 | not set | ✅ Add | Avoids some experimental GEMM paths that can bloat peak memory usage on certain GPUs.[7][9] |
| --gpu-memory-utilization | 0.85 | 0.85–0.9 | For this model you generally want at least 0.85 to load and have usable KV cache; many Qwen3‑Next FP8 setups land around 0.9.[10][11] |
| --max-model-len | 16384 | 16384 (131072+ possible) | 16k is fine; you can go higher once stable.[5] |
| --max-num-seqs | 16 | 8–16 | 8 is a safe starting point; 16 works if VRAM allows and helps throughput.[8] |

Final docker‑compose.yml

services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.12.post1-py3
    container_name: vllm-qwen
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "0.0.0.0:8020:8000"
    volumes:
      - /home/gmarconi/models/Qwen3-Next-80B-A3B-Instruct-FP8:/models/qwen3-next-80b-fp8:ro
      - /home/gmarconi/quant_dev_dgx/superqwen/templates/qwen3_chat.jinja:/templates/qwen3_chat.jinja:ro
      - /home/gmarconi/models/test_script/ShareGPT_V3_unfiltered_cleaned_split.json:/data/sharegpt.json:ro
      - /home/gmarconi/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    shm_size: 32g
    environment:
      - VLLM_USE_FLASHINFER_MOE_FP8=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
      - VLLM_ATTENTION_BACKEND=FLASH_ATTN
      - VLLM_USE_DEEP_GEMM=0
    command:
      - vllm
      - serve
      - /models/qwen3-next-80b-fp8
      - --dtype
      - auto
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.85"
      - --max-num-seqs
      - "16"
      - --enable-prefix-caching
      - --enable-chunked-prefill
      - --enable-sleep-mode
      - --trust-remote-code
      - --enforce-eager
      - --served-model-name
      - SuperQwen
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"

qwen3_chat.jinja prompt template

{% for message in messages %}
{% if message['role'] == 'system' %}
{{ message['content'] }}
{% elif message['role'] == 'user' %}
{% if loop.index0 == 0 %}
<|im_start|>user
{{ message['content'] }}<|im_end|>
{% else %}
<|im_start|>user
{{ message['content'] }}<|im_end|>
{% endif %}
{% elif message['role'] == 'assistant' %}
<|im_start|>assistant
{{ message['content'] }}<|im_end|>
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
<|im_start|>assistant
{% endif %}
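One wiring note: the compose files above mount this template at /templates/qwen3_chat.jinja but the serve command never references it, so vLLM falls back to the chat template bundled with the model. If you do want this file to take effect, add the flag to the command list (assuming you intend to override the built-in template):

      - --chat-template
      - /templates/qwen3_chat.jinja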

Benchmarking with ShareGPT

Good luck getting it all wired up—hopefully this clears the OOMs on your Spark; once it’s stable, it would be great if you can run some benchmarks so we can compare performance between your system and this setup.

For load testing, the ShareGPT V3 “unfiltered_cleaned_split” JSON is handy because it contains a large number of real multi‑turn conversations spanning many tasks (chat, coding, reasoning, casual Q&A), already cleaned and split into manageable chunks. This makes it a decent approximation of mixed real‑world traffic and it’s widely used in LLM training/benchmarking scripts, so it’s easy to compare results across different stacks. You can download the exact file used in the volume mount from:[12][13][14]

https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json
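If you are grabbing it from the command line, the raw JSON is served from the resolve/ path rather than the blob/ page (a small sketch):

wget -O ShareGPT_V3_unfiltered_cleaned_split.json \
  "https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json"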

With the container running, you can benchmark from inside it (inside the container, vLLM listens on its internal port 8000; 8020 is only the host-side mapping):

# From host: enter the running container
docker exec -it vllm-qwen bash

# Run a 1000-prompt benchmark against SuperQwen
vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model SuperQwen \
  --tokenizer Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --dataset-name sharegpt \
  --dataset-path /data/sharegpt.json \
  --num-prompts 1000 \
  --max-concurrency 16 \
  --request-rate inf \
  --seed 42 \
  --save-result

If you do run this, sharing your throughput/latency numbers would be really interesting to compare against other DGX Spark runs with the same model.

Sources
[1] docker-compose.yml https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/attachments/30233483/d86ced06-852f-49d5-b161-ffa2d19ce4d9/docker-compose.yml
[2] [Bug]: vllm/vllm-openai:v0.11.0 deployment --quantization fp8 throws cuda and tensor errors · Issue #29374 · vllm-project/vllm · GitHub
[3] [Bug]: Enabling fp8 KV cache quantization and prefix caching at the same time on Radeon (W7900/RDNA3) crashes the process · Issue #13147 · vllm-project/vllm · GitHub
[4] [Bug]: [V1] wrong output when using kv cache fp8 · Issue #13133 · vllm-project/vllm · GitHub
[5] Qwen3-Next Usage Guide - vLLM Recipes
[6] vLLM V1 User Guide — vLLM
[7] Environment Variables — vLLM
[8] vllm serve — vLLM
[9] NVIDIA Nemotron-3-Nano-30B-A3B User Guide - vLLM Recipes
[10] Optimization and Tuning — vLLM
[11] Qwen/Qwen3-Next-80B-A3B-Instruct · How much GPU memory is needed for local deployment?
[12] ShareGPT_V3_unfiltered_cleaned_split.json · anon8231489123/ShareGPT_Vicuna_unfiltered at main
[13] anon8231489123/ShareGPT_Vicuna_unfiltered · Datasets at Hugging Face
[14] [Feature]: benchmarks for vllm, it should support OpenAI Chat Completions API · Issue #17586 · vllm-project/vllm · GitHub
[15] What does gpu memory utilisation include? - General - vLLM Forums

Hi Mark

thanks for the answer and the config. I had to add VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 to your docker-compose.yml to make it start.

I played with --max-model-len, --gpu-memory-utilization, and --max-num-seqs,

but the error is always the same:
(EngineCore_DP0 pid=209) ERROR 01-23 11:34:32 [core.py:843] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 0 has a total capacity of 119.64 GiB of which 550.64 MiB is free. Including non-PyTorch memory, this process has 115.02 GiB memory in use. Of the allocated memory 114.76 GiB is allocated by PyTorch, with 48.29 MiB allocated in private pools (e.g., CUDA Graphs), and 1.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management ( CUDA semantics — PyTorch 2.10 documentation )

and more important:
vllm-qwen | ERROR: This container was built for NVIDIA Driver Release 590.44 or later, but
vllm-qwen | version 580.126.09 was detected and compatibility mode is UNAVAILABLE.

Giacomo

Hi Giacomo,

Sounds like you have a driver issue and a vLLM config issue. Let’s go step by step and work out what is going on.


1. First: update DGX Spark via the console

Before changing any models, YAML, or Docker settings, please fully update your DGX Spark using the built‑in DGX Dashboard / console update flow. This ensures you are on the current DGX OS, drivers, CUDA, and DGX‑tuned vLLM stack that NVIDIA ships for Spark.

On the DGX console:

  • Go to Settings → Updates.
  • Apply all available OS / firmware / software updates.
  • Reboot the system once everything completes.

(If for some reason the console is not accessible, you can fall back to the standard CLI flow from the docs to update OS and components, but the Dashboard path is preferred on Spark.)

Once you have done that reboot, then let’s move to the vLLM test.


2. Second: run my vLLM docker-compose.yml exactly

After the system is fully updated, please try running exactly the docker-compose.yml from my post, without any edits the first time.

Port mapping note (important):

In my docker-compose.yml I deliberately map vLLM’s internal port 8000 out to 8020 on the host:

ports:
  - "0.0.0.0:8020:8000"

This is non‑standard on purpose: I already run another standalone LLM on the DGX Spark that binds to host port 8000, so I avoid conflicts by exposing vLLM on 8020 instead.

  • If you are not running anything else on port 8000, you can:
    • Either keep my mapping as‑is and call http://<spark-ip>:8020/v1/..., or
    • Change it back to the “default” mapping "0.0.0.0:8000:8000" and use port 8000, matching the standard vLLM examples.

For now, to reproduce my setup as closely as possible, I recommend leaving it at 8020 and just adjusting your curl calls accordingly.

Then run:

docker compose up -d
docker logs -f vllm-qwen

Please let me know:

  • Does the container stay up or crash?
  • Do you see vLLM report that it is listening on 0.0.0.0:8020 inside the container?
  • From another machine (or the host), does this work:
curl http://<spark-ip>:8020/v1/models

If that returns a JSON model list, we have confirmed that your updated Spark stack + my base compose file can serve Qwen3‑Next‑80B correctly.

If this does not work:

docker ps

This should return something like this:

c32d8a7601f2 superqwen-webui "bash start.sh" 38 hours ago Up 28 hours (healthy) 0.0.0.0:8050->8080/tcp, [::]:8050->8080/tcp superqwen-webui

9d957f33691e nvcr.io/nvidia/vllm:25.12.post1-py3 "/opt/nvidia/nvidia_…" 3 days ago Up 28 hours 0.0.0.0:8020->8000/tcp vllm-qwen

You can then check the endpoints for vllm-qwen.


3. After that works:

Once we know the vanilla setup is healthy on your system, what are you trying to layer on top or customize?

Good Luck!

Mark

Is the 590 driver supported on the DGX Spark? I have updated everything in the DGX Dashboard, but the driver version is still 580, and in the release notes I cannot find the GB10 / DGX Spark.

NOTE:
Do not update to 590 yet; it failed and the GPU driver was gone. It took me an hour to get the Spark running again.

I am not sure what is going on here:

I have the following drivers:

Maybe Nvidia support could provide some assistance.

Confirm you are on the correct image:

services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.12.post1-py3

In Perplexity, ask: “nvcr.io/nvidia/vllm:25.12.post1-py3: which drivers does it use?”

It will give you an output like this, with clickable links to the technical references:

nvcr.io/nvidia/vllm:25.12.post1-py3 is a CUDA 13.1–based container, so it needs a fairly recent 58x‑series driver on the host.
From the vLLM 25.12 release notes:
• vLLM 25.12 containers are built on CUDA 13.1.0.
• CUDA 13.1 requires a R580+ driver; NVIDIA’s support matrix shows CUDA 13.x generally mapping to driver 580.xx or later.
• There is a confirmed setup running nvcr.io/nvidia/vllm:25.12-py3 on driver 580.119.02 (CUDA 13.0) in compatibility mode, which implies that any 580‑series datacenter driver should work for 25.12/25.12.post1.
On DGX Spark specifically, the current DGX OS release ships with a 580‑series driver that is listed as compatible with the vLLM 25.12 container in the Spark vLLM playbook and release notes, so if you’ve updated via the DGX console you’re in the supported range.
In short: that image runs on host drivers in the 580.xx DGX/datacenter branch (or newer), with CUDA 13.x support; anything older than 550 is too old for this container.

Hi Giacomo,

I noticed this in the log files when I was looking at vLLM restarting.

  Using CUDA 13.1 driver version 590.44.01 with kernel driver version 580.126.09.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I think the link may provide the answer regarding driver incompatibility issue.

Good Luck

Mark

Hi Giacomo,

SOLVED:

I actually hit this error myself today.

Your latest docker-vllmqwen.yml is probably syntactically correct. If you are seeing that exact error, Docker is probably running an old container created from the previous, broken file.

Do this to be sure you’re using the new config:

# stop and remove any old vllm containers
docker rm -f vllm-qwen vllm-qwen-old 2>/dev/null || true

# start fresh from docker-vllmqwen.yml
docker compose -f docker-vllmqwen.yml up -d

# watch the logs
docker logs -f vllm-qwen --tail 50

Restart, and it should work.

Hopefully this fixes it.

Mark
