HOW-TO: setup-dgx-spark docker inference - A "Sane" Inference Stack for GB10 (Need Contributors!)

Hi everyone,

Like many of you, I was incredibly excited to get my hands on the DGX Spark (GB10), but that excitement quickly turned into frustration when I realized how much time I was spending just on the “plumbing” - getting the drivers to play nice, configuring the container runtime for the architecture, and wrestling with multi-model handling.

I realized we are all probably reinventing the wheel in our own silos.

So, I decided to open-source my internal stack. The goal is simple: turn the DGX Spark setup from a weeks-long project into a 30-minute task.

I’ve published the initial version here:
👉 https://github.com/jdaln/dgx-spark-inference-stack

What’s in the box right now:

  • “Production”-Ready Inference: A pre-configured Docker Compose stack for serving large models (e.g., gpt-oss, Llama 3, Qwen) using optimized vLLM, without the headache of manual flag tuning. And I don’t mount the host IPC namespace, just as the official guides say.

  • Observability: Built-in monitoring for memory usage so you can spot problems and optimize later, because we all know how hot these Blackwell chips can run under load.
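To give a flavour of what such a Compose stack looks like, here is a minimal sketch of a single vLLM service with GPU access. The service name, image tag, and model are illustrative only, not taken from the repo:

```yaml
# Illustrative sketch, not the actual compose/models-*.yml from the repo.
services:
  vllm-llama3:
    image: vllm/vllm-openai:latest     # the repo picks a per-model image
    command: >
      --model meta-llama/Meta-Llama-3-8B-Instruct
      --gpu-memory-utilization 0.8
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

The real stack layers model-specific images and the waker/monitoring pieces on top of services shaped roughly like this.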

Why I’m posting this:
I want this to be the “community” starter kit so we can focus on building apps, not debugging drivers and models. I’m looking for contributors to help with testing, adding models, and making improvements (some suggestions are in the TODO.md).

If you’re tired of the setup grind, give it a spin and let me know what breaks. PRs are very welcome!

Let’s make the DGX Spark actually usable for everyone here. 🚀

Cheers,
jd36


One piece of feedback: a lot of the current setups are aiming at chatbot/programming types of inference. A Stable Diffusion setup might be a little different, and in my opinion it might be safer to make that a separate project. The major issue is the server: vLLM and SGLang usually have a separate branch (usually called omni) to deal with Stable Diffusion, and the omni branch might not keep up with the development of the main branch.

Thank you for the feedback @paulsc.liu! I can see that you read my comment in DGX Spark: The Sovereign AI Stack — Dual-Model Architecture for Local Inference - #10 by jd36.

The reasoning there was that this line: dgx-spark-inference-stack/compose/models-gpt.yml at 37385ff9766ae78f8da6022df7b85de0fb2f3d27 · jdaln/dgx-spark-inference-stack · GitHub could really be any Docker image that runs inference. Currently, the vLLM image differs depending on the model; from memory there are 3-4 different ones.

Of course, with multimedia inference there will be a change in the waker’s monitored prefix, which is currently set to vllm-. It could become inference-, for instance. If no one gets ahead of me (it would be great if someone did), I’ll start exploring this in a month or two, because this is a free-time project.
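For illustration, generalizing the prefix could be as small a change as renaming the Compose services the waker watches. Everything below (service name, image) is hypothetical rather than copied from the repo:

```yaml
# Hypothetical rename: the waker currently matches containers whose
# names start with "vllm-"; a neutral "inference-" prefix would let the
# same mechanism cover non-vLLM backends (e.g. a diffusion server).
services:
  inference-gpt-oss:                  # hypothetical; was e.g. vllm-gpt-oss
    image: vllm/vllm-openai:latest    # or any OpenAI-compatible server image
```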

It has not been the case for a long time now: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks
I believe most of the people on this forum are already using it and it has already become our community starter kit.

I’m not trying to be critical, don’t get me wrong, just curious what’s different with your setup?


Hi @eugr! I think your repo is more oriented towards people working directly on their Spark, or did I miss something? The goal of what I made is that you curl (or otherwise call) the model, it loads on demand, and it switches off after some time if not in use.

No, it just provides a tested/optimized way to run any vLLM supported model on Spark - either standalone or cluster.

For instance, there are other well-tested and maintained solutions for model switching/proxying, e.g. llama-swap (model loading/switching on demand) and LiteLLM (a proxy with fallbacks, etc.).

My personal stack is llama-swap sitting on Spark and providing model loading/switching between a mix of vLLM and llama.cpp models, launching both on a single spark and in the cluster. I group models by size, so I can have three models running at the same time (if needed) - one large running on a cluster (e.g. minimax-m2.1 via vLLM), one medium-sized (qwen3-vl-8b in q8 via llama.cpp) and one embedding model (qwen-embedding-8b, currently via llama.cpp, but will probably switch to vllm).

And I also have LiteLLM that I use as my main endpoint for all clients that routes requests to one of my servers (not just Spark cluster) with fallback, etc. It also serves cloud models (Claude, ChatGPT).

I guess there is value in providing a “one-click” type of integration that sets up llama-swap and our community Docker (and llama.cpp) together without reinventing either.
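As a rough sketch of what that one-click glue could look like: a small script that writes a llama-swap config whose cmd/cmdStop entries delegate to the community repo's run-recipe.sh and launch-cluster.sh. The model name, port, and script paths are placeholders, not tested against either repo:

```shell
#!/bin/sh
# Hypothetical one-click helper: generate a minimal llama-swap config
# that delegates model start/stop to the community repo's scripts.
CONFIG="${1:-llama-swap.yaml}"
cat > "$CONFIG" <<'EOF'
healthCheckTimeout: 500

models:
  "gpt-oss-120b":
    proxy: http://127.0.0.1:8888
    cmd: |
      ./run-recipe.sh gpt-oss-120b
    cmdStop: |
      ./launch-cluster.sh stop
EOF
echo "wrote $CONFIG"
```

You would then point llama-swap at the generated file; adding more models is just more entries in the heredoc (or a loop over recipe names).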

I see, thanks for the pointers! I’ll have to look into it when I get time to sit down. Have you been able to test more models than the ones you posted in spark-vllm-docker/recipes at main · eugr/spark-vllm-docker · GitHub? The problem when you have a single Spark is that you still need a small model alongside.

Also, how would you see a possible “one-click” type integration that can set up llama-swap and your repo? Does it run a script or what would be the entry point?

Yes, @raphael.amorim and I are working on it. You are welcome to join!

llama-swap is just one self-contained binary, so it can run on the host system without Docker, although I believe it supports a fully Dockerized setup as well. Here is my config example (this one has only two groups defined). It was created before the recipes, so I used separate shell scripts that call launch-cluster.sh, but now it can just call run-recipe.sh with the model name as a parameter.

stop-cluster.sh just stops the container by calling launch-cluster.sh stop

healthCheckTimeout: 500

macros:
  "llama-server": >
    /home/eugr/llm/llama.cpp/build/bin/llama-server
    --port ${PORT}
    --offline
    --no-mmap

models:
  "minimax-m2":
    useModelName: "QuantTrio/MiniMax-M2-AWQ"
    proxy: http://127.0.0.1:8888
    cmd: |
      /home/eugr/llm/vllm-launchers/start_minimax.sh
    cmdStop: |
      /home/eugr/llm/vllm-launchers/stop_cluster.sh

  "gpt-oss-120b":
    # ttl: 300
    cmd: |
      ${llama-server}
      -hf ggml-org/gpt-oss-120b-GGUF
      --jinja -ngl 99
      --ctx-size 0
      -b 2048 -ub 2048
      -fa on
      --temp 1.0
      --top-p 1.0
      --top-k 0
      --reasoning-format auto
      --chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
      -kvu
      -np 10

  "glm-4.5-air":
    cmd: |
      ${llama-server}
      -hf unsloth/GLM-4.5-Air-GGUF:Q4_K_XL
      --jinja
      -c 0
      -fa on
      -ub 2048

  "qwen3-coder-30b":
    ttl: 600
    cmd: |
      ${llama-server}
      -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q6_K_XL
      --jinja -ngl 99
      -c 131072
      --temp 0.7
      --min-p 0.0
      --top-p 0.80
      --top-k 20
      --repeat-penalty 1.05
      -fa on

  "qwen2.5-vl-7b":
    cmd: |
      ${llama-server}
      -hf unsloth/Qwen2.5-VL-7B-Instruct-GGUF:Q6_K_XL
      --jinja
      -ngl 99 -fa on
      -c 16384
      --temp 0.7
      --min-p 0.0
      --top-p 0.80
      --top-k 20
      --repeat-penalty 1.05

  "qwen3-vl-8b":
    cmd: |
      ${llama-server}
      -hf unsloth/Qwen3-VL-8B-Instruct-GGUF:Q6_K_XL
      --jinja
      -ngl 99 -fa on
      -c 16384
      -kvu
      -np 4

groups:
  "big-ones":
    swap: true
    exclusive: false
    members:
      - "gpt-oss-120b"
      - "glm-4.5-air"
      - "minimax-m2"

  "medium":
    swap: true
    exclusive: false
    members:
      - "qwen3-coder-30b"
      - "qwen2.5-vl-7b"
      - "qwen3-vl-8b"

Would you consider having a docker compose to orchestrate all that?


For llama-swap integration, absolutely!

I was thinking of the whole stack, but that does not seem quite possible with llama-swap.

Yeah, it’s only possible for llama-server. If we want to run Docker containers from it, it needs to run on the host system.

However, since it’s a single binary, it can still be automated via shell scripts and configuration (that can also be autogenerated).
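One way to automate the host-side piece is sketched below as a systemd unit; the binary and config paths are placeholders, so adjust them to wherever llama-swap actually lives on your Spark:

```ini
# Hypothetical systemd unit keeping llama-swap alive on the host.
[Unit]
Description=llama-swap model proxy
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-swap --config /etc/llama-swap/config.yaml
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```

An installer script could drop this unit and an autogenerated config in place, then `systemctl enable --now` it, which would get close to the “one-click” experience discussed above.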

Great post. If anyone manages to get qwen-next-coder (FP4) going, I’d be super grateful for the specs. I’ve been trying to get it working for 4 days with no joy.

Native FP8 version works pretty well on a single Spark and gives ~43 t/s.

This is how I am running Qwen Next Coder FP8 on my Spark.

@eugr anything you would change on that?

I don’t think you need half of those parameters.
This is how I launch it:

./launch-cluster.sh --solo \
exec vllm serve Qwen/Qwen3-Coder-Next-FP8 \
	--enable-auto-tool-choice \
	--tool-call-parser qwen3_coder \
	--gpu-memory-utilization 0.8 \
	--host 0.0.0.0 --port 8888 \
	--load-format fastsafetensors \
	--attention-backend flashinfer \
	--enable-prefix-caching

This is using our community docker at GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

It should run with your Docker image too, with the same vLLM command (minus the fastsafetensors bit, as I don’t think it’s included in that image).

Thanks! That image does have fastsafetensors:

Loading safetensors using Fastsafetensor loader: 2% Completed | 1/40 [00:01<00:59, 1.53s/it]

I used your repo scripts back in December/January and built some images with them, but I haven’t used them in a while since I found this scitrera image. I might give it a shot again, as I saw some commits came in.

Do you feel like the lack of a model-specific MoE config file (as warned by vLLM) could be something worth diving into?

The one from Triton? It can help performance if you can generate the config. I think it can be done with vllm bench, but I’ve never had time for that.

Our build has got a lot of improvements in the past month:

  • optimized build for gpt-oss by @christopher_owen
  • just migrated to pytorch 2.10/triton 3.6.0
  • lots of improvements in launch script, new mods
  • new recipe system for launching models more easily, thanks to @raphael.amorim

You can use launch-cluster/run-recipe and their functionality (in cluster or solo mode) with other images, including the scitrera ones, by the way.


@eugr , will you be adding it to spark-vllm-docker/recipes at main · eugr/spark-vllm-docker · GitHub ?

Adding what?