HOW-TO: setup-dgx-spark docker inference - A "Sane" Inference Stack for GB10 (Need Contributors!)

Hi everyone,

Like many of you, I was incredibly excited to get my hands on the DGX Spark (GB10), but that excitement quickly turned into frustration when I realized how much time I was spending just on the “plumbing” - getting the drivers to play nice, configuring the container runtime for the architecture, and wrestling with multi-model handling.

I realized we are all probably reinventing the wheel in our own silos.

So, I decided to open-source my internal stack. The goal is simple: turn the DGX Spark setup from a weeks-long project into a 30-minute task.

I’ve published the initial version here:
👉 https://github.com/jdaln/dgx-spark-inference-stack

What’s in the box right now:

  • “Production”-Ready Inference: A pre-configured Docker Compose stack for serving large models (e.g., gpt-oss, Llama 3, Qwen) using optimized vLLM, without the headache of manual flag tuning. And I don’t mount the host IPC namespace, just as the official guides say.

  • Observability: Built-in monitoring for memory usage so you can spot problems and optimize later, because we all know how hot these Blackwell chips can run under load.
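To give a flavour of what such a Compose stack looks like, here is a minimal sketch of a single vLLM service with GPU access. The service name, image tag, and model are illustrative only, not taken from the repo:

```yaml
# Illustrative sketch, not the actual compose/models-*.yml from the repo.
services:
  vllm-llama3:
    image: vllm/vllm-openai:latest     # the repo picks a per-model image
    command: >
      --model meta-llama/Meta-Llama-3-8B-Instruct
      --gpu-memory-utilization 0.8
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

The real stack layers model-specific images and the waker/monitoring pieces on top of services shaped roughly like this.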

Why I’m posting this:
I want this to be the “community” starter kit so we can focus on building apps, not debugging drivers and models. I’m looking for contributors to help with testing, adding models, and making improvements (some suggestions are in the TODO.md).

If you’re tired of the setup grind, give it a spin and let me know what breaks. PRs are very welcome!

Let’s make the DGX Spark actually usable for everyone here. 🚀

Cheers,
jd36


One piece of feedback: a lot of the current setups are aiming at chatbot/programming types of inference. A Stable Diffusion setup might be a little different, and in my opinion it might be safer to make that a separate project. The major issue is the server: vLLM and SGLang usually have a separate branch (usually called omni) to deal with Stable Diffusion, and the omni branch might not keep up with the development of the main branch.

Thank you for the feedback @paulsc.liu! I can see that you read my comment in DGX Spark: The Sovereign AI Stack — Dual-Model Architecture for Local Inference - #10 by jd36.

The reasoning there was that this line: dgx-spark-inference-stack/compose/models-gpt.yml at 37385ff9766ae78f8da6022df7b85de0fb2f3d27 · jdaln/dgx-spark-inference-stack · GitHub could really be any Docker image that runs inference. Currently, the vLLM image differs depending on the model; from memory there are 3-4 different ones.

Of course, with multimedia inference there will be a change in the waker’s monitored prefix, which is currently set to vllm-. It could become inference-, for instance. If no one gets ahead of me (it would be great if someone did), I’ll start exploring this in a month or two, because this is a free-time project.
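For illustration, generalizing the prefix could be as small a change as renaming the Compose services the waker watches. Everything below (service name, image) is hypothetical rather than copied from the repo:

```yaml
# Hypothetical rename: the waker currently matches containers whose
# names start with "vllm-"; a neutral "inference-" prefix would let the
# same mechanism cover non-vLLM backends (e.g. a diffusion server).
services:
  inference-gpt-oss:                  # hypothetical; was e.g. vllm-gpt-oss
    image: vllm/vllm-openai:latest    # or any OpenAI-compatible server image
```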

It has not been the case for a long time now: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks
I believe most of the people on this forum are already using it and it has already become our community starter kit.

I’m not trying to be critical, don’t get me wrong, just curious what’s different with your setup?


Hi @eugr! I think your repo is more oriented towards people working directly on their Spark, or did I miss something? The goal of what I made is that you curl (or otherwise call) the model, it loads on demand, and it switches off after some time if not in use.

No, it just provides a tested/optimized way to run any vLLM supported model on Spark - either standalone or cluster.

For instance, there are other well-tested and maintained solutions for model switching/proxying, e.g. llama-swap (model loading/switching on demand) and LiteLLM (a proxy with fallbacks, etc.).

My personal stack is llama-swap sitting on Spark and providing model loading/switching between a mix of vLLM and llama.cpp models, launching both on a single spark and in the cluster. I group models by size, so I can have three models running at the same time (if needed) - one large running on a cluster (e.g. minimax-m2.1 via vLLM), one medium-sized (qwen3-vl-8b in q8 via llama.cpp) and one embedding model (qwen-embedding-8b, currently via llama.cpp, but will probably switch to vllm).

And I also have LiteLLM that I use as my main endpoint for all clients that routes requests to one of my servers (not just Spark cluster) with fallback, etc. It also serves cloud models (Claude, ChatGPT).

I guess there is value in providing a “one-click” type of integration that sets up llama-swap and our community Docker (and llama.cpp) together without reinventing either.
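As a rough sketch of what that one-click glue could look like: a small script that writes a llama-swap config whose cmd/cmdStop entries delegate to the community repo's run-recipe.sh and launch-cluster.sh. The model name, port, and script paths are placeholders, not tested against either repo:

```shell
#!/bin/sh
# Hypothetical one-click helper: generate a minimal llama-swap config
# that delegates model start/stop to the community repo's scripts.
CONFIG="${1:-llama-swap.yaml}"
cat > "$CONFIG" <<'EOF'
healthCheckTimeout: 500

models:
  "gpt-oss-120b":
    proxy: http://127.0.0.1:8888
    cmd: |
      ./run-recipe.sh gpt-oss-120b
    cmdStop: |
      ./launch-cluster.sh stop
EOF
echo "wrote $CONFIG"
```

You would then point llama-swap at the generated file; adding more models is just more entries in the heredoc (or a loop over recipe names).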

I see, thanks for the pointers! I’ll have to look into it when I get time to sit down. Have you been able to test more models than the ones you posted in spark-vllm-docker/recipes at main · eugr/spark-vllm-docker · GitHub? The problem when you have a single Spark is that you still need a small model alongside.

Also, how would you see a possible “one-click” type integration that can set up llama-swap and your repo? Does it run a script or what would be the entry point?

Yes, @raphael.amorim and I are working on it. You are welcome to join!

llama-swap is just one self-contained binary, so it can run on the host system without Docker, although I believe it supports a fully Dockerized setup as well. Here is my config example (this one has only two groups defined). It was created before the recipes, so I used separate shell scripts that call launch-cluster.sh, but now it can just call run-recipe.sh with the model name as a parameter.

stop-cluster.sh just stops the container by calling launch-cluster.sh stop

healthCheckTimeout: 500

macros:
  "llama-server": >
    /home/eugr/llm/llama.cpp/build/bin/llama-server
    --port ${PORT}
    --offline
    --no-mmap

models:
  "minimax-m2":
    useModelName: "QuantTrio/MiniMax-M2-AWQ"
    proxy: http://127.0.0.1:8888
    cmd: |
      /home/eugr/llm/vllm-launchers/start_minimax.sh
    cmdStop: |
      /home/eugr/llm/vllm-launchers/stop_cluster.sh

  "gpt-oss-120b":
    # ttl: 300
    cmd: |
      ${llama-server}
      -hf ggml-org/gpt-oss-120b-GGUF
      --jinja -ngl 99
      --ctx-size 0
      -b 2048 -ub 2048
      -fa on
      --temp 1.0
      --top-p 1.0
      --top-k 0
      --reasoning-format auto
      --chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
      -kvu
      -np 10

  "glm-4.5-air":
    cmd: |
      ${llama-server}
      -hf unsloth/GLM-4.5-Air-GGUF:Q4_K_XL
      --jinja
      -c 0
      -fa on
      -ub 2048

  "qwen3-coder-30b":
    ttl: 600
    cmd: |
      ${llama-server}
      -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q6_K_XL
      --jinja -ngl 99
      -c 131072
      --temp 0.7
      --min-p 0.0
      --top-p 0.80
      --top-k 20
      --repeat-penalty 1.05
      -fa on

  "qwen2.5-vl-7b":
    cmd: |
      ${llama-server}
      -hf unsloth/Qwen2.5-VL-7B-Instruct-GGUF:Q6_K_XL
      --jinja
      -ngl 99 -fa on
      -c 16384
      --temp 0.7
      --min-p 0.0
      --top-p 0.80
      --top-k 20
      --repeat-penalty 1.05

  "qwen3-vl-8b":
    cmd: |
      ${llama-server}
      -hf unsloth/Qwen3-VL-8B-Instruct-GGUF:Q6_K_XL
      --jinja
      -ngl 99 -fa on
      -c 16384
      -kvu
      -np 4

groups:
  "big-ones":
    swap: true
    exclusive: false
    members:
      - "gpt-oss-120b"
      - "glm-4.5-air"
      - "minimax-m2"

  "medium":
    swap: true
    exclusive: false
    members:
      - "qwen3-coder-30b"
      - "qwen2.5-vl-7b"
      - "qwen3-vl-8b"

Would you consider having a docker compose to orchestrate all that?


For llama-swap integration, absolutely!

I was thinking of the whole stack, but that does not seem quite possible with llama-swap.

Yeah, it’s only possible for llama-server. If we want to run Docker containers from it, it needs to run on the host system.

However, since it’s a single binary, it can still be automated via shell scripts and configuration (that can also be autogenerated).
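One way to automate the host-side piece is sketched below as a systemd unit; the binary and config paths are placeholders, so adjust them to wherever llama-swap actually lives on your Spark:

```ini
# Hypothetical systemd unit keeping llama-swap alive on the host.
[Unit]
Description=llama-swap model proxy
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-swap --config /etc/llama-swap/config.yaml
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```

An installer script could drop this unit and an autogenerated config in place, then `systemctl enable --now` it, which would get close to the “one-click” experience discussed above.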

Great post. If anyone manages to get qwen-next-coder (FP4) going, I’d be super grateful for the specs. I’ve been trying to get it working for 4 days with no joy.

Native FP8 version works pretty well on a single Spark and gives ~43 t/s.

This is how I am running Qwen Next Coder FP8 on my Spark.

@eugr anything you would change on that?

I don’t think you need half of those parameters.
This is how I launch it:

./launch-cluster.sh --solo \
exec vllm serve Qwen/Qwen3-Coder-Next-FP8 \
	--enable-auto-tool-choice \
	--tool-call-parser qwen3_coder \
	--gpu-memory-utilization 0.8 \
	--host 0.0.0.0 --port 8888 \
	--load-format fastsafetensors \
	--attention-backend flashinfer \
	--enable-prefix-caching

This is using our community docker at GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

It should run with your Docker image too, with the same vLLM command (minus the fastsafetensors bit, as I don’t think it’s included in that image).

Thanks! That image does have fastsafetensors:

Loading safetensors using Fastsafetensor loader: 2% Completed | 1/40 [00:01<00:59, 1.53s/it]

I used your repo scripts back in December/January and built some images with them, but I haven’t used them in a while since I found this scitrera image. I might give it a shot again, as I saw some commits came in.

Do you feel like the lack of a model-specific MoE config file (as warned by vLLM) could be something worth diving into?

The one from Triton? It can help performance if you can generate the config. I think it can be done with vllm bench, but I’ve never had time for that.

Our build has got a lot of improvements in the past month:

  • optimized build for gpt-oss by @christopher_owen
  • just migrated to pytorch 2.10/triton 3.6.0
  • lots of improvements in launch script, new mods
  • new recipe system for launching models more easily, thanks to @raphael.amorim

You can use launch-cluster/run-recipe and their functionality (in cluster or solo mode) with other images, including the scitrera ones, by the way.


@eugr , will you be adding it to spark-vllm-docker/recipes at main · eugr/spark-vllm-docker · GitHub ?

Adding what?