Like many of you, I was incredibly excited to get my hands on the DGX Spark (GB10), but that excitement quickly turned into frustration when I realized how much time I was spending just on the “plumbing” - getting the drivers to play nice, configuring the container runtime for the architecture, and wrestling with multi-model handling.
I realized we are all probably reinventing the wheel in our own silos.
So, I decided to open-source my internal stack. The goal is simple: turn the DGX Spark setup from a weeks-long project into a 30-minute task.
“Production”-Ready Inference: a pre-configured Docker Compose stack for serving large models (e.g., OSS, Llama 3, Qwen) using optimized vLLM, without the headache of manual flag tuning. And I don’t mount IPC, just like the official guides say.
Observability: built-in monitoring for memory usage, so you can spot issues now and optimize later - because we all know how hot these Blackwell chips can run under load.
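To give a flavor of the Compose side, here is a minimal sketch of what such a service could look like - the image tag, model name, and port are placeholders, not the actual stack:

```shell
# Hypothetical minimal docker-compose.yml for a vLLM service.
# Image tag, model name, and port are placeholders, not the real stack.
cat > docker-compose.yml <<'EOF'
services:
  vllm:
    image: vllm/vllm-openai:latest   # placeholder tag
    command: ["--model", "Qwen/Qwen3-8B", "--host", "0.0.0.0", "--port", "8000"]
    ports:
      - "8000:8000"
    # note: no IPC namespace sharing here, per the point above
EOF
echo "wrote docker-compose.yml"
```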
Why I’m posting this:
I want this to be the “community” starter kit so we can focus on building apps, not debugging drivers and models. I’m looking for contributors to help with testing, adding models, and general improvements (some suggestions are in the TODO.md).
If you’re tired of the setup grind, give it a spin and let me know what breaks. PRs are very welcome!
Let’s make the DGX Spark actually usable for everyone here. 🚀
One piece of feedback: a lot of the current setup is aimed at chatbot/programming-style inference. A Stable Diffusion setup might look a bit different, and in my opinion it might be safer to make that a separate project. The major issue is the server side: vLLM and SGLang usually maintain a separate branch (usually called omni) to deal with Stable Diffusion, and the omni branch might not keep up with development on the main branch.
Of course, once multimedia inference is in the mix, the waker’s monitored prefix will need to change; it is currently set to vllm- and could become inference-, for instance. If no one gets ahead of me (it would be great if someone did), I’ll start exploring this in a month or two, because this is a free-time project.
Hi @eugr ! I think your repo is more oriented towards people working directly on their spark or did I miss something? The goal of what I made is that you curl or otherwise call the model and it loads on demand and switches off after some time if not in use.
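Concretely, the flow is just an OpenAI-style request against the proxy - the endpoint and model name below are placeholders - and the model loads on the first call, then switches off after the idle timeout:

```shell
# Hypothetical proxy endpoint; the first request triggers the model load,
# and the model is switched off after the idle timeout expires.
PAYLOAD='{"model":"qwen3-vl-8b","messages":[{"role":"user","content":"hello"}]}'

# The actual call (commented out so the snippet runs without a live server):
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"

echo "$PAYLOAD"
```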
No, it just provides a tested/optimized way to run any vLLM-supported model on Spark - either standalone or in a cluster.
For instance, there are other well-tested and maintained solutions for model switching/proxying, e.g. llama-swap (model loading/switching on demand) and LiteLLM (a proxy with fallbacks, etc.).
My personal stack is llama-swap sitting on Spark and providing model loading/switching between a mix of vLLM and llama.cpp models, launching both on a single spark and in the cluster. I group models by size, so I can have three models running at the same time (if needed) - one large running on a cluster (e.g. minimax-m2.1 via vLLM), one medium-sized (qwen3-vl-8b in q8 via llama.cpp) and one embedding model (qwen-embedding-8b, currently via llama.cpp, but will probably switch to vllm).
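Roughly, such size-based groups could be expressed in a llama-swap config along these lines - this is a simplified sketch, not my actual file, and the commands, paths, and group options are illustrative:

```shell
# Sketch of a llama-swap config with size-based model groups.
# Commands, paths, and group options are illustrative, not an actual setup.
cat > llama-swap.yaml <<'EOF'
models:
  "minimax-m2.1":
    cmd: ./run-recipe.sh minimax-m2.1        # vLLM recipe on the cluster
  "qwen3-vl-8b":
    cmd: llama-server --port ${PORT} -m qwen3-vl-8b-q8.gguf
  "qwen-embedding-8b":
    cmd: llama-server --port ${PORT} -m qwen-embedding-8b.gguf --embedding

groups:
  large:
    exclusive: false     # assumption: lets other groups stay loaded
    members: ["minimax-m2.1"]
  medium:
    exclusive: false
    members: ["qwen3-vl-8b"]
  embeddings:
    exclusive: false
    members: ["qwen-embedding-8b"]
EOF
echo "wrote llama-swap.yaml"
```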
I also have LiteLLM as my main endpoint for all clients; it routes requests to one of my servers (not just the Spark cluster) with fallbacks, etc. It also serves cloud models (Claude, ChatGPT).
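For the curious, a LiteLLM routing setup along these lines might look roughly like this - a sketch with made-up hostnames and placeholder model IDs, not my actual config:

```shell
# Sketch of a LiteLLM proxy config: one local OpenAI-compatible model
# plus a cloud fallback. Hostnames and model IDs are placeholders.
cat > litellm-config.yaml <<'EOF'
model_list:
  - model_name: local-qwen
    litellm_params:
      model: openai/qwen3-vl-8b            # local OpenAI-compatible endpoint
      api_base: http://spark:8080/v1
      api_key: "none"
  - model_name: claude
    litellm_params:
      model: anthropic/<claude-model-id>   # placeholder cloud model
router_settings:
  fallbacks:
    - local-qwen: ["claude"]               # fail over to the cloud model
EOF
echo "wrote litellm-config.yaml"
```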
I guess there is value in providing a “one-click” type integration that sets up llama-swap and our community Docker image (and llama.cpp) together, without reinventing either.
I see, thanks for the pointers! I’ll have to look into it when I get time to sit down. Have you been able to test more models than the ones you posted in spark-vllm-docker/recipes at main · eugr/spark-vllm-docker · GitHub? The problem when you have a single Spark is that you still need a small model alongside.
Also, how would you see a possible “one-click” integration that sets up llama-swap and your repo together? Does it run a script, or what would the entry point be?
Yes, @raphael.amorim and I are working on it. You are welcome to join!
llama-swap is just one self-contained binary, so it can run on the host system without Docker, although I believe it supports a fully Dockerized setup as well. Here is the config example (this one has only two groups defined). It was created before the recipes, so I used separate shell scripts like launch-cluster.sh, but now it can just call run-recipe.sh with the model name as a parameter.
stop-cluster.sh just stops the container by calling launch-cluster.sh stop
Great post. If anyone manages to get qwen-next-coder (FP4) going, I’d be super grateful for the specs. I’ve been trying to get it working for four days with no joy.
I used your repo scripts back in December/January and built some images with them, but I haven’t used them in a while since I found this scitrera image. I might give it a shot again, as I saw some commits came in.
Do you feel like the lack of a specific MoE config file (as warned by vLLM) could be something worth diving into?