vLLM containers

WillLee · March 7, 2026, 7:21pm

Nvidia somehow took down their vLLM playbook. But it’s still listed on GitHub at dgx-spark-playbooks/nvidia/vllm (main branch of NVIDIA/dgx-spark-playbooks).

Is something not compatible with the Spark now?

While I am at it, how does the Nvidia vLLM container differ from the community-developed vLLM container on GitHub, eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks)?

joshua.dale.warner · March 7, 2026, 7:48pm

The NGC container will absolutely run. The newest tag has vLLM 0.15.1, while the vLLM project just released 0.17.

Think of the NGC container as something that needs to support all of Nvidia’s hardware and is highly convenient. It’s optimized for stability and robust deployment.

In contrast, the community docker is narrowly focused on the Spark with Spark-specific optimizations. It will usually run faster, with convenience features and startup scripts/recipes for highly used model types and multi-node inference. Most of that playbook is now obviated by simply using the launch-cluster.sh infrastructure. To make this work, it builds on your own Spark, but build-and-copy.sh is very simple to run and use.

WillLee · March 7, 2026, 7:50pm

I see. Thanks. I am going to try the community version.

eugr · March 7, 2026, 10:33pm

Actually, since last week, it downloads pre-built flashinfer and vllm wheels by default to simplify and speed up the build process. These wheels are built nightly on my Spark cluster and run through a regression testing pipeline, where I launch multiple model recipes in both solo and cluster configurations and compare performance to the previous baseline. If a significant degradation or error is detected, the build fails.

This way you can always have the most up-to-date but working version of vLLM.

You can still rebuild from source by specifying the --rebuild-vllm flag (and/or --rebuild-flashinfer).

brian322 · March 7, 2026, 11:51pm

Does this have any effect on the “--vllm-ref” parameter?

I built with “--vllm-ref v0.17.0” and I got

version 0.17.1.dev0+gb31e9326a.d20260307

eugr · March 8, 2026, 12:38am

If you specify --vllm-ref or --apply-vllm-pr, it will build from source using the appropriate ref, even without --rebuild-vllm.

Looks like the commit hash in your build matches the release version:
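
For anyone who wants to check this themselves, here is a minimal sketch. It assumes the local version suffix follows the setuptools-scm convention (+g<short hash>.d<build date>), which is what the string reported above appears to use; the variable names are only illustrative.

```shell
# Version string reported by the build earlier in the thread.
version="0.17.1.dev0+gb31e9326a.d20260307"

# Drop everything up to and including "+g", then everything from the next ".".
commit="${version#*+g}"
commit="${commit%%.*}"
echo "$commit"   # prints b31e9326a

# Inside a vLLM source checkout, the release tag could then be resolved and
# compared against that hash (commented out because it needs the repository):
# git rev-parse --short=9 "v0.17.0^{commit}"
```

If the two hashes agree, the build came from the tagged commit even though the version string reads as a 0.17.1 dev build.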

aniculescu · March 9, 2026, 7:09pm

The vLLM playbook is undergoing some changes and will be republished soon.

WillLee · March 9, 2026, 7:23pm

Ah. I saw it came back today! Thanks!

So, since both Nvidia’s vLLM and the community’s vLLM are Docker containers, would it make sense to have both containers running to experiment with? I imagine I would just have to expose the two containers on different ports and use different endpoints?

eugr · March 9, 2026, 7:57pm

You can do that, but if you run them at the same time, make sure they are not competing for the same VRAM. By default, vLLM will consume all allocated VRAM (default allocation is 0.9 of total VRAM available). Use --gpu-memory-utilization to lower it.
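
As a rough sketch of that setup: each instance gets its own port and its own share of memory. The helpers below only print the commands (a dry run). The function names and model placeholders are hypothetical, while --port and --gpu-memory-utilization are standard vllm serve options. The 0.9 ceiling simply mirrors the default mentioned above and is an assumption, not a documented limit.

```shell
# Dry run: print the command for one vLLM instance instead of executing it.
# serve_cmd is a hypothetical helper; the model names are placeholders.
serve_cmd() {
  echo "vllm serve $1 --port $2 --gpu-memory-utilization $3"
}

# Succeeds only if the two fractions together stay at or below 0.9.
fits_together() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !((a + b) <= 0.9) }'
}

if fits_together 0.45 0.40; then
  serve_cmd model-a 8000 0.45
  serve_cmd model-b 8001 0.40
else
  echo "fractions too large to run side by side" >&2
fi
```

Each instance would then be reachable on its own endpoint, for example port 8000 for the first and 8001 for the second.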

Having said that, as of now, there is no benefit in using the NGC container on Spark. Since Spark optimizations are still not fully implemented in mainline vLLM/flashinfer/cutlass, the NGC container won’t run as well as our community version, given that it currently lags two versions behind.

WillLee · March 10, 2026, 3:11pm

Good point. I forgot about that! Is it possible to open the box up and swap in more memory? Has anyone played with that?

eugr · March 10, 2026, 4:15pm

If you are talking about (V)RAM, no, it’s soldered.
Only SSD can be replaced.

WillLee · March 10, 2026, 4:35pm

Ahhh, the soldered RAM is something I didn’t realize. Too bad.

ehfortin · March 26, 2026, 5:26pm

Speaking of parameters, I’m trying to run two models at once, so yes, I’m using --gpu-memory-utilization, which works. But when I specify --port 8001 for the second model, it always starts on 8000 like the first model. That’s using run-recipe.sh --gpu-memory-utilization 0.2 --solo --port 8001 nemotron-3-nano-nvfp4. I looked at the yaml and, sure enough, the port is defined as 8000, but I was expecting --port to override that. Should it, meaning this is a bug, or do I have to modify the yaml if I want to run two models at once?

dbsci · March 26, 2026, 6:22pm

It’ll be easier with sparkrun. sparkrun is essentially designed to replace run-recipe.sh. I’m working with @eugr and @raphael.amorim on Spark Arena.

Check out the website or GitHub for install instructions. A new version is coming out today or tomorrow that’ll also have a setup wizard and better integration with Spark Arena.

Once installed and configured, you could do:

sparkrun run nemotron-3-nano-nvfp4 --gpu-mem 0.2 --port 8001

and it’ll respect the port changes.

ehfortin · March 26, 2026, 7:18pm

Just spent a minute looking at that and it seems very cool. Will certainly look into this once the new version is published. I don’t have much time today anyway :)

Thank you!

eugr · March 26, 2026, 7:36pm

You can specify a different container name instead of the default “vllm_node” for a second instance, using the --name vllm_node_2 parameter. Then they won’t conflict with each other.

ehfortin · March 26, 2026, 8:15pm

Didn’t realize that. That said, what was happening then? I had a first model running on port 8000 and tried to start a second model on port 8001. Both models are running on port 8000 according to netstat -a. I just restarted both and realized that there is a single container running, so my understanding is that vLLM is running both models in the same container, and I guess that’s why only port 8000 shows up, twice? I’m surprised to see that, however.
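
One way to see what is actually being served is a sketch along these lines. It assumes the instances expose vLLM’s OpenAI-compatible API on localhost; models_url is a hypothetical helper and the port list is just the two ports discussed here.

```shell
# Hypothetical helper: build the model-listing URL for a local vLLM server.
models_url() {
  echo "http://localhost:$1/v1/models"
}

# List running containers with their names and published ports
# (skipped when docker is not installed).
if command -v docker >/dev/null 2>&1; then
  docker ps --format '{{.Names}}\t{{.Ports}}' || true
fi

# Ask each port which model it is serving; failures are reported, not fatal.
if command -v curl >/dev/null 2>&1; then
  for port in 8000 8001; do
    curl -s --max-time 5 "$(models_url "$port")" || echo "no response on port $port"
  done
fi
```

If a container uses host networking, the Ports column from docker ps will be empty, so the per-port query is the more telling of the two checks.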

ehfortin · March 26, 2026, 8:31pm

I’ve tried what you suggested (--name) but it’s not working. It’s not shown in the run-recipe.sh usage either. I tried anyway, but when I do that, it tries to use the name I give as the recipe name.

./run-recipe.sh --gpu-memory-utilization 0.2 --name vllm_node_2 --solo nemotron-3-nano-nvfp4

Error: Recipe not found: vllm_node_2

Searched in: vllm_node_2, /mnt/unity/efortin/projects/spark-vllm-docker/recipes

Thank you.

raphael.amorim · March 26, 2026, 8:33pm

You can run:

git pull
./run-recipe.sh recipes/nemotron-3-nano-nvfp4.yaml --solo --gpu-memory-utilization 0.2

You might need to limit --max-model-len because of the GPU memory limitation:

./run-recipe.sh recipes/nemotron-3-nano-nvfp4.yaml --solo --gpu-memory-utilization 0.2 --max-model-len 200000

eugr · March 26, 2026, 8:35pm

You need to pull from the repo first. Your version seems to be outdated.
