vLLM containers

Nvidia somehow took down their vLLM playbook, but it’s still listed on GitHub at dgx-spark-playbooks/nvidia/vllm (main branch of NVIDIA/dgx-spark-playbooks).

Is something not compatible with the Spark now?

While I am at it, how does the Nvidia vLLM container differ from the community-developed vLLM container at eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks)?

The NGC container will absolutely run. The newest tag now has vLLM 0.15.1, while the vLLM project just released 0.17.

Think of the NGC container as something that needs to support all of Nvidia’s hardware and is highly convenient. It’s optimized for stability and robust deployment.

In contrast, the community Docker image is narrowly focused on the Spark, with Spark-specific optimizations. It will usually run faster, and it comes with convenience features and startup scripts/recipes for commonly used model types and multi-node inference. Most of that playbook is now obviated by simply using the launch-cluster.sh infrastructure. To make this work, the image is built on your own Spark, but build-and-copy.sh is very simple to run.

I see. Thanks. I am going to try the community version.

Actually, since last week, it downloads pre-built flashinfer and vLLM wheels by default to simplify and speed up the build process. These wheels are built nightly on my Spark cluster and run through a regression testing pipeline, where I launch multiple model recipes in both solo and cluster configurations and compare performance to the previous baseline. If a significant degradation or error is detected, the build fails.

This way you always have the most up-to-date but still working version of vLLM.

You can still rebuild from source by specifying the --rebuild-vllm flag (and/or --rebuild-flashinfer).

Does this have any effect on the parameter “--vllm-ref”?

I built with “--vllm-ref v0.17.0” and I got

version 0.17.1.dev0+gb31e9326a.d20260307

If you specify --vllm-ref or --apply-vllm-pr, it will build from source using the appropriate ref, even without --rebuild-vllm.
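As a rough sketch of the rule described above: this is not the actual build script logic, only the behaviour as stated in this thread, and the function name build_mode is made up for illustration. It covers the vLLM side only; flashinfer has its own --rebuild-flashinfer flag.

```shell
# Hypothetical helper (not part of the repo): prints "source" if any flag that
# forces a vLLM source build is present, otherwise "wheels" (pre-built default).
build_mode() {
  for arg in "$@"; do
    case "$arg" in
      --rebuild-vllm|--vllm-ref|--apply-vllm-pr)
        echo "source"
        return 0
        ;;
    esac
  done
  echo "wheels"
}

build_mode --vllm-ref v0.17.0   # a ref was given, so this takes the source path
build_mode                      # no flags, so this takes the pre-built wheels path
```

So a build with --vllm-ref v0.17.0 would be expected to compile vLLM rather than pull the nightly wheel.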

Looks like the commit hash in your build matches the release version:
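If you want to check this yourself, here is a minimal sketch that pulls the commit hash out of the version string. It assumes the "+g<hash>" part is the abbreviated git commit and ".d<date>" is the build date, which is the usual convention for setuptools-scm style versions.

```shell
# Version string reported by the build in the post above.
version="0.17.1.dev0+gb31e9326a.d20260307"

# Take the hex digits that follow "+g"; the "." before the date ends the match.
commit=$(printf '%s\n' "$version" | sed -E 's/^.*\+g([0-9a-f]+).*$/\1/')
echo "$commit"

# In a vLLM checkout with tags fetched (an assumption), compare against the tag:
#   git rev-parse --short=9 'v0.17.0^{commit}'
```

If the two hashes agree, the build was made from the tagged release commit even though the version string says 0.17.1.dev0.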

The vLLM playbook is undergoing some changes and will be republished soon.

Ah. I saw it came back today! Thanks!

So, since both Nvidia’s vLLM and the community’s vLLM are Docker containers, would it make sense to have both containers running to experiment with? I imagine I would just have to expose the two containers on different ports and use different endpoints?

You can do that, but if you run them at the same time, make sure they are not competing for the same VRAM. By default, vLLM will consume all allocated VRAM (default allocation is 0.9 of total VRAM available). Use --gpu-memory-utilization to lower it.
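A quick way to sanity-check the split before launching is to add up the fractions. This is only a sketch: fits_budget is a made-up helper, and the 1.0 limit is an assumption (on a unified-memory box like the Spark you probably want to leave some headroom below that).

```shell
# Hypothetical helper: first argument is the limit, the rest are the
# --gpu-memory-utilization values, one per instance. Succeeds if their sum
# stays at or below the limit.
fits_budget() {
  limit="$1"; shift
  printf '%s\n' "$@" | awk -v limit="$limit" '{ sum += $1 } END { exit !(sum <= limit) }'
}

# Two instances at 0.2 and 0.7 fit; the default 0.9 plus another 0.2 does not.
if fits_budget 1.0 0.2 0.7; then echo "0.2 + 0.7: ok"; fi
if ! fits_budget 1.0 0.9 0.2; then echo "0.9 + 0.2: over budget"; fi
```

With plain vLLM each instance then gets its own port and fraction, e.g. vllm serve <model> --port 8001 --gpu-memory-utilization 0.2 for the second one.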

Having said that, as of now, there is no benefit in using the NGC container on Spark. Since Spark optimizations are still not fully implemented in mainline vLLM/flashinfer/cutlass, the NGC container won’t run as well as our community version, given that it currently lags 2 versions behind.

Good point. I forgot about that! Is it possible to open the box up and swap in more memory? Has anyone played with that?

If you are talking about (V)RAM, no, it’s soldered.
Only the SSD can be replaced.

Ahhh, the soldered RAM is something I didn’t realize. Too bad.

Talking about parameters, I’m trying to run two models at once, so yes, I’m using --gpu-memory-utilization, which works. But when I specify --port 8001 for the second model, it always starts on 8000 like the first model. That’s with run-recipe.sh --gpu-memory-utilization 0.2 --solo --port 8001 nemotron-3-nano-nvfp4. I looked at the yaml and, sure enough, port is defined as 8000, but I was expecting --port to override that. Should it, in which case this is a bug, or do I have to modify the yaml if I want to run two models at once?

It’ll be easier with sparkrun. sparkrun is essentially designed to replace run-recipe.sh. I’m working with @eugr and @raphael.amorim on Spark Arena.

Check out the website or GitHub for install instructions. A new version is coming out today or tomorrow that’ll also have a setup wizard and better integration with Spark Arena.

Once installed and configured, you could do:

sparkrun run nemotron-3-nano-nvfp4 --gpu-mem 0.2 --port 8001

and it’ll respect the port changes.

Just had a minute to look at that and it seems very cool. Will certainly look into this once the new version is published. I don’t have much time today anyway :)

Thank you!

You can specify a different container name instead of the default “vllm_node” for a second instance, using the --name vllm_node_2 parameter. Then they won’t conflict with each other.

Didn’t realize that. That said, what was happening then? I had a first model running on port 8000 and tried to start a second model on port 8001. Both models are running on port 8000 according to netstat -a. I just restarted both and realized that there is a single container running, so my understanding is that vLLM is running both models in the same container, and I guess that’s why only port 8000 shows up, twice? I’m surprised to see that, however.
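For anyone checking the same thing, counting listeners per port is a bit easier to read than scanning netstat -a. A minimal sketch, assuming the usual ss -ltn column layout (Local Address:Port in column 4); listeners_on_port is a made-up helper and the sample text is illustrative, not captured from a real Spark.

```shell
# Hypothetical helper: reads `ss -ltn` output on stdin and prints how many
# listening sockets are bound to the given port.
listeners_on_port() {
  awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$") { n++ } END { print n + 0 }'
}

# Illustrative sample only.
sample='State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    0.0.0.0:8000       0.0.0.0:*
LISTEN 0      128    0.0.0.0:22         0.0.0.0:*'

printf '%s\n' "$sample" | listeners_on_port 8000
printf '%s\n' "$sample" | listeners_on_port 8001
```

On a live system that would be ss -ltn | listeners_on_port 8001, and docker ps --format '{{.Names}}' shows whether a second container was started at all.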

I’ve tried what you suggested (--name) but it’s not working. It doesn’t show up in the run-recipe.sh usage either. I tried anyway, but when I do that, it tries to use the name I give as the model name.

./run-recipe.sh --gpu-memory-utilization 0.2 --name vllm_node_2 --solo nemotron-3-nano-nvfp4

Error: Recipe not found: vllm_node_2

Searched in: vllm_node_2, /mnt/unity/efortin/projects/spark-vllm-docker/recipes

Thank you.

You can run

git pull
./run-recipe.sh recipes/nemotron-3-nano-nvfp4.yaml --solo --gpu-memory-utilization 0.2

You might need to limit max-model-len because of GPU memory limitations:

./run-recipe.sh recipes/nemotron-3-nano-nvfp4.yaml --solo --gpu-memory-utilization 0.2 --max-model-len 200000

You need to pull from the repo first. Your version seems to be outdated.