The NGC container will absolutely run. The newest tag now has vLLM 0.15.1, while the vLLM project just released 0.17.
Think of the NGC container as something that needs to support all of Nvidia’s hardware and is highly convenient. It’s optimized for stability and robust deployment.
In contrast, the community docker is narrowly focused on the Spark, with Spark-specific optimizations. It will usually run faster, and it comes with convenience features and startup scripts/recipes for commonly used model types and multi-node inference. Most of that playbook is now obviated by simply using the launch-cluster.sh infrastructure. The one catch is that the image is built on your own Spark, but build-and-copy.sh is very simple to run and use.
Actually, since last week, it downloads pre-built flashinfer and vllm wheels by default to simplify and speed up the build process. These wheels are built nightly on my Spark cluster and run through a regression testing pipeline, where I launch multiple model recipes in both solo and cluster configurations and compare performance to the previous baseline. If a significant degradation or error is detected, the build fails.
This way you can always have the most up-to-date working version of vLLM.
You can still rebuild from source by specifying the --rebuild-vllm flag (and/or --rebuild-flashinfer).
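To illustrate the kind of gate described above (fail the build on an error or a significant slowdown against a baseline), here is a minimal sketch. It is not the actual pipeline: the `RecipeResult` type, the tokens-per-second metric, and the 5% `max_drop` threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class RecipeResult:
    recipe: str           # hypothetical recipe name
    tokens_per_s: float   # throughput measured for the new build
    error: bool = False   # True if the recipe failed to launch or run


def find_regressions(baseline, current, max_drop=0.05):
    """Return a list of failure messages; an empty list means the build passes.

    baseline: dict mapping recipe name -> previous tokens/s
    current:  list of RecipeResult for the new build
    max_drop: largest tolerated relative slowdown (assumed value, 5%)
    """
    failures = []
    for result in current:
        if result.error:
            failures.append(f"{result.recipe}: run failed")
            continue
        ref = baseline.get(result.recipe)
        if ref is None:
            continue  # new recipe, nothing to compare against yet
        drop = (ref - result.tokens_per_s) / ref
        if drop > max_drop:
            failures.append(f"{result.recipe}: {drop:.1%} slower than baseline")
    return failures
```

A build script could call `find_regressions(...)` after the benchmark runs and exit non-zero if the returned list is not empty.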
So, since both Nvidia’s vLLM and the community’s vLLM are docker containers, would it make sense to have both containers running to experiment with? I imagine I would just have to expose the two containers on different ports and use different endpoints?
You can do that, but if you run them at the same time, make sure they are not competing for the same VRAM. By default, vLLM will consume all allocated VRAM (default allocation is 0.9 of total VRAM available). Use --gpu-memory-utilization to lower it.
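A quick way to sanity-check the fractions before launching several instances is to add them up. The sketch below assumes each instance's --gpu-memory-utilization value is a fraction of the same total memory, and that you want to keep the combined total at or below 0.9 to leave some headroom; the function name and the 0.9 ceiling are illustrative choices, not vLLM behavior.

```python
def check_memory_budget(fractions, max_total=0.9):
    """Check whether per-instance memory fractions fit side by side.

    fractions: one --gpu-memory-utilization value per vLLM instance
    max_total: assumed ceiling for the combined fraction (headroom is a guess)
    Returns (fits, total).
    """
    for fraction in fractions:
        if not 0.0 < fraction <= 1.0:
            raise ValueError(f"fraction out of range: {fraction}")
    total = sum(fractions)
    return total <= max_total, total


if __name__ == "__main__":
    # Default 0.9 for the first instance plus 0.2 for the second does not fit.
    print(check_memory_budget([0.9, 0.2]))
    # Lowering the first instance to 0.6 leaves room for the second.
    print(check_memory_budget([0.6, 0.2]))
```

Actual usage also depends on model weights and KV cache size, so treat this only as a first check on the flags.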
Having said that, as of now, there is no benefit in using the NGC container on Spark. Since Spark optimizations are still not fully implemented in mainline vLLM/flashinfer/cutlass, the NGC container won’t run as well as our community version, given that it currently lags 2 versions behind.
Speaking of parameters, I’m trying to run two models at once so, yes, I’m using --gpu-memory-utilization, which works. But when I specify --port 8001 for the second model, it always starts on 8000 like the first model. That’s with run-recipe.sh --gpu-memory-utilization 0.2 --solo --port 8001 nemotron-3-nano-nvfp4. I looked at the yaml and, sure enough, the port is defined as 8000, but I was expecting --port to override that. Should it, in which case this is a bug, or do I have to modify the yaml if I want to run two models at once?
Check out the web or GitHub for install instructions. A new version is coming out today or tomorrow that’ll also have a setup wizard and better integration with Spark Arena.
Once installed and configured, you could do:
sparkrun run nemotron-3-nano-nvfp4 --gpu-mem 0.2 --port 8001
Just had a minute to look at that and it seems very cool. Will certainly look into this once the new version is published. I don’t have much time today anyway :)
You can specify a different container name instead of the default “vllm_node” for a second instance, using the --name vllm_node_2 parameter. Then they won’t conflict with each other.
Didn’t realize that. That said, what was happening then? I had a first model running on port 8000 and tried to start a second model on port 8001. Both models are running on port 8000 according to netstat -a. I just restarted both and realized that there is a single container running, so my understanding is that vllm is running both models in the same container, and I guess that’s why only port 8000 shows up, twice? I’m surprised to see that, however.
I’ve tried what you suggested (--name) but it’s not working. It’s also not shown in the run-recipe.sh usage. I tried it anyway, but when I do that, it tries to use the name I give as the model name.
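For checking which ports actually have a server behind them, as an alternative to reading netstat -a output, a small probe like the one below can help. It is only a sketch: it assumes the servers are reachable on 127.0.0.1 and uses ports 8000 and 8001 from the discussion above; it says nothing about which container or model is answering.

```python
import socket


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (8000, 8001):
        state = "listening" if is_listening(port) else "closed"
        print(f"port {port}: {state}")
```

If only 8000 reports as listening after starting the second model, the --port override did not take effect for that instance.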