Of course, you can install it like any normal Python program. You can find very extensive documentation on the official docs site:
# Install vLLM with a specific CUDA version (e.g., 13.0).
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
export CUDA_VERSION=130 # or other
export CPU_ARCH=$(uname -m) # x86_64 or aarch64
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
This will install the latest release, but without any useful patches that may have come up since.
BTW: there is no CUDA 12.1a. 12.1a refers to the architecture (GB10), or "compute capability" as it is called in the official documentation.
But be aware: when not using the community eugr edition of vLLM, you will miss some of the patches that improve overall performance and fix annoying bugs that spoil the fun, especially with Gemma4. vLLM still has open issues for Gemma4.
EDIT: You might need to update the transformers version manually. Last time I checked, the official build was still using an older transformers version than Gemma4 needs (>=5.5.0).
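If you want to check whether your environment is affected before upgrading, a quick sketch like the following works; the >=5.5.0 requirement is the figure quoted above, and `needs_upgrade` is just a hypothetical helper (compare the dotted versions numerically, not as strings):

```python
# Check whether the installed transformers is new enough for Gemma4.
# The 5.5.0 threshold comes from the post above, not from any official spec.
from importlib.metadata import version  # stdlib; reads installed package metadata

def needs_upgrade(installed: str, required: str = "5.5.0") -> bool:
    """Compare dotted version strings numerically, not lexically."""
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return to_tuple(installed) < to_tuple(required)

# e.g., inside the vLLM environment:
# if needs_upgrade(version("transformers")):
#     print("run: uv pip install --upgrade 'transformers>=5.5.0'")
print(needs_upgrade("4.57.1"))  # True -- older than 5.5.0
```

Note that a naive string comparison would get this wrong ("5.10.0" < "5.5.0" lexically), which is why the sketch converts to integer tuples first.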
Total newbie here. Every one of you knows way more than me, but I wanted to share my experience. Yesterday I decided to give @eugr's implementation a try using the sparkrun software package. It loaded in vLLM with about 116 GB of RAM usage. I opened up a basic Docker container with open-webui to "chat" with it. Oh my goodness!! I have NEVER seen a model respond so quickly; it was instantaneous, no matter what questions I threw at it. Advanced physics, engineering, electrical design, random puzzles: it handled them all like a champ. I chatted with it for several HOURS in the same session; it never slowed down, never went sideways. My cache hit usage hit the high 90s, and honestly I forgot to look at tokens per second. But man, I have seen nothing like it so far on the Spark, at least for a semi-large(ish) model.

I only have 1 Spark. Sadly it uses too much memory: when I tried to fire up my other models to run in Agent Zero (I was going to replace my Qwen3 model, also from Eugr, with Gemma to test it out), I ran out of memory and it shut itself down. I know I could probably turn the memory usage down a bit and make it work, but I don't want to lose any of the "brain" of Gemma4.
It's one of only two models that answered the "car wash riddle" correctly, and I have tested a lot of models! If you are unfamiliar with the riddle, give it a try in your favorite LLM to see if it can figure it out. It goes as follows:
I live next to a car wash. My car is very dirty. It needs a wash. Should I walk or drive to the car wash? That's it. Many models will say to walk, which, well, you know, misses the point. Some go on and rant about how it's safer and better for the environment, blah blah, you name it. Usually a model will self-correct if you reply with "well, if I walk, should I carry my car on my back?" Most figure it out then; some throw weird answers like "trying to carry it on your back is dangerous", lol. Anyways, TL/DR: Gemma4 on vLLM with Eugr's not-so-secret sauce is amazing. If we could just reduce the RAM some more without losing the speed and accuracy, that would be awesome. Come on Eugr, I know you can do it! :) Cheers all!
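For anyone who wants to run the same riddle test against their own setup, here is a minimal sketch using vLLM's OpenAI-compatible chat completions endpoint. The base URL and model name are assumptions (adjust to whatever your recipe actually serves); `build_request` and `ask` are just illustrative helpers, not part of any of the tools discussed here:

```python
# Send the "car wash riddle" to a locally served model via the
# OpenAI-compatible /v1/chat/completions API that vLLM exposes.
import json
import urllib.request

RIDDLE = ("I live next to a car wash. My car is very dirty. It needs a wash. "
          "Should I walk or drive to the car wash?")

def build_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(base_url: str, model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (assumes a server on port 8000 and that the model name matches
# whatever --served-model-name was set to):
# print(ask("http://localhost:8000", "gemma4", RIDDLE))
```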
I regret having to make a reality check, but Gemma 4 31B is a dense model. With 31B active parameters and 273 GB/s of memory bandwidth on the Spark, we are not going very far. Granted, there is still more tok/s to be squeezed out with more aggressive quants, but as long as we retain at least 4-bit quants, inference will remain sluggish at best. 🤷🏻♂️ At least this is the expected behaviour on a single Spark; dense models will benefit from parallelism and, there, YMMV indeed.
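The back-of-envelope math behind that claim: a memory-bandwidth-bound dense model has to read all of its weights once per generated token, so bandwidth divided by weight size gives a hard ceiling on decode speed. The 31B and 273 GB/s figures come from the post above; real throughput will be lower still once KV cache reads and compute overheads are counted.

```python
# Theoretical upper bound on decode tok/s for a bandwidth-bound dense model:
# every generated token must stream all weights from memory once.

def max_tokens_per_s(params_b: float, bits_per_weight: float, bw_gb_s: float) -> float:
    """Bandwidth (GB/s) divided by bytes of weights read per token (GB)."""
    weight_gb = params_b * bits_per_weight / 8  # 31B at 4-bit -> 15.5 GB
    return bw_gb_s / weight_gb

print(f"{max_tokens_per_s(31, 4, 273):.1f} tok/s")  # ~17.6 at 4-bit
print(f"{max_tokens_per_s(31, 8, 273):.1f} tok/s")  # ~8.8 at 8-bit
```

So even in the ideal case, a 4-bit quant of a 31B dense model tops out under ~18 tok/s on 273 GB/s, which is why MoE models with few active parameters feel so much faster on this hardware.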
Understood; to be clear, I'm not running sparkrun, but the newest spark-vllm-docker, and I followed the instructions on the Spark Arena - LLM Leaderboard, which do not seem to work when following the How to Use instructions :-(
Do you have particular logs/issues you could share?
I’ll try it out directly with the run-recipe in spark-vllm-docker.
Note that the working version of the plan is for sparkrun to be the authoritative means of running recipes, so I always recommend that first. sparkrun uses spark-vllm-docker to provide vllm, but supports more complexity and growth in what we can do with recipes.
The --setup works and the Docker container is created, but when the recipe starts to load in --solo mode, it fails with "Error: Recipe missing required field: name", and if I manually add a name field, another half dozen items come up missing or not found. Most of them seem to be items listed in defaults that are not loading into the command.
~/spark-vllm-docker$ ./run-recipe.sh gemma4-26b-a4b-AWQ --solo --setup --served-model-name cyankiwi/gemma4-26b-a4b-AWQ
Warning: Recipe uses schema version '2', but this run-recipe.py supports: ['1']
Some features may not work correctly. Consider updating run-recipe.py.
Recipe: Gemma4-26B-A4B-AWQ
Thanks for reporting that; there's a bug right now where Version 2 recipes (sparkrun only) are showing instructions to run them on spark-vllm-docker. I'll fix it.
I’ll save you a really long explanation on this (which you can find in abundance elsewhere on this forum):
- NVFP4 is not performant on DGX Spark (still/yet)
- AWQ/AutoRound quants almost always offer faster speed and equal or better quality
- The boost NVFP4 is supposed to provide on Blackwell doesn't work on our Blackwells.
On the vLLM side, we're tracking a few reasoning-parser and tool-call-parser fixes for streaming responses with Gemma4, as well as some general chat-template issues with the model that impact all inference servers in multi-turn agentic workflows; we're sorting those out. These should all make their way into vLLM main and releases shortly, but feel free to ping me (bbrowning) on issues opened in vLLM's GitHub if you hit specific problems that need triage and fixing there.
As a fellow DGX Spark daily driver, thanks for being so on top of these things!