VLLM -- the $150M train wreck?

Can anyone suggest a tag of vLLM that works reliably on the DGX Spark Ray configuration? The number of breaking commits and regressions in this codebase is making my head spin.

I’m getting core engine failures with GPT-OSS-120b, Qwen3-VL-235b (NVFP4 and AWQ) and GLM 4.6V (with appropriate TF5). I was able to get each of these working at one time, but now I can’t get any of them to run reliably.

Thanks

I really admire llama.cpp. Far fewer contributors, yet they are much more nimble and produce (IMHO) much better code. NCCL support on that stack would be enough to kick vLLM to the curb.

You could try my version for gpt-oss-120b: GitHub - christopherowen/spark-vllm-mxfp4-docker

The ‘community docker’ is a great resource: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

Generally, stick to release builds rather than HEAD for vLLM.

I have been using the community docker, including with and without mxfp4 bits. I’ll try your docker as well.

Just wondering if others have had this problem or if I have to nuke and reimage from scratch. I have deleted cache directories, recloned and rebuilt from source, pruned the builder cache, etc.

Thanks

You are not alone. Nightly builds are like a box of chocolates, you never know what you’re going to get.

Hmm… No problem here with the latest builds. They do break things, and I try to introduce fresh patches as soon as possible so we can continue to build from main.

However, I am getting close to setting up a nightly build pipeline on my Sparks that builds and tests a series of recipes (in the cluster and separately), so we always have the latest build that has been tested.

This stable build will be uploaded as an artifact to GitHub (prebuilt wheel) and possibly even to Docker Hub as a complete image, although building from prebuilt wheels would be faster than pulling the entire image.

The only question is whether I should:

  1. Make the stable build default (like I do with flashinfer) and trigger an actual rebuild only if --rebuild-vllm is specified.
  2. Introduce a --stable flag.

I think #1 would be better and more consistent with how I handle flashinfer. In all cases, it will first check if the published wheel is newer than the one stored locally.
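To make option #1 concrete, here is a minimal sketch of the decision logic as I understand it from the description above. The function name, the return values, and the idea of comparing timestamps are assumptions for illustration only; this is not the actual code in the community repo.

```python
from datetime import datetime
from typing import Optional


def choose_vllm_source(
    published: Optional[datetime],
    local: Optional[datetime],
    rebuild: bool = False,
) -> str:
    """Pick where the vLLM wheel should come from (hypothetical helper).

    published: timestamp of the newest published stable wheel, or None
    local:     timestamp of the wheel already stored locally, or None
    rebuild:   stands in for an explicit --rebuild-vllm request
    """
    if rebuild:
        # An explicit rebuild request always wins.
        return "rebuild"
    if published is None:
        # Nothing published: reuse the local wheel if there is one.
        return "use-local" if local is not None else "rebuild"
    if local is None or published > local:
        # The published wheel is newer than what we have (or we have none).
        return "download"
    return "use-local"
```

With this shape, the stable wheel is the default path and a source build happens only on request or when no wheel exists at all.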

Hmm, it could always be something with my system. I’ve tried to clean everything out, but I’ve done a lot of “experimenting”, so who knows what ghosts lie therein.

My vote is #1, thanks!

I’m pretty new here and love your work. As for flags, though, I’d suggest that the stable build be the default and --nightly be the flag.

You sound like I did last week. Updates got to the point where one node even failed to produce nvidia-smi output.

Check again for updates and firmware, and be sure to reboot. Also, in my case, I had to configure NCCL correctly using eugr’s networking notes in the community repo.

I have pinned a working checkout in the Podman file because quality management is missing. Tests? It seems that they push everything to HEAD. I don’t know who cleans this up. Sometimes it compiles.

Pin your working checkout in the buildfiles!
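One way to keep yourself honest about pinning is a small pre-build check that the ref you are about to build is a full commit SHA and not a branch name. This is only a sketch: the helper name is made up, and where the ref comes from (build arg, env var, file) is up to your setup.

```python
import re

# A full Git commit SHA-1 is exactly 40 hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_commit(ref: str) -> bool:
    """Return True if ref looks like a full commit SHA (hypothetical helper).

    Branch names such as "main" or "HEAD" move over time, so they do not
    count as pinned. Abbreviated SHAs are rejected as well, to be strict.
    """
    return _FULL_SHA.fullmatch(ref.strip().lower()) is not None


def require_pinned(ref: str) -> str:
    """Raise before the build starts if the ref is not a full commit SHA."""
    if not is_pinned_commit(ref):
        raise ValueError(f"refusing to build unpinned ref: {ref!r}")
    return ref.strip().lower()
```

Release tags are a reasonable alternative to SHAs; the check above is deliberately stricter and would need loosening to allow them.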

Does mxfp4 mean that this particular vLLM image is compatible with mxfp4 models?

In this case, it quantizes to mxfp4 and prefers smaller activations to reduce memory bandwidth usage and increase TPS.

The tagged releases are typically pretty stable.

That’s why I also publish prebuilt community images aligned with the tagged releases for vllm (and sglang) (links below).

They fill a different niche than @eugr’s repo. I essentially release at the same cadence as the projects (so faster than NVIDIA’s images but slower than eugr). Although if eugr starts doing reliable nightly, prebuilt images, I might drop doing vllm…

The number of approved PRs breaking stuff has skyrocketed recently…
This PR broke Qwen3-Next autoround quants that were working perfectly fine just yesterday: [BUGFIX][Qwen3.5] Hardcode `mlp.gate` as not quantizable by vadiklyutiy · Pull Request #35156 · vllm-project/vllm · GitHub

EDIT: I flagged it to vLLM folks, will be pushing a mod in the meantime.

EDIT 2: It looks like this model (Intel/Qwen3-Coder-Next-int4-AutoRound) quantizes layers it shouldn’t, but the PR also fails in a way it shouldn’t. We are discussing it with the devs.

EDIT 3: Pushed the mod. Use it with: --apply-mod mods/fix-qwen3-next-autoround

Thanks for sharing your setup. It ran just fine on a single DGX Spark, but qwen3-coder-next with Claude Code unfortunately did not work very well. That is not due to your setup but to limitations in the vLLM v1/messages API, as vLLM does not implement all beta features (like count_tokens). I could turn this off before running Claude Code, but it would still sometimes exceed my token count, and the API would crash. I had more success with the setup mentioned here. It uses LiteLLM to implement the API and is based on the work done by Avarok on making NVFP4 work well on a Spark. Hope this helps others too.

FYI, his method of running NVFP4 on Spark works with stock vLLM (including our community Docker) as well. The Marlin NVFP4 kernel is part of vLLM, so his patches don’t do anything there. Discussed here.

Yes, LiteLLM works well for proxying /messages too. I used it before but recently switched to the native vLLM endpoint for testing, and it worked reasonably well. But yes, count_tokens is not currently implemented.
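Until count_tokens is available, one client-side workaround for the "exceeded my token count" crashes is to cap the requested completion length so prompt plus completion fits in the context window. A rough sketch, assuming you can estimate the prompt token count yourself (for example with the model's tokenizer); the function and parameter names are hypothetical and not part of vLLM or LiteLLM.

```python
def clamp_max_tokens(
    prompt_tokens: int,
    context_window: int,
    requested: int,
    reserve: int = 0,
) -> int:
    """Cap the completion budget so the request fits the context window.

    prompt_tokens:  estimated tokens in the prompt (caller-supplied estimate)
    context_window: the context length the server was started with
    requested:      the max_tokens value the client wanted to send
    reserve:        optional safety margin for an imprecise estimate
    """
    available = context_window - prompt_tokens - reserve
    if available <= 0:
        # The prompt alone does not fit; fail early on the client side.
        raise ValueError("prompt does not fit in the context window")
    return min(requested, available)
```

For example, a 120,000-token prompt in a 131,072-token window with a 1,024-token reserve leaves 10,048 tokens, so a requested 32,000 would be sent as 10,048.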

You will probably need to edit the provided litellm config:

PSA: if you built your image on 2/24 or 2/25 and experienced issues with loading quantized models, please rebuild now - they fixed a major regression: Revert "[Misc] Enable weights loading tracking for quantized models" by LucasWilkinson · Pull Request #35309 · vllm-project/vllm · GitHub

Is there any change in particular I need to make?
I’ve asked the developer about the max_tokens config, which seemed low to me, although it is not related to the model’s max_tokens. According to what I’ve read, it is related to max_completions_tokens, which was around 8k for previous Anthropic models according to this issue.

@eugr Thanks for the quick update!
And would it be easy to add a recipe for the Qwen3.5 models? Perhaps in combination with a recipe mod to include LiteLLM too? I mainly use LLMs for developing locally, so that would make usage with Claude Code or Codex more reliable, I guess, until vLLM has updated their API.

Yes, we’ll be adding recipes soon.
As for LiteLLM, I’m not sure what you are asking. To include documentation on setting up a LiteLLM proxy? Or to deploy LiteLLM automatically?