I switched to vllm-openai-0.21 as soon as it was released, because it added support for Mistral 4 Small 119B A6.5B, which I love so much. Still no support for EAGLE, though. I was so happy that I switched all models and vLLM instances to it. Only days later I noticed that performance had gone down on Qwen 3.6 27B. Looking at the logs, I realized the MTP prediction rate had turned to 0% on every single request. So 0.21 broke Qwen MTP. Mistral stayed on 0.21, while the rest of the fleet, including Qwen 3.6 27B and Nemotron Cascade 2, went back to 0.20, specifically the vllm-openai-gb10-0.20 release tailored to GB10.
Can you clarify what this image/tag is? I can’t see any tags containing gb10 on their Dockerhub page.
That’s the official vLLM image. It’s not tailored for the DGX Spark, and you should use the eugr images instead: eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks).
Well, it works for Mistral, lol :) Regarding the eugr image and the sparkrun project, they tend to use dev builds and nightlies. Good for experiments, not good for real stable use. Besides, an hour ago I went to Spark Arena, looked at the stats published for Cascade 2, spun up the recipe, and tested it: 40-45 tg tps on 1 seq, while the image above gives me a very stable 59 tg tps. I am a very practical person.
Heh… any software version that begins with a zero (0.21) is implicitly not ready for production :-).
I would recommend building from source using eugr_nv’s build_and_copy.sh or use the same tool to get eugr_nv’s wheels.
Either way, because it is still Alpha software, you have to do your own QA.
sparkrun is great, but to use it for real I would need to patch the scripts: it spins up the instance as a docker run command with no detached logs and no instance persistence. I can do that, but then every update will break it. Maintaining it, merging... meh. If it gives a significant boost, sure, but if not, why? I cook my own instances, run my own scripts, with autorestart. sparkrun is great for total newbies: just hit a script. I constantly fine-tune my docker create commands, adjust memory use, and test new models.
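For readers who want to see what a persistent, auto-restarting container looks like in practice, here is a minimal sketch in Python that only builds the `docker create` command line and prints it. It is not the poster's actual command. The container name, image reference, and model ID are hypothetical placeholders, and it assumes the image's entrypoint is the vLLM OpenAI-compatible server listening on port 8000, so that everything after the image name is passed to vLLM.

```python
import shlex


def docker_create_argv(name, image, model, host_port=8000, gpu_mem_util=0.80):
    """Build (but do not run) a `docker create` command for a vLLM server."""
    return [
        "docker", "create",
        "--name", name,                   # fixed name, so `docker logs <name>` keeps working
        "--restart", "unless-stopped",    # restart automatically unless stopped by hand
        "--gpus", "all",
        "--ipc", "host",
        "-p", f"{host_port}:8000",        # assumes the server listens on 8000 in the container
        image,
        # Arguments after the image go to the entrypoint (assumed to be vLLM).
        "--model", model,
        "--gpu-memory-utilization", str(gpu_mem_util),
    ]


# Hypothetical placeholders; substitute the image tag and model you have validated.
argv = docker_create_argv(
    name="qwen-stable",
    image="my-registry/vllm-openai:0.20-pinned",
    model="my-org/my-model",
)
print(shlex.join(argv))
```

A container created this way persists after it stops, so `docker start`, `docker logs` and `docker rm` all work against the same named instance.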
PS: I DID NOT look too much into it; it may well have flags (it has a lot) to do all of that. The simple command just does docker run. I have it, I test it. Should I see better results than vanilla, I will definitely look into it more closely. My QSFP56 cable arrives in 2 weeks. That would be a good opportunity to dig deeper, as it will definitely be much easier with sparkrun than with manual vanilla containers, doing NCCL and setup manually.
PPS: Sorry, I keep saying sparkrun for some reason when I mean eugr's vllm spark docker. I cloned and tested them at the same time and was under the impression that sparkrun uses the eugr build (does it?). Sorry for the mixup, folks.
The latest pull is broken, and by the way, the community docker uses a software that begins with 0.2… that’s the current vLLM version.
Yeah, it was updated yesterday to 0.21.1dev10. My last pull was 2 weeks old, which explains the missing artifacts etc. Anyway, I am back again on the 0.20 stable release and my custom scripts. Mistral 4 stays on 0.21 stable. That is the way of open source software on essentially experimental hardware, which surprisingly works very well if you dump a lot of time into research and testing. For example, NVIDIA released trtllm with Spark support in version 09, and now it's on 13 and no longer works with Spark. The roadshow for Spark marketing ended, so why bother :)
keep containers for logs: sparkrun run <recipe> --no-rm
override image from existing recipe: sparkrun run <recipe> --image alternate-container
You can also always make your own recipes with whatever images you prefer. It has functionality for automating/orchestrating, even if you need to do pre-/post- serving patching.
sparkrun and @eugr’s spark-vllm-docker are different, but surprise! we’re both part of the Spark Arena team and collaborate on all of this.
sparkrun is not specifically limited to spark-vllm-docker but does have first class support for it. sparkrun is meant to help you manage orchestration logistics whether on single node or clustered.
It's a cool project, agreed. I did the recipe, but the problem I see is that they push the latest image wheels pinned by eugr for everything. Unfortunately it just does not work that way, as this thread explains. vLLM 0.21.x fixed Mistral (and DeepSeek too?) but broke MTP on Qwen 3.6, and the current dev version pinned by eugr slowed Mistral down too, from 29 tps in the 0.21 release to 16 tps in the 0.21.1dev10 he has pinned. This is the reality of bleeding edge. I have so many repos and builds across my two Sparks that I am already losing track. vLLMs, llama.cpp custom builds, TensorRT-LLM, SGLangs, NIM,…
Yeah, it’s messy. Typically eugr’s build has been good. (He maintains a Spark-specific regression check pipeline and has actually skipped a bunch of releases because of it. It’s not just nightly.) That being said, there have been a bunch of real problems post vLLM 0.19. It’ll get handled, but yes… that’s what’s bad about bleeding edge.
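To put a number on the Mistral slowdown quoted above (29 tps on the 0.21 release versus 16 tps on 0.21.1dev10), here is a small sketch of the arithmetic. The 10% tolerance used to flag a regression is an arbitrary assumption for illustration, not a threshold anyone in this thread uses.

```python
def percent_change(baseline_tps, candidate_tps):
    """Relative change in tokens/second from baseline to candidate, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


def is_regression(baseline_tps, candidate_tps, tolerance_pct=10.0):
    """True if throughput dropped by more than tolerance_pct (assumed threshold)."""
    return percent_change(baseline_tps, candidate_tps) < -tolerance_pct


change = percent_change(29, 16)
print(f"Mistral: {change:.1f}% -> regression: {is_regression(29, 16)}")
# prints "Mistral: -44.8% -> regression: True"
```

Running the same check per model before moving a fleet to a new build is one cheap way to catch this kind of drop before it reaches daily use.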
Unfortunately, despite its widespread use, vllm is still a very young/immature project.
(1) It’s messy but maintain docker container builds of what works; and/or
(2) Maintain your own recipes that use your approved container builds
Although it sounds like I’m telling you to do extra work, maintaining a directory of recipe yaml files, ideally in a git repo so that you can track changes, is actually one of the better/easier ways to keep things straight. You can have yaml files for different models or use-cases, try them out (using pinned versions, not “latest” for containers), commit what works, and then you actually have a traceable record. If you think what you’re doing is great, you can even publish your recipes for others (Registries | sparkrun).
That is what sparkrun is for. It is not specific to spark-vllm-docker. It is not specific to vllm. But yes, the default container source for vllm recipes is spark-vllm-docker, but nobody said you had to use that…
For llama.cpp, I have continuously updating spark-specific containers being built: https://github.com/spark-arena/dgx-llama-cpp/pkgs/container/dgx-llama-cpp. Using “latest” means bleeding edge and incurs risk, but if you maintain your own recipe files, you can just put in whatever release worked for you and use stable, versioned containers (e.g. ghcr.io/spark-arena/dgx-llama-cpp:b9253-cu131)
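As an illustration of the "pinned, not latest" advice, here is a minimal Python sketch that checks whether a container image reference is pinned to a specific tag or digest. The `approved` mapping is a hypothetical stand-in for whatever your recipe files contain. It does not parse sparkrun's recipe format, and the last entry is a made-up local registry example.

```python
def image_tag(ref):
    """Return the tag of an image reference, or None if it has no tag."""
    ref = ref.split("@", 1)[0]            # drop any @sha256:... digest
    last_segment = ref.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return None


def is_pinned(ref):
    """True if the reference names a digest or an explicit non-'latest' tag."""
    if "@sha256:" in ref:
        return True
    tag = image_tag(ref)
    return tag is not None and tag != "latest"


# Hypothetical mapping of use-case -> image reference.
approved = {
    "llama-cpp": "ghcr.io/spark-arena/dgx-llama-cpp:b9253-cu131",
    "sglang": "scitrera/dgx-spark-sglang:0.5.12",
    "experiment": "ghcr.io/spark-arena/dgx-llama-cpp:latest",
    "untagged": "localhost:5000/vllm",
}

for name, ref in approved.items():
    status = "pinned" if is_pinned(ref) else "NOT pinned"
    print(f"{name}: {ref} -> {status}")
```

A check like this could run as a pre-commit step in the recipe repo, so an unpinned image never lands in the set of recipes you treat as stable.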
For sglang there is less choice. I build containers periodically, but I haven’t been on top of sglang lately: the 0.5.12 container (scitrera/dgx-spark-sglang:0.5.12) is experimental, while 0.5.11 was pretty stable. See scitrera/dgx-spark-sglang on Docker Hub and the forum thread “New pre-built sglang Docker Images for NVIDIA DGX Spark” (#30 by dbsci).
And obviously you can build your own containers or use any others that others have published.
So anyway, I agree with you that vllm 0.21 is a bit broken, but the other point you made in this thread is that it’s getting hard to keep track of things and that you have a problem with everything being bleeding edge. I just wanted to let you know that there are tools out there that help you control it.
Thanks a lot for so much information. It’s been a very hard learning curve for me over the last six weeks or so, lol. I had barely touched Linux or bash for the previous 20+ years, lol.
But I feel grateful. It’s a steep learning curve, but it’s good to feel in control. I went from being Windows-centric for 20 years to having two mini servers with Ubuntu on my desk and a desire to dump Windows for good (hard, as part of my project is tied to Windows-only frameworks, apps and VS 18). All because of a very recent decision to go all in on local AI.
And I am appreciative of eugr, vLLM, Spark and all the good work the community does. It’s open source. People give their blood and tears for the good of the community. It’s hard to criticize. And the landscape is insane too. Tech is literally changing by the hour. In my 30-year career in IT and fintech I have never seen anything close. Even the evolution of bitcoin and crypto post 2016, which felt nauseatingly fast, looks like a standstill now. Yet we complain that models released literally 12 hours ago do not work beautifully with one keystroke, despite using an all-new architecture. Lol. You get used to good things fast.
I still remember the days when I bought my first NVIDIA EVGA card and spent days testing interrupts just to show something on my 720 display, let alone play Quake ;)
Yes, I have a pipeline, but it tests only select models. I will add more models to the list once we have a 24/7 build setup - for now it runs at night on my personal cluster that I want to use for other things too :)
I am running vLLM 0.21 on an AGX Thor (and on RHEL 9.8 with an RTX 3090).
It seems that when I try to stop it, it does not release all resources.
Error message:
(APIServer pid=3171) INFO: Application shutdown complete.
(APIServer pid=3171) INFO: Finished server process [3171]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
This results in the GPU memory not being released and a subsequent crash if I try to restart vllm.
Looks like a bug to me, even though I am not in a position to track this down myself.
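Until the leak itself is tracked down, one possible guard against the crash on restart is to wait until enough GPU memory is free before relaunching. The sketch below is only an assumption about how such a guard could look, not a fix for the bug. `probe` is a hypothetical caller-supplied function returning free GPU memory in bytes (for example, a wrapper around `torch.cuda.mem_get_info()`), and the demo uses fake readings instead of a real GPU.

```python
import time


def wait_for_free_memory(probe, required_bytes, timeout_s=60.0, interval_s=1.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it reports at least required_bytes free.

    Returns True once enough memory is free, or False if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        if probe() >= required_bytes:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with fake readings (in GiB), standing in for a real GPU memory query.
GIB = 1024 ** 3
fake_readings = iter([2 * GIB, 8 * GIB, 40 * GIB])
ok = wait_for_free_memory(lambda: next(fake_readings), required_bytes=32 * GIB,
                          timeout_s=5.0, interval_s=0.0)
print("safe to restart" if ok else "memory still held; do not restart yet")
```

If the memory never comes back within the timeout, that at least turns a crash on restart into a clear signal that the old process tree still needs cleaning up.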