The good news is that the native FP8 version is supported out of the box in our community Docker and performs reasonably well at ~43 t/s on a single Spark.
Please note that if you launch with the parameters from the model card, vLLM will disable prefix caching, which really hurts coding workflows because the prompt gets re-processed on every request. Also, by default it uses the FLASH_ATTN backend, which fits only ~60K tokens of context at 0.8 memory utilization. With the FlashInfer backend, the KV cache fits ~170K tokens without quantizing to fp8!
Here is how you can run with prefix caching enabled. vLLM says that prefix caching support for this architecture is experimental, but it seems to work OK:
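A minimal sketch of what that looks like (the model ID and the context length are assumptions on my part; --enable-prefix-caching, --gpu-memory-utilization, --max-model-len, and the VLLM_ATTENTION_BACKEND variable are standard vLLM options):

```bash
# Minimal sketch, not the exact command from this setup: the model ID and
# the --max-model-len value are assumptions; adjust them to your environment.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.8 \
  --max-model-len 170000
```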
I don’t know if you’ve seen ngram-mod or not, but it really can make LLMs fly in certain iterative coding tasks, which also come up in agentic workflows where an LLM reads a file and then modifies it.
If they fix ngram-mod for Qwen3-Next models, it would be a hard choice between vLLM and llama-server. I think vLLM should consider implementing the same feature.
vLLM has some kind of “suffix decoding” specdec via “arctic-inference” which might be similar, but I haven’t tried it, and the fact that I’ve never heard anyone mention it doesn’t inspire much confidence. Maybe it’s great, though.
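For comparison, vLLM’s built-in ngram (prompt-lookup) speculative method is configured through --speculative-config, roughly like this (the token counts are illustrative, and I haven’t verified how the arctic-inference suffix method is wired in):

```bash
# Built-in ngram / prompt-lookup speculative decoding in vLLM.
# The token counts here are illustrative, not tuned values; the
# arctic-inference "suffix decoding" variant has its own configuration.
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 8, "prompt_lookup_max": 4}'
```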
It sounds similar to spec decoding for some of the models in vLLM, like GLM-4.7. I’m on the fence about those: the performance becomes very uneven. Sometimes it’s faster, then it slows down, so on average it’s pretty much the same. I haven’t tried the llama.cpp implementation though.
I feel like even with this feature, vLLM will still be ahead for coding/agentic flows because of generally much faster prompt processing.
This speculation is based on the previous conversation history, not a small decoder head or draft model. The video in the PR shows how crazy fast this can be, because it’s not predicting a couple of tokens ahead, it’s predicting dozens of tokens ahead.
For batch-size-1 tasks, predicting only a few tokens ahead rarely gives a real speedup with MoE models because you’re still bandwidth-bound. But you’ve seen how much faster prompt processing is than token generation, and verifying a long run of speculated tokens is essentially prompt processing, so there’s a breakeven point past which you end up much faster even at batch size 1.
Nice post, thank you for this. How much memory does this take up with the KV cache? Interested to see what else I could run at the same time for a specialist coding stack on a single Spark.
I’m running at 0.8 memory utilization, so ~92GB. We need to wait for AWQ/FP4 quants to be able to fit into a smaller memory footprint (and also make it run 2x faster).
No, a fresh build didn’t help. Looks like there’s a bug in the Triton implementation. I tried to force the FlashInfer CUTLASS MoE path, but it failed with `NotImplementedError: Found VLLM_USE_FLASHINFER_MOE_FP8=1, but no FlashInfer FP8 MoE backend supports the configuration.`
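(To be clear, by “force” I just mean setting the env var from that error message on top of the normal launch; the model ID below is a placeholder:)

```bash
# "Forcing" the FlashInfer CUTLASS MoE path = setting the env var from the
# error message on top of the usual launch; the model ID is a placeholder.
# On this setup it fails with the NotImplementedError quoted above.
VLLM_USE_FLASHINFER_MOE_FP8=1 \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
```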
Well, even NVFP4 quants don’t work in the cluster. The only thing that makes it work with two nodes is --enforce-eager, but that kills performance, so it ends up worse than a single node. Setting up an allocator as suggested in the error message didn’t work either; I guess the Triton initialization is a bit more complex, so it needs more troubleshooting, and I don’t have time for that.
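For reference, the two-node attempt boils down to something like this (a sketch only: the NVFP4 checkpoint name is a placeholder, and these are the plain vLLM flags rather than the exact launch-cluster invocation):

```bash
# Sketch of the two-node NVFP4 attempt; the checkpoint name is a placeholder
# and this is the plain vLLM form, not the exact launch-cluster command.
# It only runs with --enforce-eager, which disables CUDA graphs and hurts throughput.
vllm serve <nvfp4-quantized-checkpoint> \
  --tensor-parallel-size 2 \
  --enforce-eager
```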
@eugr what would be awesome is a way to document benchmarks for specific models and setups.
Maybe in your /spark-vllm-docker docs, or a shared sheet with models and benchmarks similar to what you posted in this thread. That would help out a ton.
To take it a step further, include the specific ./build-and-copy.sh and ./launch-cluster commands that worked with each.
The reason is that certain build & launch parameters may work at one point but stop working later (with nightly builds / wheels, etc.).
It would also allow us to help fine-tune things and push benchmarks past the currently posted numbers (t/s).
Yes, I’m actually working on it. I have a lot of notes in different places, trying to organize them now.
There is also a PR by @raphael.amorim, which we are working on merging, that adds “model recipes” - launch templates that allow almost “one-click” launching of models.
Well, unfortunately it gives the same triton.allocator error on my system.
I wonder if it’s somehow connected to the fastsafetensors workaround that I’m using for cluster setups. I’ll try to build without it and see if it works.