HOW-TO: Run Qwen3-Coder-Next on Spark

Qwen just released their new coding model - Qwen3-Coder-Next.

The good news is that the native FP8 version is supported out of the box in our community Docker image and performs reasonably well at ~43 t/s on a single Spark.

Please note that if you launch with the parameters from the model card, vLLM will disable prefix caching, which really hurts coding workflows because the prompt gets re-processed on every request. Also, by default it uses the FLASH_ATTN backend, which leaves room for only ~60K tokens of context at 0.8 memory utilization. With the FLASHINFER backend the KV cache fits ~170K tokens, without even quantizing it to FP8!
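(If your vLLM build doesn't accept the --attention-backend flag used in the command below, the backend can usually be forced through an environment variable instead; exact behavior varies by vLLM version, so treat this as a hedged alternative.)

```bash
# Alternative way to select the FlashInfer attention backend on builds
# that predate the CLI flag.
export VLLM_ATTENTION_BACKEND=FLASHINFER
```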

Here is how you can run it with prefix caching enabled. vLLM says prefix caching support for this architecture is experimental, but it seems to work OK:

Using GitHub - eugr/spark-vllm-docker (Docker configuration for running VLLM on dual DGX Sparks):

```bash
./launch-cluster.sh --solo \
exec vllm serve Qwen/Qwen3-Coder-Next-FP8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 --port 8888 \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --enable-prefix-caching
```
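
Once it's up, a quick sanity check against the OpenAI-compatible API (host/port as in the command above):

```bash
# List the served model
curl -s http://localhost:8888/v1/models

# Minimal chat completion request
curl -s http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Coder-Next-FP8",
        "messages": [{"role": "user", "content": "Write hello world in Python."}],
        "max_tokens": 64
      }'
```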

Benchmarks (these are with the FLASH_ATTN backend; I'm rerunning them with FLASHINFER, but the numbers shouldn't differ too much):

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 | 3006.54 ± 72.99 | 683.87 ± 16.66 | 681.47 ± 16.66 | 683.90 ± 16.65 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 | 42.68 ± 0.57 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d4096 | 3019.83 ± 81.96 | 1359.78 ± 37.52 | 1357.39 ± 37.52 | 1359.80 ± 37.52 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d4096 | 42.84 ± 0.14 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d4096 | 2368.35 ± 46.78 | 867.47 ± 17.30 | 865.08 ± 17.30 | 867.51 ± 17.30 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 @ d4096 | 42.12 ± 0.40 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d8192 | 3356.63 ± 32.43 | 2443.17 ± 23.69 | 2440.77 ± 23.69 | 2443.21 ± 23.68 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d8192 | 41.97 ± 0.05 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d8192 | 2723.63 ± 22.21 | 754.38 ± 6.12 | 751.99 ± 6.12 | 754.41 ± 6.12 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 @ d8192 | 41.56 ± 0.12 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d16384 | 3255.68 ± 17.66 | 5034.97 ± 27.35 | 5032.58 ± 27.35 | 5035.02 ± 27.35 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d16384 | 40.44 ± 0.26 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d16384 | 2502.11 ± 49.83 | 821.22 ± 16.12 | 818.83 ± 16.12 | 821.26 ± 16.12 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 @ d16384 | 40.22 ± 0.03 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d32768 | 3076.52 ± 12.46 | 10653.55 ± 43.19 | 10651.16 ± 43.19 | 10653.61 ± 43.19 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d32768 | 37.93 ± 0.04 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d32768 | 2161.97 ± 18.51 | 949.75 ± 8.12 | 947.36 ± 8.12 | 949.78 ± 8.12 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 @ d32768 | 37.20 ± 0.36 | | | |

llama-benchy (0.1.2)
date: 2026-02-03 10:50:37 | latency mode: api

Now, for comparison, here is what happens if you don't specify --enable-prefix-caching in the vLLM parameters:

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 | 3743.54 ± 28.64 | 550.02 ± 4.17 | 547.11 ± 4.17 | 550.06 ± 4.18 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 | 44.63 ± 0.05 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d4096 | 3819.92 ± 28.92 | 1075.25 ± 8.14 | 1072.34 ± 8.14 | 1075.29 ± 8.15 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d4096 | 44.15 ± 0.09 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d4096 | 1267.04 ± 13.75 | 1619.46 ± 17.59 | 1616.55 ± 17.59 | 1619.49 ± 17.59 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 @ d4096 | 43.41 ± 0.38 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d8192 | 3723.15 ± 29.73 | 2203.34 ± 17.48 | 2200.43 ± 17.48 | 2203.38 ± 17.48 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d8192 | 43.14 ± 0.07 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d8192 | 737.40 ± 3.90 | 2780.31 ± 14.71 | 2777.40 ± 14.71 | 2780.35 ± 14.72 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 @ d8192 | 42.71 ± 0.04 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d16384 | 3574.05 ± 11.74 | 4587.12 ± 15.02 | 4584.21 ± 15.02 | 4587.15 ± 15.01 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d16384 | 41.52 ± 0.03 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d16384 | 393.58 ± 0.69 | 5206.47 ± 9.16 | 5203.56 ± 9.16 | 5214.69 ± 20.61 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 @ d16384 | 41.09 ± 0.01 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d32768 | 3313.36 ± 0.57 | 9892.57 ± 1.69 | 9889.66 ± 1.69 | 9892.61 ± 1.69 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d32768 | 38.82 ± 0.04 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d32768 | 193.06 ± 0.12 | 10610.91 ± 6.33 | 10608.00 ± 6.33 | 10610.94 ± 6.34 |
| Qwen/Qwen3-Coder-Next-FP8 | tg128 @ d32768 | 38.47 ± 0.02 | | | |

llama-benchy (0.1.2)
date: 2026-02-03 11:14:29 | latency mode: api

As you can see, follow-up requests are much slower this way, because there are 0% cache hits.
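
If you want to verify this on your own setup, vLLM exposes Prometheus metrics on the same port; the exact metric names vary between versions, so just grep for the prefix-cache counters after sending the same prompt twice (a rough sketch):

```bash
# The hit counters should increase on the second identical request when
# prefix caching is enabled (metric names differ across vLLM versions).
curl -s http://localhost:8888/metrics | grep -i prefix_cache
```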


I assume there are performance gains to be had with a cluster (plus you can bump to max context)?

Impressive figures.

Yeah, I’m going to benchmark on dual Sparks now :)
Unlike NVFP4, FP8 pathways work really well on Sparks :)

I’ll monitor this thread :)

I don't know if you've seen ngram-mod or not, but it can really make LLMs fly in certain iterative coding tasks, the kind that also come up in agentic workflows where an LLM reads a file and then modifies it.

I briefly tested it under llama.cpp against this model earlier, and it was flying… and then it crashed. There is talk of disabling ngram-mod for this model.

If they fix ngram-mod for Qwen3-Next models, it would be a hard choice between vLLM and llama-server. I think vLLM should consider implementing the same feature.

vLLM has some kind of "suffix decoding" speculative decoding via "arctic-inference", which might be similar, but I haven't tried it. The fact that I've never heard anyone mention it doesn't inspire much confidence, but maybe it's great.
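
For what it's worth, recent vLLM versions also have a built-in prompt-lookup/n-gram speculative mode selectable via --speculative-config. This is only a hedged sketch: the exact JSON keys and draft-length values depend on your vLLM version, and I haven't verified it against this model:

```bash
# Hypothetical addition to the serve command above; flag shape follows
# recent vLLM releases and the numbers are illustrative.
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 8, "prompt_lookup_max": 4}'
```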

Well, when running in the cluster, vLLM crashes with:

RuntimeError: Kernel requires a runtime memory allocation, but no allocator was set. Use triton.set_allocator to specify an allocator.

I’ll rebuild the image using the most recent vLLM commit and try again.

It sounds similar to the spec decoding vLLM has for some models, like GLM-4.7. I'm on the fence about those. The performance becomes very uneven: sometimes it's faster, then it slows down, so on average it's pretty much the same. I haven't tried the llama.cpp implementation though.

I feel like even with this feature, vLLM will still be ahead for coding/agentic flows because of generally much faster prompt processing.

This speculation is based on the previous history of the conversation, not on a small decoder head or a draft model. The video in the PR shows how crazy fast this can be, because it's not predicting a couple of tokens ahead, it's predicting dozens of tokens ahead.


For batch-size-1 tasks, predicting only a few tokens ahead doesn't give much of a real speedup with MoE models, because you're still constrained by memory bandwidth. But you've seen how much faster prompt processing is than token generation, so there is a breakeven point beyond which speculation is much faster even at batch size 1.
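
A rough back-of-envelope using the numbers above illustrates the breakeven (throughputs approximated from the pp2048/tg128 results; purely illustrative):

```bash
# Cost of verifying N drafted tokens at prefill speed vs. generating them
# one at a time at decode speed.
awk 'BEGIN { pp=3000; tg=42; n=16; printf "verify %d drafted tokens: ~%.1f ms; generate them instead: ~%.1f ms\n", n, n/pp*1000, n/tg*1000 }'
```

So even if only a fraction of a long draft gets accepted, you come out ahead, as long as verification stays on the prefill-speed path.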

Ah, OK, that makes sense. Interesting!

Nice post, thank you for this. How much memory does this take up with the KV cache? Interested to see what else I could run at the same time for a specialist coding stack on a single Spark.

Many thanks,

Mark

I'm running at 0.8 memory utilization, so ~92 GB. We need to wait for AWQ/FP4 quants to fit it into a smaller memory footprint (and also make it run ~2x faster).
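
If you want to leave headroom for a second model on the same Spark, the main knob is --gpu-memory-utilization; lowering it shrinks the KV cache (and usable context) accordingly. A sketch with an illustrative value:

```bash
# Same launch as above, but reserving roughly 40% of memory for other
# workloads (0.6 is illustrative; max context shrinks with it).
./launch-cluster.sh --solo \
exec vllm serve Qwen/Qwen3-Coder-Next-FP8 \
    --gpu-memory-utilization 0.6 \
    --attention-backend flashinfer \
    --enable-prefix-caching \
    --host 0.0.0.0 --port 8888
```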

No, a fresh build didn't help. Looks like there's a bug in the Triton implementation. I tried to force the FlashInfer CUTLASS MoE path, but it failed with NotImplementedError: Found VLLM_USE_FLASHINFER_MOE_FP8=1, but no FlashInfer FP8 MoE backend supports the configuration.

I guess I need to build with this PR: feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support by seli-equinix · Pull Request #31740 · vllm-project/vllm · GitHub

Still getting the same error. Will see what can be done later…

Well, even NVFP4 quants don't work in the cluster. The only thing that makes it work with two nodes is --enforce-eager, but that kills performance, so it ends up worse than a single node. Setting up an allocator as suggested in the error message didn't work either; I guess Triton initialization is a bit more involved, so it needs more troubleshooting, and I don't have time for that.

You could give the AWQ version of it by bullpoint a try on your dual-box setup:

It was the first thing I tested at work on an H100 NVL this morning, to make my colleagues happy.

Next up: a comparison of the GGUF with llama.cpp vs. the vLLM AWQ on my Spark - the GGUF is still coming down my line.

Downloading now :)


Testing out now. Will respond back shortly!

@eugr what would be awesome is a way to document benchmarks for specific models and setups.

Maybe in your spark-vllm-docker docs, or in a shared sheet with models and benchmarks similar to what you posted in this thread. That would help out a ton.

| Model name | Cluster (t/s) | Single (t/s) | Comment |
|---|---|---|---|
| Qwen/Qwen3-VL-32B-Instruct-FP8 | 12.00 | 7.00 | |
| cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit | 21.00 | 12.00 | |
| GPT-OSS-120B | 55.00 | 36.00 | SGLang gives 75/53 |
| RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 | 21.00 | N/A | |
| QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ | 26.00 | N/A | |
| Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 | 65.00 | 52.00 | |
| QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ | 97.00 | 82.00 | |
| RedHatAI/Qwen3-30B-A3B-NVFP4 | 75.00 | 64.00 | |
| QuantTrio/MiniMax-M2-AWQ | 41.00 | N/A | |
| QuantTrio/GLM-4.6-AWQ | 17.00 | N/A | |
| zai-org/GLM-4.6V-FP8 | 24.00 | N/A | |

To take it a step further, include the specific ./build-and-copy.sh and ./launch-cluster command runs that worked with each.

The reason is that certain build & launch parameters may work at one point but stop working at a later date (nightly builds / wheels, etc.).

It would also allow us to help out in fine-tuning and pushing benchmarks past the currently posted t/s numbers.
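
For example, even a per-model shell script that pins a known-good launch command would cover most of it (purely a hypothetical sketch; the file name and layout are made up, the command is the one from the top of this thread):

```bash
#!/usr/bin/env bash
# recipes/qwen3-coder-next-fp8.sh (hypothetical name/location)
# Known-good launch on a single Spark as of 2026-02-03, community image.
set -euo pipefail

./launch-cluster.sh --solo \
exec vllm serve Qwen/Qwen3-Coder-Next-FP8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 --port 8888 \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --enable-prefix-caching
```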

Yes, I’m actually working on it. I have a lot of notes in different places, trying to organize them now.
There is also a PR by @raphael.amorim that we're working on merging, which adds "model recipes": launch templates that allow almost "one-click" launching of models.

Well, unfortunately it gives the same triton.allocator error on my system.

I wonder if it’s somehow connected to the fastsafetensors workaround that I’m using for cluster setups. I’ll try to build without it and see if it works.