The good news is that the native FP8 version is supported out of the box in our community Docker and performs reasonably well at ~43 t/s on a single Spark.
Please note that if you launch with the parameters from the model card, vLLM will disable prefix caching, which really hurts coding workflows because the prompt gets re-processed on every request. Also, by default it uses the FLASH_ATTN backend, which fits only ~60K tokens of context at 0.8 memory utilization. With the FlashInfer backend, the KV cache fits ~170K tokens without quantizing to fp8!
Here is how you can run with prefix caching enabled. vLLM says that prefix caching support for this architecture is experimental, but it seems to work OK:
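A minimal sketch of what that looks like (the model ID and the context length are assumptions on my part; --enable-prefix-caching, --gpu-memory-utilization, --max-model-len, and the VLLM_ATTENTION_BACKEND variable are standard vLLM options):

```bash
# Minimal sketch, not the exact command from this setup: the model ID and
# the --max-model-len value are assumptions; adjust them to your environment.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.8 \
  --max-model-len 170000
```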
I don’t know if you’ve seen ngram-mod or not, but it really can make LLMs fly in certain iterative coding tasks, which also come up in agentic workflows where an LLM reads a file and then modifies it.
If they fix ngram-mod for Qwen3-Next models, it would be a hard choice between vLLM and llama-server. I think vLLM should consider implementing the same feature.
vLLM has some kind of “suffix decoding” specdec via “arctic-inference” which might be similar, but I haven’t tried it, and the fact that I’ve never heard anyone mention it doesn’t inspire much confidence. Maybe it’s great, though.
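For comparison, vLLM’s built-in ngram (prompt-lookup) speculative method is configured through --speculative-config, roughly like this (the token counts are illustrative, and I haven’t verified how the arctic-inference suffix method is wired in):

```bash
# Built-in ngram / prompt-lookup speculative decoding in vLLM.
# The token counts here are illustrative, not tuned values; the
# arctic-inference "suffix decoding" variant has its own configuration.
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 8, "prompt_lookup_max": 4}'
```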
It sounds similar to spec decoding for some of the models in vLLM, like GLM-4.7. I’m on the fence about those: the performance becomes very uneven. Sometimes it’s faster, then it slows down, so on average it’s pretty much the same. I haven’t tried the llama.cpp implementation though.
I feel like even with this feature, vLLM will still be ahead for coding/agentic flows because of generally much faster prompt processing.
This speculation is based on the previous conversation history, not a small decoder head or draft model. The video in the PR shows how crazy fast this can be, because it’s not predicting a couple of tokens ahead, it’s predicting dozens of tokens ahead.
For batch-size-1 tasks, predicting only a few tokens ahead rarely gives a real speedup with MoE models because you’re still bandwidth-bound. But you’ve seen how much faster prompt processing is than token generation, and verifying a long run of speculated tokens is essentially prompt processing, so there’s a breakeven point past which you end up much faster even at batch size 1.
Nice post, thank you for this. How much memory does this take up with the KV cache? Interested to see what else I could run at the same time for a specialist coding stack on a single Spark.
I’m running at 0.8 memory utilization, so ~92GB. We need to wait for AWQ/FP4 quants to be able to fit into a smaller memory footprint (and also make it run 2x faster).
No, a fresh build didn’t help. Looks like there’s a bug in the Triton implementation. I tried to force the FlashInfer CUTLASS MoE path, but it failed with `NotImplementedError: Found VLLM_USE_FLASHINFER_MOE_FP8=1, but no FlashInfer FP8 MoE backend supports the configuration.`
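(To be clear, by “force” I just mean setting the env var from that error message on top of the normal launch; the model ID below is a placeholder:)

```bash
# "Forcing" the FlashInfer CUTLASS MoE path = setting the env var from the
# error message on top of the usual launch; the model ID is a placeholder.
# On this setup it fails with the NotImplementedError quoted above.
VLLM_USE_FLASHINFER_MOE_FP8=1 \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
```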
Well, even NVFP4 quants don’t work in the cluster. The only thing that makes it work with two nodes is --enforce-eager, but that kills performance, so it ends up worse than a single node. Setting up an allocator as suggested in the error message didn’t work either; I guess the Triton initialization is a bit more complex, so it needs more troubleshooting, and I don’t have time for that.
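For reference, the two-node attempt boils down to something like this (a sketch only: the NVFP4 checkpoint name is a placeholder, and these are the plain vLLM flags rather than the exact launch-cluster invocation):

```bash
# Sketch of the two-node NVFP4 attempt; the checkpoint name is a placeholder
# and this is the plain vLLM form, not the exact launch-cluster command.
# It only runs with --enforce-eager, which disables CUDA graphs and hurts throughput.
vllm serve <nvfp4-quantized-checkpoint> \
  --tensor-parallel-size 2 \
  --enforce-eager
```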
@eugr what would be awesome is a way to document benchmarks for specific models and setups.
Maybe in your /spark-vllm-docker docs, or a shared sheet with models and benchmarks similar to what you posted in this thread. That would help out a ton.
To take it a step further, include the specific ./build-and-copy.sh and ./launch-cluster commands that worked with each.
The reason is that certain build & launch parameters may work at one point but stop working later (with nightly builds / wheels, etc.).
It would also allow us to help fine-tune things and push benchmarks past the currently posted numbers (t/s).
Yes, I’m actually working on it. I have a lot of notes in different places, trying to organize them now.
There is also a PR by @raphael.amorim, which we are working on merging, that adds “model recipes” - launch templates that allow almost “one-click” launching of models.
Well, unfortunately it gives the same triton.allocator error on my system.
I wonder if it’s somehow connected to the fastsafetensors workaround that I’m using for cluster setups. I’ll try to build without it and see if it works.