HOW-TO: Run Qwen3-Coder-Next on Spark

eugr · February 4, 2026, 7:10pm

Nope, reverting fastsafetensors patch didn’t help either. Looks like it’s a bug in the custom Triton code that is used by this model that only manifests when running in Ray environment, and possibly on DGX Spark only. And this code is getting executed regardless of the attention or MoE backend too.

I’ll probably open an issue in vLLM for that if I don’t forget - can’t spend any more time on this model now…

eugr · February 4, 2026, 8:11pm

BTW, just merged that PR. We will work on populating the recipes - right now there are only few of them there.

angelespiritu · February 4, 2026, 8:18pm

unsloth has a new dynamic one:

I did a quick run (single Spark):

model	test	t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
unsloth/Qwen3-Coder-Next-FP8-Dynamic	pp2048	2441.67 ± 0.00	930.05 ± 0.00	838.77 ± 0.00	930.15 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	tg128	32.07 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	ctx_pp @ d4096	2216.34 ± 0.00	1939.37 ± 0.00	1848.09 ± 0.00	1939.47 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	ctx_tg @ d4096	31.81 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	pp2048 @ d4096	1759.44 ± 0.00	1255.29 ± 0.00	1164.01 ± 0.00	1255.38 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	tg128 @ d4096	31.46 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	ctx_pp @ d8192	2432.24 ± 0.00	3459.38 ± 0.00	3368.09 ± 0.00	3459.48 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	ctx_tg @ d8192	31.15 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	pp2048 @ d8192	2260.20 ± 0.00	997.40 ± 0.00	906.12 ± 0.00	997.48 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	tg128 @ d8192	30.82 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	ctx_pp @ d16384	2436.46 ± 0.00	6815.80 ± 0.00	6724.51 ± 0.00	6815.86 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	ctx_tg @ d16384	30.11 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	pp2048 @ d16384	1926.10 ± 0.00	1154.57 ± 0.00	1063.29 ± 0.00	1154.65 ± 0.00
unsloth/Qwen3-Coder-Next-FP8-Dynamic	tg128 @ d16384	29.91 ± 0.00

eugr · February 4, 2026, 9:04pm

Interesting, it performs slower than the official FP8 version.

c.molina · February 4, 2026, 10:03pm

I’ve been testing Qwen3-Coder-Next and it works really well overall. In particular, OpenClaw has been very useful — on a single node it honestly feels like it flies.

It would be very interesting to see how it performs on two nodes and how it scales compared to a single Spark setup. If anyone has already tested it in a multi-node configuration, I’d be curious to hear about the results or setup details.

mmos · February 5, 2026, 1:09am

Thanks for posting this one, I’m interested in testing out the model quality. I’m seeing similar performance, but here are the results up to 100K context. I’m using your GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks rebuilt today:

model	test	t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
Qwen/Qwen3-Coder-Next-FP8	pp2048	3396.60 ± 76.40	684.18 ± 13.45	603.26 ± 13.45	684.30 ± 13.43
Qwen/Qwen3-Coder-Next-FP8	tg32	43.98 ± 0.15
Qwen/Qwen3-Coder-Next-FP8	ctx_pp @ d4096	3217.89 ± 119.05	1355.59 ± 48.42	1274.67 ± 48.42	1355.73 ± 48.39
Qwen/Qwen3-Coder-Next-FP8	ctx_tg @ d4096	43.31 ± 0.04
Qwen/Qwen3-Coder-Next-FP8	pp2048 @ d4096	2580.88 ± 44.90	874.69 ± 13.93	793.77 ± 13.93	874.80 ± 13.94
Qwen/Qwen3-Coder-Next-FP8	tg32 @ d4096	42.90 ± 0.16
Qwen/Qwen3-Coder-Next-FP8	ctx_pp @ d8192	3532.87 ± 27.19	2399.85 ± 17.79	2318.93 ± 17.79	2400.00 ± 17.81
Qwen/Qwen3-Coder-Next-FP8	ctx_tg @ d8192	42.45 ± 0.02
Qwen/Qwen3-Coder-Next-FP8	pp2048 @ d8192	3013.17 ± 133.09	761.96 ± 30.81	681.04 ± 30.81	762.10 ± 30.85
Qwen/Qwen3-Coder-Next-FP8	tg32 @ d8192	42.10 ± 0.04
Qwen/Qwen3-Coder-Next-FP8	ctx_pp @ d16384	3391.03 ± 2.93	4912.50 ± 4.17	4831.58 ± 4.17	4912.65 ± 4.16
Qwen/Qwen3-Coder-Next-FP8	ctx_tg @ d16384	40.80 ± 0.07
Qwen/Qwen3-Coder-Next-FP8	pp2048 @ d16384	2846.79 ± 46.02	800.51 ± 11.50	719.59 ± 11.50	800.61 ± 11.49
Qwen/Qwen3-Coder-Next-FP8	tg32 @ d16384	38.28 ± 2.93
Qwen/Qwen3-Coder-Next-FP8	ctx_pp @ d32768	3137.26 ± 13.34	10525.78 ± 44.39	10444.86 ± 44.39	10525.91 ± 44.39
Qwen/Qwen3-Coder-Next-FP8	ctx_tg @ d32768	37.96 ± 0.06
Qwen/Qwen3-Coder-Next-FP8	pp2048 @ d32768	1973.59 ± 466.17	1193.09 ± 315.29	1112.17 ± 315.29	1193.20 ± 315.28
Qwen/Qwen3-Coder-Next-FP8	tg32 @ d32768	37.52 ± 0.05
Qwen/Qwen3-Coder-Next-FP8	ctx_pp @ d65535	2754.67 ± 5.44	23871.52 ± 46.98	23790.60 ± 46.98	23871.65 ± 46.98
Qwen/Qwen3-Coder-Next-FP8	ctx_tg @ d65535	33.37 ± 0.10
Qwen/Qwen3-Coder-Next-FP8	pp2048 @ d65535	1592.82 ± 16.47	1366.82 ± 13.21	1285.91 ± 13.21	1366.92 ± 13.23
Qwen/Qwen3-Coder-Next-FP8	tg32 @ d65535	33.14 ± 0.11
Qwen/Qwen3-Coder-Next-FP8	ctx_pp @ d100000	2410.39 ± 5.73	41568.30 ± 98.69	41487.38 ± 98.69	41568.49 ± 98.66
Qwen/Qwen3-Coder-Next-FP8	ctx_tg @ d100000	29.63 ± 0.06
Qwen/Qwen3-Coder-Next-FP8	pp2048 @ d100000	1189.18 ± 21.12	1803.66 ± 30.22	1722.74 ± 30.22	1803.77 ± 30.21
Qwen/Qwen3-Coder-Next-FP8	tg32 @ d100000	29.41 ± 0.10

llama-benchy (0.1.1)
date: 2026-02-05 01:03:20 | latency mode: generation

arulkumaravel · February 5, 2026, 1:17am

Thanks for the post and github repo for vllm container. Got this model working on a single spark machine. how do I measure performance in terms of tokens/s. Logs in the server show different tokens/s for a taks I gave. Does anyone know what is the average token/s claude code opus does with API

eugr · February 5, 2026, 2:17am

FYI: I submitted a bug to vLLM team: [Bug]: Qwen3-Coder-Next fails with Triton allocator error on DGX Spark cluster (GB10, sm121) · Issue #33857 · vllm-project/vllm · GitHub

Keyper-AI · February 5, 2026, 8:59am

Looks great @eugr. Good work.

Is it possible to add –load-format to the list of possible overrides in recipes?

I can never get fastsafetensors to work. Is there something I am missing there?

I always get the UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True error

Also, I owe you a beer. The –eth-if & –ib-ifsaved my life. I have another subnet going between my PC & Sparks and couldn’t get anything to load. But once I figured out I could plug those variables in, was a huge weight off my shoulders. Appreciate it!

I’m going to try and see if I can cluster My Threadripper PC with 2X 5090 with the 2X Sparks. It only has 100GB ConnectX-5 though, so I am not sure if it has the juice.

eugr · February 5, 2026, 4:58pm

Does the model load? This message is normal and expected on Spark as it doesn’t support GDS. Even without GDS, fastsafetensors are much faster.

Yeah, it’s a good idea, can you open an issue in the tracker, so we don’t forget?

Topic		Replies	Views
DGX Spark + Qwen3-Next-80B: Proven Performance, But Missing Clear Path to NIM, TensorRT-LLM & Web UIs DGX Spark / GB10 cuda , nim , llama	10	1465	January 25, 2026
Running Step-3.5-Flash on Single Spark DGX Spark / GB10 Projects jetson , llama	20	930	February 9, 2026
From 20 to 35 TPS on Qwen3-Next-NVFP4 w/ FlashInfer 12.1f DGX Spark / GB10	10	933	January 7, 2026
New bleeding-edge vLLM Docker Image: avarok/vllm-nvfp4-gb10-sm120 DGX Spark / GB10 Projects	35	1518	December 31, 2025
Question on Inference Performance Results of Qwen3 235B A22B on 2× DGX Spark DGX Spark / GB10 cuda	5	383	December 19, 2025
DGX Spark performance DGX Spark / GB10	45	1631	February 10, 2026
Install and Use vLLM for Inference on two Sparks does not work DGX Spark / GB10	159	3666	December 9, 2025
Some new development work for Qwen3 on the Spark DGX Spark / GB10	5	334	February 3, 2026
NVIDIA folks -- where is this promised nvfp4 speedup? DGX Spark / GB10	24	1219	January 11, 2026
How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker DGX Spark / GB10	28	2722	January 2, 2026

HOW-TO: Run Qwen3-Coder-Next on Spark

Related topics