HOW-TO: Run Qwen3-Coder-Next on Spark

Nope, reverting the fastsafetensors patch didn’t help either. Looks like it’s a bug in the custom Triton code used by this model, and it only manifests when running in a Ray environment, possibly only on DGX Spark. That code gets executed regardless of the attention or MoE backend, too.

I’ll probably open an issue in vLLM for that if I don’t forget - can’t spend any more time on this model now…

BTW, just merged that PR. We will work on populating the recipes - right now there are only a few of them there.


unsloth has a new dynamic one:

I did a quick run (single Spark):

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | pp2048 | 2441.67 ± 0.00 | 930.05 ± 0.00 | 838.77 ± 0.00 | 930.15 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | tg128 | 32.07 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_pp @ d4096 | 2216.34 ± 0.00 | 1939.37 ± 0.00 | 1848.09 ± 0.00 | 1939.47 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_tg @ d4096 | 31.81 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | pp2048 @ d4096 | 1759.44 ± 0.00 | 1255.29 ± 0.00 | 1164.01 ± 0.00 | 1255.38 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | tg128 @ d4096 | 31.46 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_pp @ d8192 | 2432.24 ± 0.00 | 3459.38 ± 0.00 | 3368.09 ± 0.00 | 3459.48 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_tg @ d8192 | 31.15 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | pp2048 @ d8192 | 2260.20 ± 0.00 | 997.40 ± 0.00 | 906.12 ± 0.00 | 997.48 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | tg128 @ d8192 | 30.82 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_pp @ d16384 | 2436.46 ± 0.00 | 6815.80 ± 0.00 | 6724.51 ± 0.00 | 6815.86 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_tg @ d16384 | 30.11 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | pp2048 @ d16384 | 1926.10 ± 0.00 | 1154.57 ± 0.00 | 1063.29 ± 0.00 | 1154.65 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | tg128 @ d16384 | 29.91 ± 0.00 | | | |

Interesting, it performs slower than the official FP8 version.
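For anyone who wants to reproduce a similar quick run, here is a minimal offline sketch using vLLM's Python API, assuming the unsloth checkpoint name from the table above; the engine arguments are illustrative placeholders, not the exact benchmark configuration:

```python
# Hypothetical quick check of the unsloth FP8-dynamic checkpoint with vLLM's
# offline API; engine arguments here are placeholders, not the benchmark config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/Qwen3-Coder-Next-FP8-Dynamic",
    max_model_len=16384,          # keep the KV cache small for a smoke test
    gpu_memory_utilization=0.90,
)

out = llm.generate(
    ["Write a Python function that reverses a linked list."],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(out[0].outputs[0].text)
```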


I’ve been testing Qwen3-Coder-Next and it works really well overall. In particular, OpenClaw has been very useful — on a single node it honestly feels like it flies.

It would be very interesting to see how it performs on two nodes and how it scales compared to a single Spark setup. If anyone has already tested it in a multi-node configuration, I’d be curious to hear about the results or setup details.

Thanks for posting this one; I’m interested in testing out the model quality. I’m seeing similar performance, but here are the results up to 100K context. I’m using your eugr/spark-vllm-docker repo (Docker configuration for running vLLM on dual DGX Sparks), rebuilt today:

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 | 3396.60 ± 76.40 | 684.18 ± 13.45 | 603.26 ± 13.45 | 684.30 ± 13.43 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 | 43.98 ± 0.15 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d4096 | 3217.89 ± 119.05 | 1355.59 ± 48.42 | 1274.67 ± 48.42 | 1355.73 ± 48.39 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d4096 | 43.31 ± 0.04 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d4096 | 2580.88 ± 44.90 | 874.69 ± 13.93 | 793.77 ± 13.93 | 874.80 ± 13.94 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d4096 | 42.90 ± 0.16 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d8192 | 3532.87 ± 27.19 | 2399.85 ± 17.79 | 2318.93 ± 17.79 | 2400.00 ± 17.81 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d8192 | 42.45 ± 0.02 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d8192 | 3013.17 ± 133.09 | 761.96 ± 30.81 | 681.04 ± 30.81 | 762.10 ± 30.85 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d8192 | 42.10 ± 0.04 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d16384 | 3391.03 ± 2.93 | 4912.50 ± 4.17 | 4831.58 ± 4.17 | 4912.65 ± 4.16 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d16384 | 40.80 ± 0.07 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d16384 | 2846.79 ± 46.02 | 800.51 ± 11.50 | 719.59 ± 11.50 | 800.61 ± 11.49 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d16384 | 38.28 ± 2.93 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d32768 | 3137.26 ± 13.34 | 10525.78 ± 44.39 | 10444.86 ± 44.39 | 10525.91 ± 44.39 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d32768 | 37.96 ± 0.06 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d32768 | 1973.59 ± 466.17 | 1193.09 ± 315.29 | 1112.17 ± 315.29 | 1193.20 ± 315.28 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d32768 | 37.52 ± 0.05 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d65535 | 2754.67 ± 5.44 | 23871.52 ± 46.98 | 23790.60 ± 46.98 | 23871.65 ± 46.98 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d65535 | 33.37 ± 0.10 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d65535 | 1592.82 ± 16.47 | 1366.82 ± 13.21 | 1285.91 ± 13.21 | 1366.92 ± 13.23 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d65535 | 33.14 ± 0.11 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d100000 | 2410.39 ± 5.73 | 41568.30 ± 98.69 | 41487.38 ± 98.69 | 41568.49 ± 98.66 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d100000 | 29.63 ± 0.06 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d100000 | 1189.18 ± 21.12 | 1803.66 ± 30.22 | 1722.74 ± 30.22 | 1803.77 ± 30.21 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d100000 | 29.41 ± 0.10 | | | |

llama-benchy (0.1.1)
date: 2026-02-05 01:03:20 | latency mode: generation


Thanks for the post and the GitHub repo for the vLLM container. I got this model working on a single Spark machine. How do I measure performance in terms of tokens/s? The server logs show different tokens/s numbers for a task I gave it. Does anyone know what average tokens/s Claude Code gets with Opus via the API?
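A rough way to measure it yourself is to time a single request against vLLM's OpenAI-compatible endpoint and divide the completion token count reported in the usage field by the elapsed wall time. A minimal sketch, assuming the server is on localhost:8000 and the served model name matches the one below:

```python
# Rough tokens/s measurement against a local vLLM OpenAI-compatible endpoint.
# Assumes vLLM is serving on localhost:8000; adjust base_url and model name
# to match your setup.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next-FP8",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

completion_tokens = resp.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tok/s (includes prefill time)")
```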


FYI: I submitted a bug to the vLLM team: [Bug]: Qwen3-Coder-Next fails with Triton allocator error on DGX Spark cluster (GB10, sm121) · Issue #33857 · vllm-project/vllm · GitHub


Looks great @eugr. Good work.

Is it possible to add --load-format to the list of possible overrides in recipes?

I can never get fastsafetensors to work. Is there something I am missing there?

I always get this warning: `UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True`

Also, I owe you a beer. The --eth-if & --ib-if options saved my life. I have another subnet going between my PC & the Sparks and couldn’t get anything to load. But once I figured out I could plug those variables in, it was a huge weight off my shoulders. Appreciate it!

I’m going to try and see if I can cluster my Threadripper PC with 2x 5090s together with the 2x Sparks. It only has a 100Gb ConnectX-5 though, so I’m not sure if it has the juice.

Does the model load? That warning is normal and expected on Spark, as it doesn’t support GDS. Even without GDS, fastsafetensors is much faster.
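For reference, the same loader can be selected through vLLM's Python API as well as the CLI flag. A minimal sketch, assuming a vLLM build that supports the fastsafetensors load format (the model name is just the one used earlier in this thread):

```python
# Sketch: enabling the fastsafetensors loader via vLLM's offline API.
# load_format mirrors the --load-format CLI flag; on Spark the GDS warning
# above is expected and loading proceeds without GDS.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-Coder-Next-FP8",
    load_format="fastsafetensors",
)
print(llm.generate(["Hello"])[0].outputs[0].text)  # quick smoke test
```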

Yeah, it’s a good idea. Can you open an issue in the tracker so we don’t forget?
