Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

mmos · February 28, 2026, 6:40pm

I’ll try to answer as best I can without sharing exact project details. For gpt-oss-120b I was using their recommended sampling parameters and for Qwen3.5 I was also using their recommended sampling parameters (Qwen/Qwen3.5-122B-A10B-FP8 · Hugging Face). My use case involves providing around ~60K of text performing analysis on it with a specifically tuned prompt where the model is instructed to use the provided text as the context and not rely on general knowledge it has been trained on.

In specific cases with gpt-oss-120b (w/ high reasoning enabled) and Qwen 3.5 FP8 outputs, it identified key critical details I expected it to pick up on. Using the same exact code/sampling parameters and the intel int4 quant version, it missed them.

This is based on my own testing and your mileage may vary of course but I just wanted to flag it because I figured given the higher AI scores for this model vs gpt-oss-120b that moving down to int4 wouldn’t have a big difference, but it did in my own testing and use cases.

trystan1 · February 28, 2026, 7:08pm

This is interesting to me because a lot of the responses I’ve seen in the forum gravitate towards single batch performance related metrics when I’m more concerned about accuracy like what you’re describing.

NVFP4 was (to my knowledge) designed to improve on 4 bit quantization in terms of accuracy retention. I’d be curious in your testing if an nvfp4 version of the model instead of the int4 autoround flavor behaves differently.

I’d suspect you’d need to use marlin to avoid the crashing, but in my own testing nvfp4 with the recommended sampling parameters performs identically to the fp8 version.

mmos · February 28, 2026, 8:09pm

I just tested unsloth Qwen3.5-122B-A10B-UD-Q4_K_XL with the latest llama.cpp server and the results looked very similar to me as the intel 4bit quant version so I’m not sure. It could be my specific use cases but for me I think the conclusion I’m coming to is to run the highest FP8 native quant in VLLM from the source I can for these models. I spend too much time already tinkering with this I’m not sure I want to compare and test different outputs from different quants from each model.

vedcsolution · February 28, 2026, 9:18pm

Autoround tuned MOE off

|:---------------------------------------|-------------:|----------------:|-------------:|---------------:|---------------:|----------------:|

| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2 @ d2048 | 3450.23 ± 22.11 | | 604.03 ± 3.93 | 603.17 ± 3.93 | 604.09 ± 3.94 |

| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d2048 | 39.59 ± 1.36 | 40.87 ± 1.40 | | | |

| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2 @ d4096 | 3457.18 ± 15.03 | | 1195.26 ± 5.63 | 1194.41 ± 5.63 | 1195.32 ± 5.63 |

| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d4096 | 40.77 ± 0.15 | 42.09 ± 0.15 | | | |

llama-benchy (0.1.dev90+ge39fc28fb)

date: 2026-02-28 22:12:41 | latency mode: api | | |

Autoround tune MOE on

|:---------------------------------------|-------------:|----------------:|-------------:|---------------:|---------------:|----------------:|

| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2 @ d2048 | 3575.35 ± 11.44 | | 594.74 ± 1.96 | 581.82 ± 1.96 | 594.78 ± 1.96 |

| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d2048 | 40.63 ± 0.60 | 42.22 ± 0.22 | | | |

| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2 @ d4096 | 3522.17 ± 6.53 | | 1185.32 ± 2.34 | 1172.41 ± 2.34 | 1185.38 ± 2.34 |

| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d4096 | 40.85 ± 0.07 | 42.18 ± 0.08 | | | |

llama-benchy (0.1.dev90+ge39fc28fb)

date: 2026-02-28 22:07:35 | latency mode: api

Minor improvements by MoE distributed kernel tuning (vLLM Triton)

joshua.dale.warner · February 28, 2026, 9:35pm

Thanks for the report. I realize the need for some degree of privacy, but if you can indulge I wonder if there may be some more to be learned.

In my experimentation I’ve seen several different types of failures along these lines and inspection of the logs and thinking traces may be illustrative.

[For completeness, seems not relevant for you but is for me] OCR failure. Perception error, the key data was missing or incorrectly captured, so nothing downstream matters.
KV cache quantization or poor attention mechanisms resulting in outright misses (contextual needle-in-haystack failure or perception error; model intelligence is irrelevant). This came to the fore around Llama 4 and occasionally rears its head. Newer gen models are all using more efficient KV cache, which we like but there will be a limit.
Model noted the connection but discarded or talked itself out of it. Judgement issue. This can definitely be affected by quant; possibly also inference-parameter dependent.
Model noted and expanded on the connection.

Only #4 is a success, but the result can be caused by upstream factors. And, because (assuming temp was not 0) the output is non-deterministic, there is a random effect; a repeated run on int4 could catch it.

vedcsolution · February 28, 2026, 9:41pm

Distributed MoE Kernel Tuning on DGX (FP8 + INT4 AutoRound) with vLLM
What we did
Tuned vLLM MoE kernels in a real 2-node Ray cluster (TP=2, EP=off).
Used benchmark_moe.py in baseline + tune mode.
Generated and deployed tuned JSON configs via VLLM_TUNED_CONFIG_FOLDER.
Added checkpoint/resume + fault-tolerant tuning (skip failing Triton configs, continue from progress).
Built startup mods so tuned configs are injected automatically in containers.
Models covered
Qwen/Qwen3.5-122B-A10B-FP8
Intel/Qwen3.5-122B-A10B-int4-AutoRound
Why this matters
This does not retrain the model.
It optimizes the runtime MoE expert kernel parameters (tiling/warps/stages) for your exact:

GPU architecture (DGX/GB10 class),
precision path (FP8 / INT4),
and serving topology (TP/EP).
Result: better prefill/TTFR stability and throughput in production-like conditions.

Main takeaway
The method is broadly applicable to any vLLM MoE model on DGX, as long as tuning is run with the same runtime settings you will serve with (model, dtype, TP/EP, batch profile, backend).

I haven’t really been able to test the quality of the model because I’ve been doing these optimizations.

mmos · February 28, 2026, 9:51pm

I can confirm I am using Qwen’s recommended settings for all of the tests with Qwen (for gpt-oss-120b I used openai’s recommended settings).

Thinking mode for general tasks:
temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

I reviewed a small number of samples and this is just based on my own opinions on these of significant findings that were identified by gpt-oss-120b, Qwen FP8 that were not with the 4bit variants. It certainly may not be indicative on a wider sample I’m not sure, but again I don’t know how much time I want to spend testing and figuring that out to be honest. I did do two separate runs on intel 4bit and the unsloth variant with the same conclusions.

dngettler · March 1, 2026, 12:05am

Lots of options for a single Spark on Ollama now—looks like it was earlier today: Tags · qwen3.5

raphael.amorim · March 1, 2026, 12:09am

Easy, but still too slow on the Spark unfortunately

dngettler · March 1, 2026, 12:27am

What kind of numbers are you getting? I’m still downloading it, qwen3.5:122b in Q4_K_M (80gb) on the Spark. I’m also downloading qwen3.5:35b in Q4_K_M (24gb) on my 5090 which will obviously be much faster, but the Spark wasn’t built for speed, it was built for capacity.

raphael.amorim · March 1, 2026, 2:01am

Prefill / prompt processing (pp2048, ctx_pp, pp2048 @ dXXXX)

cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit on vLLM is much faster than qwen3.5:122b on Ollama throughput:
- vLLM: ~3.82× higher t/s
- Ollama: ~5.68× higher t/s
A has much lower TTFR (B is slower by):
- vLLM median ~3.46× slower TTFR on B
- Ollama: median ~4.53× slower TTFR on B
- Worst overlap case: pp2048 @ d32768 (c2) → Ollama TTFR is 11.48× slower.

Decode / generation (tg128, ctx_tg, tg128 @ dXXXX)

At c1: The results are similar for small prompts
At c2: vLLM’s total decode throughput becomes much higher at longer contexts

If you increase context and concurrency Ollama is simply useless.

Ollama stats (for vLLM check https://spark-arena.com):

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5:122b	pp2048 (c1)	488.64 ± 2.07	488.64 ± 2.07			3749.63 ± 14.58	3747.77 ± 14.58	3749.63 ± 14.58
qwen3.5:122b	tg128 (c1)	18.85 ± 0.36	18.85 ± 0.36	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	pp2048 (c2)	272.47 ± 3.35	313.79 ± 175.66			8738.26 ± 4973.58	8736.40 ± 4973.58	8738.26 ± 4973.58
qwen3.5:122b	tg128 (c2)	14.04 ± 0.14	18.31 ± 0.83	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	ctx_pp @ d4096 (c1)	505.46 ± 4.94	505.46 ± 4.94			7309.47 ± 272.23	7307.61 ± 272.23	7309.47 ± 272.23
qwen3.5:122b	ctx_tg @ d4096 (c1)	18.86 ± 0.19	18.86 ± 0.19	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	pp2048 @ d4096 (c1)	345.01 ± 7.99	345.01 ± 7.99			5941.12 ± 135.37	5939.26 ± 135.37	5941.12 ± 135.37
qwen3.5:122b	tg128 @ d4096 (c1)	18.95 ± 0.03	18.95 ± 0.03	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	ctx_pp @ d4096 (c2)	346.18 ± 10.75	336.80 ± 164.31			14300.40 ± 6943.56	14298.54 ± 6943.56	14300.40 ± 6943.56
qwen3.5:122b	ctx_tg @ d4096 (c2)	11.55 ± 0.28	18.63 ± 0.04	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	pp2048 @ d4096 (c2)	153.97 ± 15.68	154.54 ± 96.72			18355.41 ± 8960.08	18353.56 ± 8960.08	18355.41 ± 8960.08
qwen3.5:122b	tg128 @ d4096 (c2)	10.07 ± 0.11	18.69 ± 0.20	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	ctx_pp @ d8192 (c1)	509.93 ± 0.46	509.93 ± 0.46			14544.25 ± 155.21	14542.39 ± 155.21	14544.25 ± 155.21
qwen3.5:122b	ctx_tg @ d8192 (c1)	18.63 ± 0.21	18.63 ± 0.21	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	pp2048 @ d8192 (c1)	369.29 ± 11.93	369.29 ± 11.93			5553.43 ± 177.56	5551.57 ± 177.56	5553.43 ± 177.56
qwen3.5:122b	tg128 @ d8192 (c1)	19.01 ± 0.57	19.01 ± 0.57	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	ctx_pp @ d8192 (c2)	422.14 ± 0.60	361.01 ± 151.99			25041.63 ± 10342.97	25039.78 ± 10342.97	25041.63 ± 10342.97
qwen3.5:122b	ctx_tg @ d8192 (c2)	8.74 ± 0.09	18.66 ± 0.21	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	pp2048 @ d8192 (c2)	107.80 ± 18.95	122.90 ± 106.29			26640.42 ± 13834.07	26638.57 ± 13834.07	26640.42 ± 13834.07
qwen3.5:122b	tg128 @ d8192 (c2)	7.54 ± 0.14	18.64 ± 0.44	21.00 ± 0.00	21.00 ± 0.00
qwen3.5:122b	ctx_pp @ d16384 (c1)	509.60 ± 0.39	509.60 ± 0.39			29345.19 ± 157.19	29343.33 ± 157.19	29345.19 ± 157.19
qwen3.5:122b	ctx_tg @ d16384 (c1)	18.62 ± 0.02	18.62 ± 0.02	20.33 ± 0.47	20.33 ± 0.47
qwen3.5:122b	pp2048 @ d16384 (c1)	464.19 ± 29.72	464.19 ± 29.72			4432.87 ± 296.86	4431.02 ± 296.86	4432.87 ± 296.86
qwen3.5:122b	tg128 @ d16384 (c1)	18.04 ± 0.00	18.04 ± 0.00	20.00 ± 0.00	20.00 ± 0.00
qwen3.5:122b	ctx_pp @ d16384 (c2)	460.71 ± 1.28	370.05 ± 140.31			46646.49 ± 17578.78	46644.64 ± 17578.78	46646.49 ± 17578.78
qwen3.5:122b	ctx_tg @ d16384 (c2)	5.64 ± 0.04	18.33 ± 0.21	20.67 ± 0.47	20.67 ± 0.47
qwen3.5:122b	pp2048 @ d16384 (c2)	67.09 ± 14.33	79.68 ± 78.58			44077.80 ± 22745.67	44075.94 ± 22745.67	44077.80 ± 22745.67
qwen3.5:122b	tg128 @ d16384 (c2)	5.20 ± 0.02	18.39 ± 0.34	20.00 ± 0.00	20.00 ± 0.00
qwen3.5:122b	ctx_pp @ d32768 (c1)	500.42 ± 1.28	500.42 ± 1.28			59565.81 ± 519.42	59563.95 ± 519.42	59565.81 ± 519.42
qwen3.5:122b	ctx_tg @ d32768 (c1)	17.92 ± 0.31	17.92 ± 0.31	20.00 ± 0.00	20.00 ± 0.00
qwen3.5:122b	pp2048 @ d32768 (c1)	373.27 ± 33.81	373.27 ± 33.81			5533.40 ± 497.03	5531.55 ± 497.03	5533.40 ± 497.03
qwen3.5:122b	tg128 @ d32768 (c1)	17.47 ± 0.30	17.47 ± 0.30	20.00 ± 0.00	20.00 ± 0.00
qwen3.5:122b	ctx_pp @ d32768 (c2)	474.27 ± 1.16	368.90 ± 131.39			92312.66 ± 33010.63	92310.81 ± 33010.63	92312.66 ± 33010.63
qwen3.5:122b	ctx_tg @ d32768 (c2)	3.25 ± 0.01	17.80 ± 0.28	20.00 ± 0.00	20.00 ± 0.00
qwen3.5:122b	pp2048 @ d32768 (c2)	38.09 ± 10.24	76.90 ± 117.60			79299.69 ± 44005.78	79297.83 ± 44005.78	79299.69 ± 44005.78
qwen3.5:122b	tg128 @ d32768 (c2)	3.04 ± 0.09	17.38 ± 0.72	20.00 ± 0.00	20.00 ± 0.00

ekkis · March 1, 2026, 6:04am

I’m using the 122b int4 autoround with Opencode and getting about 15-25 tokens/s in general, prefix caching is essential. I’m also using the 35b q4 xl on a undervolted 4090 and it’s blazing fast, ~120 tokens/s with the same Opencode work. It’s so fast it’s impossible to track the output while it’s working :)

Still having issues with the model on the gx10 just stopping work mid investigation with no feedback, starting to think it’s a vllm issue. The 35b running on llama.cpp has no such issues.

Update: I had openclaw running codex do some investigation, and it concluded that thinking mode was the culprit, and after disabling it via my litellm config it does appear to be working fine, no sudden stops. Codex suggests the cause is too much memory pressure from extended reasoning output:

[openclaw] Most likely root cause: with thinking enabled, the gx10/vLLM path sometimes spends the completion budget on reasoning tokens and hits finish_reason:"length" before emitting final text — so clients see content:null and it feels like a sudden silent stop.

Why it appears “after a few minutes”: session context grows, prompts get heavier, reasoning expands, and you cross a token/format threshold where this failure mode starts.

Dickson · March 1, 2026, 8:37am

i’ve been using the 122b and 35b in openwebui to do some coding test and it goes into a death loop after 2 turns of thinking/generating code. It does this in both llama.cpp and vllm for me.

whpthomas · March 1, 2026, 11:00am

Interesting – I am finding the opposite with with Qwen/Qwen3-Coder-Next-FP8 vs Intel/Qwen3-Coder-Next-int4-AutoRound even with both using FP16 K V cache the AutoRound version seems ‘smarter’ to me. I often ask the same question in Opencode, on the same source snapshot and look at the output and I am finding greater accuracy with AutoRound. This is a subjective judgement I know, but I just did a another run to confirm this and FP8 could not solve the same problem with 5 prompts (1 question + 4 hints), AutoRound solved it with 2 (1 question + 1 correction).

gpieceoffice · March 1, 2026, 11:38am

Hello, has anyone tried running vLLM bench using the non-quantized version of Qwen3.5-27B on a single Spark? I’m only getting about 4.5–5 tok/s, and even assuming it’s a dense model, it seems too slow.

cho · March 1, 2026, 11:55am

cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit · Hugging Face
( eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks)

docker run --gpus all -it --rm \
  -p 8000:8000 \
  -v /home/xfusion/Downloads/vllm/models:/model \
  --name vllm-node-tf5 \
  vllm-node \
  vllm serve /model/Qwen3.5-122B-A10B-AWQ-4bit \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel 1 \
    --gpu-memory-utilization 0.8 \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --max-model-len 262144 \
    --max-num-batched-tokens 8192 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

llama-benchy --base-url http://127.0.0.1:8000/v1 --model /model/Qwen3.5-122B-A10B-AWQ-4bit --tokenizer /home/xfusion/Downloads/vllm/models/Qwen3.5-122B-A10B-AWQ-4bit --pp 512 2048 8192 --tg 32 128 --runs 5

| model                             |   test |              t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:----------------------------------|-------:|-----------------:|-------------:|----------------:|----------------:|----------------:|
| /model/Qwen3.5-122B-A10B-AWQ-4bit |  pp512 | 1201.31 ± 185.51 |              |  443.43 ± 87.86 |  440.76 ± 87.86 |  443.50 ± 87.86 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit |   tg32 |     13.96 ± 0.13 | 14.80 ± 0.40 |                 |                 |                 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit |  pp512 |  1285.66 ± 23.24 |              |   401.97 ± 7.61 |   399.31 ± 7.61 |   402.04 ± 7.60 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit |  tg128 |     14.01 ± 0.08 | 15.00 ± 0.00 |                 |                 |                 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit | pp2048 |   2260.08 ± 7.28 |              |   909.28 ± 2.91 |   906.62 ± 2.91 |   909.35 ± 2.90 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit |   tg32 |     14.39 ± 0.01 | 15.00 ± 0.00 |                 |                 |                 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit | pp2048 |  2234.26 ± 38.79 |              |  920.03 ± 16.58 |  917.37 ± 16.58 |  920.09 ± 16.58 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit |  tg128 |     14.39 ± 0.01 | 15.00 ± 0.00 |                 |                 |                 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit | pp8192 |  2546.50 ± 31.03 |              | 3220.42 ± 39.09 | 3217.76 ± 39.09 | 3220.48 ± 39.09 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit |   tg32 |     14.32 ± 0.01 | 15.00 ± 0.00 |                 |                 |                 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit | pp8192 |   2540.42 ± 4.36 |              |  3227.72 ± 5.55 |  3225.06 ± 5.55 |  3227.79 ± 5.55 |
| /model/Qwen3.5-122B-A10B-AWQ-4bit |  tg128 |     14.30 ± 0.01 | 15.00 ± 0.00 |                 |                 |                 |

Is it normal?

mmos · March 1, 2026, 12:18pm

Maybe my experience is abnormal - check this out from unsloth and what they say about the performance of the 397B param model and 1bit quant?

cho · March 1, 2026, 1:00pm

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/Downloads/vllm/models:/models" \
./launch-cluster.sh --solo -t vllm-node-tf5 \
--apply-mod mods/fix-qwen3.5-autoround \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
--solo exec vllm serve /models/Qwen3.5-122B-A10B-int4-AutoRound \
--max-model-len auto \
--gpu-memory-utilization 0.7 \
--port 8000 \
--host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-num-batched-tokens 8192 \
--trust-remote-code

llama-benchy --base-url http://127.0.0.1:8000/v1 --model /models/Qwen3.5-122B-A10B-int4-AutoRound --tokenizer /home/xfusion/Downloads/vllm/models/Qwen3.5-122B-A10B-int4-AutoRound --pp 512 2048 8192 --tg 32 128 --runs 5

1st run

| model                                    |   test |             t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:-----------------------------------------|-------:|----------------:|-------------:|----------------:|----------------:|----------------:|
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  pp512 | 1315.49 ± 10.05 |              |   391.85 ± 2.98 |   389.99 ± 2.98 |   391.91 ± 2.99 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |   tg32 |    28.35 ± 0.03 | 29.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  pp512 |  1327.76 ± 6.60 |              |   388.38 ± 2.00 |   386.52 ± 2.00 |   388.43 ± 2.01 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  tg128 |    28.42 ± 0.02 | 29.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 | 2162.68 ± 35.69 |              |  949.74 ± 16.18 |  947.89 ± 16.18 |  949.79 ± 16.19 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |   tg32 |    28.18 ± 0.02 | 29.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 |  2182.17 ± 3.14 |              |   940.92 ± 1.37 |   939.07 ± 1.37 |   940.98 ± 1.37 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  tg128 |    28.05 ± 0.34 | 29.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound | pp8192 | 2324.91 ± 52.48 |              | 3527.74 ± 82.60 | 3525.88 ± 82.60 | 3527.82 ± 82.59 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |   tg32 |    27.70 ± 0.02 | 28.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound | pp8192 |  2353.58 ± 3.52 |              |  3482.94 ± 5.46 |  3481.09 ± 5.46 |  3483.01 ± 5.45 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  tg128 |    27.65 ± 0.02 | 28.00 ± 0.00 |                 |                 |                 |

2nd run

| model                                    |   test |             t/s |     peak t/s |      ttfr (ms) |   est_ppt (ms) |   e2e_ttft (ms) |
|:-----------------------------------------|-------:|----------------:|-------------:|---------------:|---------------:|----------------:|
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  pp512 | 1343.42 ± 22.38 |              |  383.57 ± 6.66 |  381.97 ± 6.66 |   383.62 ± 6.66 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |   tg32 |    28.33 ± 0.03 | 29.00 ± 0.00 |                |                |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  pp512 | 1331.18 ± 13.28 |              |  387.01 ± 4.11 |  385.41 ± 4.11 |   387.06 ± 4.11 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  tg128 |    28.36 ± 0.02 | 29.00 ± 0.00 |                |                |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 | 2183.20 ± 19.00 |              |  940.11 ± 8.48 |  938.51 ± 8.48 |   940.16 ± 8.49 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |   tg32 |    28.13 ± 0.01 | 29.00 ± 0.00 |                |                |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 |  2172.00 ± 9.85 |              |  945.09 ± 4.32 |  943.48 ± 4.32 |   945.15 ± 4.33 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  tg128 |    28.18 ± 0.02 | 29.00 ± 0.00 |                |                |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound | pp8192 |  2350.14 ± 2.19 |              | 3487.86 ± 3.32 | 3486.26 ± 3.32 |  3487.94 ± 3.32 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |   tg32 |    27.70 ± 0.05 | 28.00 ± 0.00 |                |                |                 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound | pp8192 |  2348.99 ± 1.16 |              | 3489.57 ± 1.69 | 3487.97 ± 1.69 |  3489.65 ± 1.70 |
| /models/Qwen3.5-122B-A10B-int4-AutoRound |  tg128 |    27.63 ± 0.02 | 28.00 ± 0.00 |                |                |                 |

cho · March 1, 2026, 2:55pm

looks great, could share your vllm serve command-line?

vedcsolution · March 1, 2026, 3:32pm

The recipe has nothing glorious about it if the tuned moe that is loaded in, - mods/fix-qwen3.5autoroundtuned , VLLM_TUNED_CONFIG_FOLDER: /root/.cache/huggingface/moe_tuned_qwen35_tp2_int4_ar_current_v1

Topic		Replies	Views
Qwen3.5-397B-A17B + DGX Spark (duo) DGX Spark / GB10 Projects	62	6029	June 14, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	11220	April 9, 2026
Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark) DGX Spark / GB10 cuda , performance , docker , performance-tuning , llm	431	21097	June 18, 2026
HOW-TO: Run Qwen3-Coder-Next on Spark DGX Spark / GB10 llama	92	10157	March 24, 2026
Qwen3.5-397B-A17B run in dual spark! but I have a concern DGX Spark / GB10	236	9216	June 6, 2026
RedHatAI/Qwen3.5-122B-A10B-NVFP4 seems to be the best option for a single Spark DGX Spark / GB10 Projects llm	75	6265	May 4, 2026
Qwen/Qwen3.6-35B-A3B (and FP8) has landed DGX Spark / GB10 agentic-ai	308	26806	June 9, 2026
Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table? DGX Spark / GB10 agentic-ai	40	5970	March 16, 2026
Fastest Qwen 3.5 122B Int4 recipe on DGX Spark tested and published on Spark-Arena DGX Spark / GB10 llama	59	2854	June 3, 2026
What's the best speed we can get with Qwen 3.6 27B without quantizing? DGX Spark / GB10	32	16142	June 16, 2026

Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

Prefill / prompt processing (pp2048, ctx_pp, pp2048 @ dXXXX)

Decode / generation (tg128, ctx_tg, tg128 @ dXXXX)

Related topics