So I tried to run Qwen3.5 35B A3B FP8 and expected to get at least 70-80 tokens/sec, but I only got 50 tokens/sec.
Disclaimer: I know people got the same result on Spark Arena - LLM Leaderboard; I just want to correct my way of thinking about expected token throughput.
From some articles I read, I gathered that tokens/sec is directly limited by memory bandwidth: the Spark has 273 GB/s, and the model weights have to be streamed to the compute cores for every generated token, which gives me the equation:
TPS = Memory Bandwidth (GB/s) / Model Weights per Token (GB)
Qwen3.5 35B A3B FP8 has 3B active parameters at 1 byte per parameter, so roughly 3 GB of model weights per token.
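To make that step explicit, here is a tiny sketch of the bytes-per-token arithmetic (the helper name and the assumption that only the active parameters are streamed per token are mine, not from any tool):

```python
def active_weights_gb(active_params_billions: float, bytes_per_param: float) -> float:
    """Weights streamed per generated token for an MoE model, in GB.

    Assumes only the *active* parameters (routed experts + shared layers)
    must be read from memory for each token.
    """
    return active_params_billions * bytes_per_param

# FP8 = 1 byte per parameter, 3B active parameters:
print(active_weights_gb(3, 1.0))  # -> 3.0 (GB per token)
```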
So if I use the equation, I should get this result:

TPS = 273 GB/s / 3 GB per token
TPS = 91
So I should get ~91 tokens/sec, and I understand there is overhead from MoE routing, the KV cache, etc., but that overhead works out to roughly 45% (1 − 50/91). Is that a normal overhead value, or am I missing something?
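Putting my whole reasoning in one place, this is the bandwidth-roofline estimate I'm using (numbers are the ones from this post; whether the efficiency I measured is "normal" is exactly my question):

```python
# Bandwidth-roofline sketch: theoretical decode ceiling vs. measured throughput.
bandwidth_gbps = 273.0      # DGX Spark memory bandwidth, GB/s
weights_per_token_gb = 3.0  # 3B active params @ FP8 (1 byte/param)
measured_tps = 50.0         # what I actually observed

ceiling_tps = bandwidth_gbps / weights_per_token_gb  # ~91 tok/s upper bound
efficiency = measured_tps / ceiling_tps              # fraction of the ceiling reached
overhead = 1.0 - efficiency                          # "lost" to routing, KV reads, compute, etc.

print(f"ceiling: {ceiling_tps:.0f} tok/s, efficiency: {efficiency:.0%}, overhead: {overhead:.0%}")
# -> ceiling: 91 tok/s, efficiency: 55%, overhead: 45%
```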
Please feel free to correct me if I'm wrong, because I need it to fix my understanding of LLM inference.
Best regards.