So I tried to run Qwen3.5 35B A3B FP8 and expected to get at least 70-80 tokens/sec, but I only got 50 tokens/sec.
Disclaimer: I know people got the same result on Spark Arena - LLM Leaderboard; I just want to correct my way of thinking about expected token throughput.
From some articles I read, I gathered that tokens/sec is directly limited by memory bandwidth: the Spark has 273 GB/s, and the model weights have to be streamed to the compute cores for every generated token, which gives me the equation:
TPS = Memory Bandwidth (GB/s) / Model Weights per Token (GB)
Qwen3.5 35B A3B FP8 has 3B active parameters at 1 byte per parameter, so roughly 3 GB of model weights per token.
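To make that step explicit, here is a tiny sketch of the bytes-per-token arithmetic (the helper name and the assumption that only the active parameters are streamed per token are mine, not from any tool):

```python
def active_weights_gb(active_params_billions: float, bytes_per_param: float) -> float:
    """Weights streamed per generated token for an MoE model, in GB.

    Assumes only the *active* parameters (routed experts + shared layers)
    must be read from memory for each token.
    """
    return active_params_billions * bytes_per_param

# FP8 = 1 byte per parameter, 3B active parameters:
print(active_weights_gb(3, 1.0))  # -> 3.0 (GB per token)
```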
So if I use the equation, I should get this result:

TPS = 273 GB/s / 3 GB per token
TPS = 91
So I should get ~91 tokens/sec, and I understand there is overhead from MoE routing, the KV cache, etc., but that overhead works out to roughly 45% (1 − 50/91). Is that a normal overhead value, or am I missing something?
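Putting my whole reasoning in one place, this is the bandwidth-roofline estimate I'm using (numbers are the ones from this post; whether the efficiency I measured is "normal" is exactly my question):

```python
# Bandwidth-roofline sketch: theoretical decode ceiling vs. measured throughput.
bandwidth_gbps = 273.0      # DGX Spark memory bandwidth, GB/s
weights_per_token_gb = 3.0  # 3B active params @ FP8 (1 byte/param)
measured_tps = 50.0         # what I actually observed

ceiling_tps = bandwidth_gbps / weights_per_token_gb  # ~91 tok/s upper bound
efficiency = measured_tps / ceiling_tps              # fraction of the ceiling reached
overhead = 1.0 - efficiency                          # "lost" to routing, KV reads, compute, etc.

print(f"ceiling: {ceiling_tps:.0f} tok/s, efficiency: {efficiency:.0%}, overhead: {overhead:.0%}")
# -> ceiling: 91 tok/s, efficiency: 55%, overhead: 45%
```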
Please feel free to correct me if I'm wrong, because I need it to fix my understanding of LLM inference.
Best regards.