Qwen3.5-397B-A17B-int4-AutoRound - 4 x db10 node - updated results 37 - 94 tok/s

8 nodes?

Thanks for sharing! Will try this out.

We had issues with tp=8. We now focus on tp=2 and tp=4 setups only.

Hi, and thank you for your work. I got 397B int4 working on 4 Sparks with a 4x100 Gbit MikroTik switch, starting at circa 35 t/s. What is the limitation with using 8 nodes with an appropriate switch? I am asking because I have run 397B on vLLM even at tp=16 using 16x3090, clustered as 8 nodes with 2x3090 each, linked by 100 Gbit cards.

Starting from the 4-node 397B FP8 recipe, without MTP, I reached 32 t/s, which is very good for FP8. It suggests that if NVFP4 is optimized, 60 t/s should be possible, given that decode is limited by memory bandwidth. I don’t know why MTP 2 slows it down when enabled.
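
A rough way to sanity-check the 60 t/s figure: if decode is purely memory-bandwidth bound, tokens/s should scale inversely with the bits read per weight. The sketch below assumes exactly that, and also assumes an effective 4.5 bits per weight for NVFP4 (4-bit values plus one 8-bit scale per 16-value block). The function name is made up for illustration, and real numbers will also depend on KV cache reads, activations, and network overhead.

```python
def scaled_decode_rate(measured_tps: float, measured_bits: float, target_bits: float) -> float:
    """Scale a measured decode rate by the ratio of bits per weight.

    Assumption: decode is purely memory-bandwidth bound, so halving the
    bytes read per token doubles tokens/s. Everything else is ignored.
    """
    return measured_tps * measured_bits / target_bits


fp8_tps = 32.0  # reported in this thread: 4 nodes, FP8, no MTP

# Ideal case: exactly 4 bits per weight.
ideal = scaled_decode_rate(fp8_tps, 8.0, 4.0)

# Assumed NVFP4 footprint: 4-bit values + 8-bit scale per 16 values = 4.5 bits.
with_scales = scaled_decode_rate(fp8_tps, 8.0, 4.5)

print(f"ideal 4-bit: {ideal:.1f} t/s")
print(f"with block scales: {with_scales:.1f} t/s")
```

Under those assumptions the estimate lands at roughly 57 to 64 t/s, which is in line with the 60 t/s guess.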

I believe the overhead for MTP grows as you add more Sparks to a cluster.

Strange. The same model on 4 nodes of 4x3090 each, on the same network, had a big speedup at MTP 2 and 3. Maybe it is the extra processing needed.
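
One way to think about why MTP helps on one cluster and hurts on another: a minimal sketch using the usual speculative-decoding estimate, where the gain from accepted draft tokens is divided by the extra cost of each step. It assumes each draft token is accepted independently with the same probability. The acceptance rate and overhead values are hypothetical, not measured on any of the setups in this thread, and the function names are made up for illustration.

```python
def expected_tokens_per_step(accept_rate: float, num_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption: each of the num_draft speculative tokens is accepted
    independently with probability accept_rate, and one extra token is
    always produced by the verification pass itself.
    """
    if accept_rate >= 1.0:
        return float(num_draft + 1)
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


def mtp_speedup(accept_rate: float, num_draft: int, step_overhead: float) -> float:
    """Speedup over plain decoding.

    step_overhead is the extra time per step relative to a plain decode
    step (0.3 means each MTP step takes 30% longer).
    """
    return expected_tokens_per_step(accept_rate, num_draft) / (1.0 + step_overhead)


# Hypothetical values: 70% acceptance, MTP 2.
low_overhead = mtp_speedup(0.7, 2, 0.3)   # cheap extra step
high_overhead = mtp_speedup(0.7, 2, 1.5)  # expensive extra step

print(f"low overhead:  {low_overhead:.2f}x")
print(f"high overhead: {high_overhead:.2f}x")
```

With the same acceptance rate, the result goes from a clear gain to a slowdown once the per-step overhead gets large enough, which would be consistent with MTP helping on one setup and hurting on another.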