8 nodes?
Thanks for sharing! Will try this out.
We had issues with tp=8. We now focus on tp=2 and tp=4 setups only.
Hi, and thank you for your work. I got 397B INT4 working on 4 Sparks and a 4x100 Gbit MikroTik, starting at about 35 t/s. What is the limitation with using 8 with an appropriate switch? I am asking because I ran 397B on vLLM even at tp=16 using 16x3090, clustered in 8 nodes of 2x3090 each, linked by 100 Gbit cards.
Starting from the 4-node 397B FP8 recipe, without MTP I reached 32 t/s, which is very good for FP8. It suggests that if NVFP4 is optimized, 60 t/s should be possible given the memory bandwidth limitation. I don't know why MTP 2 slows it down when enabled.
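A rough back-of-the-envelope check of that 60 t/s figure, as a minimal sketch: it assumes decode is purely bound by reading the weights from memory, so tokens/s scales inversely with bits per weight. The function name `scaled_decode_rate` is made up for illustration, and the 4-bit figure ignores NVFP4 block-scale overhead, KV-cache reads, and interconnect latency, so it is an optimistic ceiling and not a prediction.

```python
def scaled_decode_rate(measured_tps: float, measured_bits: float, target_bits: float) -> float:
    """Estimate decode tokens/s at a different weight precision.

    Assumption: decode is entirely weight-bandwidth bound, so the rate
    is inversely proportional to the number of bits read per weight.
    """
    return measured_tps * measured_bits / target_bits


fp8_tps = 32.0  # measured above: 4 nodes, FP8, MTP disabled

# Idealized ceiling when going from 8-bit to 4-bit weights.
nvfp4_ceiling = scaled_decode_rate(fp8_tps, measured_bits=8, target_bits=4)
print(nvfp4_ceiling)  # 64.0
```

Under those assumptions the ceiling is 64 t/s, so about 60 t/s after overheads looks plausible.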
I believe there is greater overhead for MTP as you add more Sparks to a cluster.
Strange. The same model on 4 nodes of 4x3090 each, on the same network, had a big speedup at MTP 2 and 3. Maybe it is the extra processing needed.