Hello there, I got a DGX and am considering adding a second unit, but I'm wondering what the performance improvement would be.
I'm sure nothing here is standard and it varies by model, etc., but if, for instance, I'm using gpt-oss:20b and getting, say, 10 tokens per second, would adding a second unit speed that up noticeably? (I'm not expecting a linear gain.)
Is there any performance gain from using 2 cables vs 1 between the 2 units?
Hmm.. thanks for taking the time to reply. I think you just convinced me not to buy the second unit and return the first one :-). I still have a few more days to decide.
If you are just running 20b and don’t need any of the other features of the DGX Spark, then the RTX 5090 is the way to go.
That said, I did find gpt-oss-20b with maximum thinking on the 5090 to be slower overall than gpt-oss-120b with minimal thinking on the DGX Spark. Even though the tokens/sec were way faster on the RTX 5090, the model spent a lot more tokens thinking when you put it in high thinking mode.
Whether 20b/high beats or loses to 120b/low on accuracy isn't consistent; it depends on the nature of the questions you are asking.
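To make the tokens/sec vs. thinking-tokens trade-off concrete, here is a rough back-of-the-envelope sketch. The token counts and speeds below are made-up placeholders, not measurements from either machine, and `response_seconds` is just a hypothetical helper. It only shows how a faster card can still take longer end to end if the model generates many more thinking tokens.

```python
def response_seconds(thinking_tokens: int, answer_tokens: int,
                     tokens_per_second: float) -> float:
    """Generation time if every token (thinking or answer) costs 1/tokens_per_second."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return (thinking_tokens + answer_tokens) / tokens_per_second


# Hypothetical numbers for illustration only (NOT benchmarks):
# a fast GPU running a small model that thinks a lot ...
fast_small = response_seconds(thinking_tokens=8000, answer_tokens=500,
                              tokens_per_second=200.0)
# ... versus a slower box running a larger model that thinks very little.
slow_large = response_seconds(thinking_tokens=500, answer_tokens=500,
                              tokens_per_second=40.0)

print(f"fast GPU, high thinking: {fast_small:.1f} s")
print(f"slow box, low thinking:  {slow_large:.1f} s")
```

Plug in your own measured tokens/sec and typical token counts to see which setup actually answers sooner for your kind of questions. This ignores prompt processing time, which matters more for long prompts.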
I think we all want an RTX Pro 6000 MaxQ with 128GB for $4000, but alas, that’s why the 96GB RTX Pro 6000 is more expensive than two DGX Sparks.
With a properly set up dual-Spark cluster you can expect an almost 2x performance gain for dense models (the slow ones) and a smaller gain for sparse ones. Prompt processing performance scales better than inference.
Here is a compilation of my results. Some of these need retesting, as I was running with an old config, but you can get an idea. That’s using vLLM and two Sparks connected via a single QSFP112 cable.