I’ve been trying to reproduce the DGX Spark performance advertised here, but I can’t reproduce it on either the 3B or the 8B LLM with this exact guide. Instead of the advertised ~80k tokens per second, I get ~11k tokens per second on the 3B model, and instead of ~54k tokens per second, I get ~9k tokens per second on the 8B model. I have written up my current observations here in greater detail, including different backends such as Unsloth and different model sizes. I’m willing to retract my benchmarks, but the reality is that, at least for me, the DGX Spark is currently significantly slower than expected.
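For clarity on how I arrive at a tokens-per-second figure: total generated tokens divided by wall-clock time across a few runs. Here is a minimal sketch of that measurement; the `generate` callable and the stub below are placeholders, not the actual backend API:

```python
import time

def tokens_per_second(generate, prompt, n_runs=3):
    """Measure decode throughput as total generated tokens / total wall time.

    `generate` is any callable returning the generated token ids; in a real
    benchmark it would wrap a backend call (e.g. vLLM or llama.cpp bindings).
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Stub standing in for a real LLM backend (hypothetical, for illustration).
def fake_generate(prompt):
    return list(range(256))  # pretend we decoded 256 tokens

rate = tokens_per_second(fake_generate, "hello")
print(f"{rate:.0f} tok/s")
```

With a real backend, batching and prompt length heavily affect this number, which is why I tested multiple backends and model sizes before posting.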
Thank you for the details. I have passed this along to engineering for them to look at.
Please check out our guide for benchmarking different models with different backends: DGX Spark Performance FAQ