Hi everyone!
Over the past weeks, I've been working with @eugr on adding structured, reproducible "benchmark recipes" to our community Docker runtime for LLMs. @eugr also added new export formats to Llama-Benchy. Those are foundations for a better knowledge-sharing experience, but we still needed a common platform to publish our experiments.
The problem we keep seeing on DGX Spark threads is not a lack of experimentation. It's a lack of reproducibility, and of any index for the experiments people share.
Introducing https://spark-arena.com/: a community-driven LLM Performance Leaderboard for the Spark.
For every new model release, we all go through the same loop:
- Read the model card + docs
- Try different runtimes (vLLM / TensorRT-LLM / SGLang)
- Tune quantization (NVFP4, MXFP4, AWQ, etc.)
- Adjust --kv-cache-dtype, the attention backend, and memory utilization
- Experiment with multi-node configs
- Post partial flags in a thread
Weeks later, it becomes difficult to reconstruct:
- The exact CLI invocation
- The runtime backend versions
- The node topology
- The memory constraints
- The batching and concurrency parameters
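Capturing that information at benchmark time is cheap compared to reconstructing it weeks later. Below is a minimal, illustrative Python sketch of what such a capture step could look like; the `capture_run_metadata` helper and all field names are hypothetical, not the actual Spark Arena submission schema:

```python
import json
import platform
import sys
from importlib.metadata import version, PackageNotFoundError

def capture_run_metadata(cli_args, extra=None):
    """Snapshot the details that are hardest to reconstruct weeks later:
    the exact CLI invocation, runtime versions, and host info.
    (Hypothetical helper; not the Spark Arena schema.)"""
    meta = {
        "cli_invocation": " ".join(cli_args),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "runtime_versions": {},
    }
    # Record installed versions of common runtimes, if present.
    for pkg in ("vllm", "sglang", "tensorrt-llm"):
        try:
            meta["runtime_versions"][pkg] = version(pkg)
        except PackageNotFoundError:
            meta["runtime_versions"][pkg] = None
    if extra:
        # e.g. node topology, kv-cache dtype, batching and concurrency
        meta.update(extra)
    return meta

if __name__ == "__main__":
    snapshot = capture_run_metadata(sys.argv, {"kv_cache_dtype": "fp8"})
    print(json.dumps(snapshot, indent=2))
```

Dumping this JSON next to every benchmark run means the flags and versions travel with the results instead of living only in a forum thread.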
So we're formalizing this.
Spark Arena now supports:
• Structured benchmark submissions
• Full CLI + runtime flag capture
• Quantization + backend metadata
• Automated submission pipelines
• Comparable results across Spark owners
• "Recipes" that are reproducible end-to-end
• All integrated with our community tools
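To make the "recipe" idea concrete, here is a hedged sketch of what a reproducible recipe document and a mandatory-metadata check might look like. Every field name here is illustrative only and is not the actual Spark Arena schema; the point is that a submission carries enough structure to be validated and re-run:

```python
import json

# Hypothetical example of an end-to-end recipe submission.
recipe = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "runtime": {"backend": "vllm", "version": "0.6.3"},
    "quantization": "awq",
    "launch_cmd": "vllm serve meta-llama/Llama-3.1-8B-Instruct --kv-cache-dtype fp8",
    "topology": {"nodes": 1, "gpus_per_node": 1},
    "bench": {"concurrency": 8, "max_tokens": 512},
}

# A first guess at mandatory metadata; exactly the kind of list the
# community should help define.
MANDATORY = {"model", "runtime", "quantization", "launch_cmd", "topology"}

def missing_fields(submission):
    """Return the mandatory metadata keys a submission is missing."""
    return sorted(MANDATORY - submission.keys())

print(json.dumps(recipe, indent=2))
print("missing:", missing_fields(recipe))  # -> missing: []
```

A check like this is what turns "partial flags in a thread" into something another Spark owner can actually replay.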
The goal is to turn benchmark results into executable, searchable knowledge, not just screenshots or one-off throughput numbers.
Importantly, the data comes from real NVIDIA Developer Forum Spark owners running on their own Spark nodes, under real hardware constraints.
This is not lab-only data. It reflects real-world tuning tradeoffs from a community perspective.
We'd really like the community to engage:
- If you're benchmarking models, consider submitting your results.
- If you care about reproducibility, help define what metadata should be mandatory.
- If you've struggled to reproduce someone else's setup, tell us what was missing.
- If you've built internal benchmarking scripts, let's discuss integration.
The value of this platform scales with participation.
If we standardize how we share configs, we reduce duplicated work across the entire Spark ecosystem.
Feedback is welcome, especially from those pushing multi-node, high-concurrency, or aggressive quantization setups.
Let's make benchmarking on Spark composable, reproducible, and, most importantly, accessible to everyone.
Raphael Amorim