Introducing the Spark Arena

Hi everyone šŸ‘‹

Over the past weeks, I’ve been working with @eugr on adding structured, reproducible ā€œbenchmark recipesā€ to our community Docker runtime for LLMs. @eugr also added new export formats to Llama-Benchy. Those are the foundations of better knowledge sharing, but we still needed a common platform on which to publish our experiments.

The problem we keep seeing on DGX Spark threads is not a lack of experimentation. It’s the lack of reproducibility, and of any way to index the experiments people share.

Introducing https://spark-arena.com/: a community-driven LLM Performance Leaderboard for the Spark.

For every new model release, we all go through the same loop:

  • Read the model card + docs
  • Try different runtimes (vLLM / TensorRT-LLM / SGLang)
  • Tune quantization (NVFP4, MXFP4, AWQ, etc.)
  • Adjust --kv-cache-dtype, attention backend, memory utilization
  • Experiment with multi-node configs
  • Post partial flags in a thread

Weeks later, it becomes difficult to reconstruct:

  • The exact CLI invocation
  • The runtime backend versions
  • The node topology
  • The memory constraints
  • The batching and concurrency parameters
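All of that reconstructable metadata fits naturally into one structured record. As a sketch (the field names below are illustrative, not the actual Spark Arena schema), a submission could look like:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class BenchmarkRecipe:
    """Hypothetical submission record; fields mirror the metadata
    listed above, not any real Spark Arena API."""
    model: str
    runtime: str             # e.g. "vllm", "sglang", "trtllm"
    runtime_version: str     # release tag or exact commit SHA
    quantization: str        # e.g. "nvfp4", "mxfp4", "awq"
    cli: list[str]           # exact launch invocation, token by token
    nodes: int               # node topology (single node = 1)
    gpu_mem_utilization: float
    max_concurrency: int

recipe = BenchmarkRecipe(
    model="openai/gpt-oss-120b",
    runtime="vllm",
    runtime_version="13397841ab469cecf1ed425c3f52a9ffc38139b5",
    quantization="mxfp4",
    cli=["vllm", "serve", "openai/gpt-oss-120b",
         "--kv-cache-dtype", "fp8",
         "--gpu-memory-utilization", "0.90"],
    nodes=1,
    gpu_mem_utilization=0.90,
    max_concurrency=32,
)

# A record like this serializes cleanly for submission or search.
print(json.dumps(asdict(recipe), indent=2))
```

With the full CLI stored token by token, "what flags did you run?" stops being an archaeology exercise.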

So we’re formalizing this.

Spark Arena now supports:

• Structured benchmark submissions
• Full CLI + runtime flag capture
• Quantization + backend metadata
• Automated submission pipelines
• Comparable results across Spark owners
• ā€œRecipesā€ that are reproducible end-to-end
• All integrated with our community tools

The goal is to turn benchmark results into executable, searchable knowledge, not just screenshots or isolated throughput numbers.

Importantly, the data comes from real NVIDIA Developer Forum Spark owners running on their own Spark nodes, under real hardware constraints.

This is not lab-only data. It reflects real-world tuning tradeoffs from a community perspective.

We’d really like the community to engage:

  • If you’re benchmarking models, consider submitting your results.
  • If you care about reproducibility, help define what metadata is mandatory.
  • If you’ve struggled reproducing someone else’s setup, tell us what was missing.
  • If you’ve built internal benchmarking scripts, let’s discuss integration.

The value of this platform scales with participation.
If we standardize how we share configs, we reduce duplicated work across the entire Spark ecosystem.

Feedback is welcome, especially from those pushing multi-node, high-concurrency, or aggressive quantization setups.

Let’s make benchmarking on Spark composable, reproducible, and, most importantly, accessible to everyone.

Raphael Amorim

27 Likes

Massive shoutout to Raphael Amorim and @eugr for putting Spark Arena together and for turning what used to be a bunch of ad-hoc experiments into something consistent, shareable, and actually easy to reproduce. It’s a real game changer for everyone running and benchmarking LLMs on DGX Spark, and the way Spark Arena captures configs, flags, and end-to-end runs is exactly what this ecosystem needed.

One thing that would make it even more useful, especially when planning deployments on DGX Spark, would be richer per-model memory details in the docs and catalog. It would help a lot if each model entry included an approximate breakdown of system-memory usage, including:

  • Model weights size for each supported precision.
  • KV cache usage at a few common context lengths, for example:
    • KV cache @ 4K context.
    • KV cache @ 8K context.
    • KV cache @ 16K context.
  • A recommended minimum system‑memory footprint per request for those context sizes.
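For reference, those KV-cache numbers follow from a standard per-sequence formula: two tensors (K and V) per layer, sized by KV heads, head dimension, context length, and element width. A minimal sketch, using made-up model dimensions rather than any real model card:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, dtype_bytes: int = 2) -> int:
    """Per-sequence KV cache size: a K and a V tensor for every layer.
    dtype_bytes: 2 for FP16/BF16, 1 for an FP8 KV cache."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * dtype_bytes

# Illustrative dimensions only (not taken from any real model card):
layers, kv_heads, head_dim = 48, 8, 128

for ctx in (4096, 8192, 16384):
    gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx) / 2**30
    print(f"KV cache @ {ctx // 1024}K context: {gib:.2f} GiB per request")
```

The takeaway is that KV cache scales linearly with context length and concurrent requests, which is exactly why a per-model table at 4K/8K/16K would make capacity planning so much easier.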

This would be especially helpful for people like me who are considering buying a second DGX Spark and trying to figure out which models can run side by side, and what’s realistically possible to host at the same time on a single DGX Spark.

3 Likes

Cool sh1t! The vLLM version/commit ID used for the test would be nice, or the time of build, since we’re living on the ā€œbleeding edgeā€ of versions… ;-)

3 Likes

Thanks for the suggestions. This is a work in progress and we’re going to build this together. Keep the ideas coming. It’s only going to get better over time.

2 Likes

How do I add my Spark config or join your fork so that we can work on this together?

@cosinus we’re already working on it. @eugr is leading this front.

1 Like

Sign up on the website.

This is incredible. Great job.

Happy to start submitting benchmarks once approved.

1 Like

Great work! šŸ‘ This is an impressive initiative and it’s exciting to see the community coming together around it. Looking forward to contributing and seeing how it evolves.

1 Like

Just trying to spin up the recipes in the repo, specifically the GPT-OSS-120B recipe.

Using the command:

DGXspark (~/vllm_dev/spark-vllm-docker) $ ./run-recipe.sh openai-gpt-oss-120b --solo --setup

I am getting the following error:

I am sure that it is a simple dependency that I am missing.

Any help getting the recipe to run would be much appreciated.

Regards

Mark

=> ERROR [builder 8/15] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv --mount=type=cach 46.2s

[builder 8/15] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv --mount=type=cache,id=ccache,target=/root/.ccache cd flashinfer-cubin && uv build --no-build-isolation --wheel . --out-dir=/workspace/wheels -v:
0.160 DEBUG uv 0.9.24
0.160 DEBUG Acquired shared lock for /root/.cache/uv
0.160 DEBUG Project is contained in non-workspace project: /workspace/flashinfer
0.160 DEBUG Found workspace root: /workspace/flashinfer/flashinfer-cubin
0.160 DEBUG Adding root workspace member: /workspace/flashinfer/flashinfer-cubin
0.160 DEBUG No Python version file found in ancestors of working directory: /workspace/flashinfer/flashinfer-cubin
0.160 DEBUG Using request timeout of 30s
0.186 DEBUG Searching for Python >=3.8 in virtual environments, managed installations, or search path
0.186 DEBUG Searching for managed installations at /root/.local/share/uv/python
0.188 DEBUG Found cpython-3.12.3-linux-aarch64-gnu at /usr/bin/python3 (first executable in the search path)
0.188 DEBUG Using request timeout of 30s
0.188 DEBUG Not using uv build backend direct build of ., pyproject.toml does not match: The value for build_system.build-backend should be "uv_build", not "build_backend"
0.188 Building wheel…
0.189 DEBUG Project is contained in non-workspace project: /workspace/flashinfer
0.189 DEBUG No workspace root found, using project root
0.189 DEBUG Proceeding without build isolation
0.189 DEBUG Calling build_backend.build_wheel("/workspace/wheels", {}, None)
10.73 2026-02-11 22:29:38,367 - INFO - cubin_loader.py:81 - flashinfer.jit: Acquired lock for /workspace/flashinfer/flashinfer-cubin/flashinfer_cubin/cubins/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt
10.79 2026-02-11 22:29:38,432 - WARNING - cubin_loader.py:117 - flashinfer.jit: Downloading https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt: attempt 1 failed: 403 Client Error: Forbidden for url: https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt
10.79 2026-02-11 22:29:38,432 - INFO - cubin_loader.py:123 - flashinfer.jit: Retrying in 5 seconds…
15.88 2026-02-11 22:29:43,513 - WARNING - cubin_loader.py:117 - flashinfer.jit: Downloading https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt: attempt 2 failed: 403 Client Error: Forbidden for url: https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt
15.88 2026-02-11 22:29:43,513 - INFO - cubin_loader.py:123 - flashinfer.jit: Retrying in 10 seconds…
25.94 2026-02-11 22:29:53,575 - WARNING - cubin_loader.py:117 - flashinfer.jit: Downloading https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt: attempt 3 failed: 403 Client Error: Forbidden for url: https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt
25.94 2026-02-11 22:29:53,575 - INFO - cubin_loader.py:123 - flashinfer.jit: Retrying in 20 seconds…
46.01 2026-02-11 22:30:13,646 - WARNING - cubin_loader.py:117 - flashinfer.jit: Downloading https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt: attempt 4 failed: 403 Client Error: Forbidden for url: https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt
46.01 2026-02-11 22:30:13,647 - ERROR - cubin_loader.py:126 - flashinfer.jit: Max retries reached. Download failed.
46.01 Traceback (most recent call last):
46.01 File "<string>", line 11, in <module>
46.01 File "/workspace/flashinfer/flashinfer-cubin/build_backend.py", line 100, in build_wheel
46.01 _download_cubins()
46.01 File "/workspace/flashinfer/flashinfer-cubin/build_backend.py", line 33, in _download_cubins
46.01 download_artifacts()
46.01 File "/workspace/flashinfer/flashinfer/artifacts.py", line 193, in download_artifacts
46.01 cubin_files = list(get_subdir_file_list())
46.01 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
46.01 File "/workspace/flashinfer/flashinfer/artifacts.py", line 157, in get_subdir_file_list
46.01 checksums = get_checksums(cubin_dirs)
46.01 ^^^^^^^^^^^^^^^^^^^^^^^^^
46.01 File "/workspace/flashinfer/flashinfer/artifacts.py", line 135, in get_checksums
46.01 with open(checksum_path, "r") as f:
46.01 ^^^^^^^^^^^^^^^^^^^^^^^^
46.01 FileNotFoundError: [Errno 2] No such file or directory: '/workspace/flashinfer/flashinfer-cubin/flashinfer_cubin/cubins/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt'
46.01 Created build metadata file with version 0.6.1
46.01 Downloading cubins to /workspace/flashinfer/flashinfer-cubin/flashinfer_cubin/cubins…
46.19 × Failed to build /workspace/flashinfer/flashinfer-cubin
46.19 ├─▶ The build backend returned an error
46.19 ╰─▶ Call to build_backend.build_wheel failed (exit status: 1)
46.19 hint: This usually indicates a problem with the package or the build
46.19 environment.
46.19 DEBUG Released lock at /root/.cache/uv/.lock


ERROR: failed to build: failed to solve: process "/bin/sh -c cd flashinfer-cubin && uv build --no-build-isolation --wheel . --out-dir=/workspace/wheels -v" did not complete successfully: exit code: 2
Error: Failed to build container

It’s a temporary glitch with the NVIDIA servers - try again later and it will likely succeed.
Before trying again, you can check the URL manually:

curl --silent --head --fail --location https://edge.urm.nvidia.com/artifactory/sw-kernelinferencelibrary-public-generic-local/75d477a640f268ea9ad117cc596eb39245713b9e/fmha/trtllm-gen/checksums.txt

If you see HTTP/1.1 200 OK, the endpoint is working again; otherwise it is still failing.
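If you’d rather script the check than poll by hand, here’s a small Python sketch that retries with the same doubling backoff the build log shows (5 s, 10 s, 20 s). The function name and the injectable `do_head` hook are my own, purely for illustration:

```python
import time
import urllib.request

def probe_with_backoff(url: str, attempts: int = 4, base_delay: float = 5.0,
                       do_head=None) -> bool:
    """HEAD-check a URL, retrying with doubling delays between attempts,
    mirroring the 5s/10s/20s retry schedule in the build log above.
    `do_head` is injectable for testing; the default issues a real request."""
    if do_head is None:
        def do_head(u):
            req = urllib.request.Request(u, method="HEAD")
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.status == 200
    delay = base_delay
    for attempt in range(attempts):
        try:
            if do_head(url):
                return True
        except Exception:
            pass  # network errors and non-2xx statuses count as a failed attempt
        if attempt < attempts - 1:
            time.sleep(delay)
            delay *= 2
    return False
```

Run it against the checksums.txt URL and it returns True as soon as the artifact server starts answering again.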

Although 403 is a bit interesting - where are you located geographically?

1 Like

Tremendously valuable. Thank you.

Already approved your access, Preston.

This is a great idea, and I’m sure it will be a valuable resource.

I do feel that there needs to be more context in the recipes, as the whole PyTorch-based platform seems to be pretty much a can of worms on the DGX Spark.

Now, I like tinkering as much as anyone, but fighting vLLM and SGLang bugs just to get something running is getting a bit tiring. This issue has been raised elsewhere on this forum, but I, for one, would like to actually get some real work done.

Here’s a quick example. I’ve been looking at Qwen3-Coder-Next as a model to power some OpenCode work. I downloaded the unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL model and built a llama.cpp engine to run it and got something running that’s delivering around 30 t/s without too much trouble. A good, productive result.

vLLM, on the other hand, is a different story. Using a fresh, standard build from spark-vllm-docker together with the recipe for Qwen/Qwen3-Coder-Next-FP8 seems to be pretty much a waste of time. It takes forever to load, whereas llama.cpp is pretty sprightly. Then, on the first chat from OpenCode, it goes down in a screaming heap. I tried changing various options, but the only thing that worked was removing the --enable-prefix-caching option.

OK, so the model stays up now, but once the first chat is started from OpenCode, it just sits there with its fingers in its ears. Again, lowering the context size didn’t seem to help, unless it needs to be so small as to be useless.

So the net result is that vLLM is an unusable platform for this model, whereas I can get a very usable result in way less time using llama.cpp. At this point, there’s no contest.

I really hope the PyTorch-based solutions start addressing these concerns soon.

This model is kind of a unique case, though. Most other models work out of the box in vLLM and perform better, especially at prompt processing.

Now, I don’t have anything against llama.cpp; I use it too. But it has its use cases and vLLM has its own, and if you have a Spark cluster, then vLLM/SGLang/TRTLLM are your only viable options anyway.

2 Likes

I agree, but at the moment, for this model, a working setup on a single system under llama.cpp beats a vLLM setup that can’t run at all.

I guess my point was that the recipes might benefit from some feedback on their usability for real-world uses, not just benchmarks.

Well, the problem here is that vLLM is a moving target. To use the latest and greatest, you pretty much need to compile from main, and sometimes things break. There has been a stream of commits recently that broke a lot of stuff. Some of those were fixed very quickly, partly because I was testing a new build right after the bad commit landed and was able to discuss it with the guy who pushed it, and he reverted it.

In other cases, we are not so lucky.

This particular Qwen issue was introduced sometime after February 9th, and there is an open ticket for it: [Bug]: Qwen Coder Next prefix caching Ā· Issue #34361 Ā· vllm-project/vllm Ā· GitHub

The same commit probably broke GLM-4.7-Flash. I don’t have an open ticket, but will open one tomorrow if it’s not fixed by then.

For now, you can rebuild with this commit that is known to work:

./build-and-copy.sh -t vllm-node-20260209  -c --vllm-ref 13397841ab469cecf1ed425c3f52a9ffc38139b5

(Remove -c if you don’t have a cluster)

That’s the price we pay for having a big and active development community.

llama.cpp seems to be a bit more tightly controlled, but it’s not like things don’t break there too.
We’ve all had broken builds when they introduced the Blackwell optimizations for gpt-oss, and there has been a performance regression on Strix Halo under ROCm 7.10 since December.

2 Likes

Very fun, very impressive, and very useful.

I’m looking forward to contributing and benefiting from this!

I like the idea, @raphael.amorim @eugr.
Thanks for keeping me in the loop, and thank you guys for all your contributions!

4 Likes

Thanks for all your work!

As much as I’d love to spend more time with this technology, the pressures of work mean I need to find workable solutions, in this case choosing llama.cpp over vLLM.

However, thanks to your knowledge, I’m able to run Qwen3-Coder-Next with pretty good performance on my Spark. Hoping to see the cluster problem solved in the not too distant future.

This all goes back to my original point. I think the Spark Arena solution would be enhanced by a model log that tracks changes and updates to help get the models running. In this instance, the model log could have reflected that there was a breaking change in vLLM and that building from a particular commit would solve the problem. When the problem is actually fixed, the log can be updated to reflect the current status. This added knowledge would enhance the usefulness of the recipes in this fast-changing world.
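To make the idea concrete, a model-log entry for exactly this incident could be a small structured record. All field names here are my own invention, just to illustrate; only the issue number, commit, and date come from the thread above:

```python
import json

# Hypothetical model-log entry for the Qwen3-Coder-Next regression
# discussed above. The schema is illustrative, not Spark Arena's.
entry = {
    "model": "Qwen/Qwen3-Coder-Next-FP8",
    "runtime": "vllm",
    "status": "broken",
    "symptom": "hangs on first request when prefix caching is enabled",
    "introduced_after": "2026-02-09",
    "upstream_issue": "vllm-project/vllm#34361",
    "workaround": "build from commit 13397841ab469cecf1ed425c3f52a9ffc38139b5",
}

print(json.dumps(entry, indent=2))
```

When the upstream fix lands, flipping `status` and clearing `workaround` would keep the recipe trustworthy without anyone re-reading a whole forum thread.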