Sparkrun - central command with tab completion for launching inference on Spark Clusters

Very, very scary. It’s crazy and ironic that the leaked litellm credentials are being blamed on a leak in Trivy, which is a very commonly used security tool.

sparkrun will now pin particular versions of all dependencies to reduce risk of supply chain attacks that may affect sparkrun. Next release is coming this week and it’s a major release.

Was sparkrun affected by the liteLLM infected release?

Unfortunately, sparkrun’s version was unpinned, so it entirely depends on when you launched the proxy. There was a window of a few hours when it would’ve been at risk if you freshly launched the proxy during that window.

Next version of sparkrun pins everything and also has security against shell injection attacks in recipes.
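To illustrate the kind of problem shell injection protection addresses, here is a minimal sketch of quoting an untrusted recipe value before it reaches a shell. This is not sparkrun's actual implementation; the helper names quote_arg and serve_banner and the sample value are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only; not sparkrun's actual implementation.

# quote_arg (hypothetical helper): wrap a value in single quotes and
# rewrite each embedded ' as '\'' so the shell treats it as one literal word.
quote_arg() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# serve_banner (hypothetical): builds a command string from an untrusted
# recipe field and evaluates it, the way a naive launcher might.
serve_banner() {
  eval "echo serving $(quote_arg "$1")"
}

# Without quoting, the part after ';' would run as a second command.
# With quoting, it is printed as plain text.
serve_banner 'my-model; echo INJECTED'
# prints: serving my-model; echo INJECTED
```

Passing arguments as separate words (rather than building a command string at all) avoids the problem entirely where that is possible; quoting is the fallback when a string has to cross a shell boundary such as SSH.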

sparkrun has been updated. This is a big update (including transition to next minor version, 0.2.x).

First and foremost, this release marks the official transition of sparkrun to being part of the spark-arena organization. The git repo is now at: https://github.com/spark-arena/sparkrun to reflect sparkrun’s position as an effort for the community.

Beyond that, lots of changes:

  • Documentation Revamp
  • Spark Arena Integration (integrated login & benchmarking)
  • Security (work on shell injection protection for recipes; pinning all dependency versions; non-root user and non-privileged containers by default)
  • Fixes for cross-platform cache paths
  • Additional commands to enable automation and external use of sparkrun for orchestration
  • Addition of systemd service export
  • Tighter integration with eugr-vllm-docker
  • Intended to align with new spark-arena images for eugr vllm and llama.cpp (sglang coming soon)
  • Ability to configure transfer interface preferences as part of cluster configuration
  • Lots of bug fixes and internal architecture improvements
  • New setup wizard for better installation/setup experience for new users

I’ll post more shortly about how to get started with some of the new functionality.

@dbsci Just updated sparkrun (awesome project, thank you!), and the version was updated to 0.2.6:

dgx-spark:~$ sparkrun update
Checking for sparkrun updates (current: 0.2.3)…
sparkrun updated: 0.2.3 → 0.2.6

Updating recipe registries…
Updating 6 registries…
Updating sparkrun-testing… done
Updating sparkrun-transitional… done
Updating official… done
Updating experimental… done
Updating eugr… done
Updating community… done
6 registries updated.
dgx-spark:~$

The releases page (Releases · spark-arena/sparkrun · GitHub) shows 0.2.5 as the most recent. Am I missing something?

No. Sometimes I don’t end up listing the release as a github release. v0.2.6 was a quick fix to handle some issues that’ll come up for some edge cases. I quickly pushed out the patch and didn’t mark the release.

There is a tag, PR, and all the other things there to mark it – just not the “release” itself.

So you’re not missing anything.

Awesome, thank you for the explanation!

How easy is sparkrun to uninstall and remove its changes (even after running the setup wizard)?

Why would you ever want to do that???

Well it does a bunch of stuff and it depends on how much you do in the wizard – you can also choose to say no for lots of steps if you prefer how you did it yourself – but some of the setup wizard steps are pretty much a crystallization of experience of what typically helps people who are new to the spark, so you might want those even if you don’t use sparkrun…

Anyway, it’s a fair question, so I’ve written a more detailed explanation below. And FYI, because of your questions, I’ve also started on an uninstall option so that sparkrun can remove itself more thoroughly. That’ll probably come in the next release or so.


It installs itself as a uv tool, so uv tool uninstall sparkrun to remove it.

It creates two metadata directories:
~/.config/sparkrun for configuration stuff
~/.cache/sparkrun for cache stuff

Tab autocompletion adds this to ~/.bashrc:

# sparkrun tab-completion
eval "$(_SPARKRUN_COMPLETE=bash_source sparkrun)"

so you should remove those two lines to get rid of tab completion.
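Putting the three sparkrun-specific pieces together, a manual removal could look roughly like the sketch below. The function names are hypothetical, and it assumes the completion lines in ~/.bashrc appear exactly as shown above. It does not touch anything else the wizard may have configured.

```shell
#!/bin/sh
# Sketch of a manual removal based on the steps described above.
# Function names are hypothetical; nothing runs until you call them.

# Strip the two completion lines from a bashrc-style file.
# Assumes they appear exactly as shown; keeps a backup at <file>.bak.
remove_sparkrun_completion() {
  rc_file="$1"
  [ -f "$rc_file" ] || return 0
  cp "$rc_file" "$rc_file.bak"
  grep -vxF \
    -e '# sparkrun tab-completion' \
    -e 'eval "$(_SPARKRUN_COMPLETE=bash_source sparkrun)"' \
    "$rc_file.bak" > "$rc_file" || true   # grep exits 1 if no lines remain
}

uninstall_sparkrun_manually() {
  uv tool uninstall sparkrun                               # remove the uv tool
  rm -rf "$HOME/.config/sparkrun" "$HOME/.cache/sparkrun"  # config and cache
  remove_sparkrun_completion "$HOME/.bashrc"               # tab completion
}
```

Call uninstall_sparkrun_manually once you are sure, then open a new shell so the completion hook is no longer loaded.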

The other changes the wizard makes are marginally more complicated to remove because it’s part of basic cluster setup and isn’t necessarily specific to sparkrun:

  • SSH meshing (it saves ssh keys among node members) – you can remove/reduce authorized keys list (~/.ssh/authorized_keys for cluster user) but you probably want this in your cluster.
  • It adds user to the docker group if it’s not already there – you probably want that.
  • It adds targeted sudoers entries for clearing page cache and fixing HF cache dir permissions – relatively low risk of abuse, which is why it uses very targeted sudoers entries instead of broadly giving sudo rights (EDIT: /etc/sudoers.d/sparkrun-*; always uses sparkrun- prefix on sudo rules for traceability)
  • CX7 configuration – it’ll either create or edit the netplan config if you want it to, and it’ll only recommend changes if your setup doesn’t meet guidelines
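If you want to see which of these changes are present on a node, a read-only check along these lines may help. It is a sketch, not part of sparkrun: the function names are hypothetical, and /etc/netplan is assumed to be where your netplan config lives (the standard location).

```shell
#!/bin/sh
# Read-only checks; nothing here modifies the system.

# in_group (hypothetical helper): succeeds if the current user is in group $1.
in_group() {
  id -nG | tr ' ' '\n' | grep -qxF "$1"
}

audit_wizard_changes() {
  if in_group docker; then
    echo "docker group: member"
  else
    echo "docker group: not a member"
  fi

  echo "sparkrun sudoers entries:"
  ls /etc/sudoers.d/sparkrun-* 2>/dev/null || echo "  none visible (may need sudo to list)"

  echo "authorized SSH keys:"
  if [ -f "$HOME/.ssh/authorized_keys" ]; then
    # Count non-empty lines (roughly one per key).
    grep -c . "$HOME/.ssh/authorized_keys" || true
  else
    echo "  no authorized_keys file"
  fi

  echo "netplan config files:"
  ls /etc/netplan/ 2>/dev/null || echo "  none found"
}
```

Running audit_wizard_changes before and after the wizard gives a simple before/after picture of what changed.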

As an alternative to the wizard, you can also install with uvx sparkrun setup install, which will do the uv tool install, set up tab completion, and put initial files in the config/cache directories, but will not direct you to the wizard, so you’ll have full control over everything else. Undoing the uv tool install, removing tab completion, and removing the cache/config directories is then a relatively straightforward and complete removal.

So it depends why you’re asking, but I’ve taken care to keep the footprint relatively minimal (e.g. targeted sudoers entries and not blanket sudo access). The other points are essentially just applying best practices and support. Outside of the initial installation, everything in the wizard is technically optional, but if you skip a step you’re responsible for setting it up yourself.

It’s partially my own paranoia, but I am thinking back to my experience with other frameworks like Conda, where you can find creative ways of messing up an installation, like running the setup wizard twice or having something in the configuration change after an update to the machine or framework.

In my experience small changes sneak up on you. So I care about understanding them.

Totally fair. I started with “why would you want to…” in a light joking way :-)

Also you can run the setup wizard multiple times here – it’s meant to guide you through stuff and generally its changes are idempotent – like there is no harm in trying to add yourself to docker group 50x – you’ll only end up added 1x.

Edit: in fact, I would recommend people run the wizard again if they were adding more nodes or stuff like that – because it automates the process – and any steps that are “redundant” to do, are performed in an idempotent way such that there is no harm in running it again.
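As a small illustration of what idempotent means here, the sketch below appends a line to a file only if it is not already there, so running it once or fifty times leaves the same result. This is not the wizard's code; ensure_line is a hypothetical helper.

```shell
#!/bin/sh
# ensure_line (hypothetical helper): append LINE to FILE only if an
# identical line is not already present, so repeated runs change nothing.
ensure_line() {
  file="$1"
  line="$2"
  touch "$file"
  grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, calling ensure_line "$HOME/.bashrc" 'eval "$(_SPARKRUN_COMPLETE=bash_source sparkrun)"' repeatedly would leave exactly one copy of the completion line.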

(And I am adding uninstall since you brought it up as well – the wizard will keep track of what it’s done and then you’ll be able to uninstall against that record.)

sparkrun’s core functionality tries to stay contained to its metadata and cache directories, but because it’s an orchestration tool, it does have to touch other things as part of setup. Once set up (either manually or via the wizard), it basically keeps to itself. You could essentially “factory reset” sparkrun by deleting the cache and metadata directories.

Unrelated, but for some reason the site “sparkrun.dev” is blocked by my DNS provider. It seems Bezeq BCyber blacklisted your site for some reason 🤷‍♂️, can still get to it with other DNSs though.

That’s really weird… I can’t really imagine why… glad you can get to it otherwise because I have a lot more docs on the site – github has README and a few things but most docs are on the website.

I’ve had this in the past because of the rule which blocks newly registered domains.

Hi, I’m running sparkrun and hitting a build failure during the Docker image build step. The apt install seems to fail because several package versions aren’t available on ports.ubuntu.com.

The build downloads 457 MB successfully but then fails with:

E: Failed to fetch http://ports.ubuntu.com/.../python3-wheel_0.42.0-2_all.deb
E: Failed to fetch http://ports.ubuntu.com/.../python3-pip_24.0+dfsg-1ubuntu1.3_all.deb
E: Failed to fetch http://ports.ubuntu.com/.../vim_9.1.0016-1ubuntu7.10_arm64.deb
E: Failed to fetch http://ports.ubuntu.com/.../libibverbs-dev_50.0-2ubuntu0.2_arm64.deb

The host machine can reach ports.ubuntu.com. The packages just don’t appear to exist at those exact versions for ARM64 on my Ubuntu release.

Error: RuntimeError: eugr container build failed (exit 1)

Is this a known issue? Is there a workaround?

I used “sparkrun run @eugr/gemma4-26b-a4b”, but it happens with any @eugr recipe.

I haven’t come across that. The problem is related to the build step and specific to building the spark-vllm-docker image.

One thing you can try as an alternative is to use:

sparkrun run @eugr/gemma4-26b-a4b --image "ghcr.io/spark-arena/dgx-vllm-eugr-nightly-tf5:20260406"

The --image flag overrides the image in the recipe with a fixed, specific image version. The Spark Arena dgx-vllm-eugr images are built to stay up to date with the current spark-vllm-docker. Typically, building the images locally is the faster way to get the latest version; however, that’s obviously not the case if it’s not working for you at all. This should hopefully bypass the build and let you download a prebuilt image from our GitHub container registry.

Also, I updated sparkrun and now the recipes that used to work fail for me:

sparkrun run @sparkrun-transitional/qwen3.5-35b-a3b-fp8-sglang
sparkrun v0.2.20



Runtime:   sglang
Image:     scitrera/dgx-spark-sglang:0.5.9-dev1-329817e2-t5
Model:     Qwen/Qwen3.5-35B-A3B-FP8
Mode:      solo
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.

VRAM Estimation:
Model dtype:      fp8
Model params:     35,953,925,552
KV cache dtype:   bfloat16
Architecture:     40 layers, 2 KV heads, 256 head_dim
Model weights:    33.48 GB
KV cache:         20.00 GB (max_model_len=262,144)
Tensor parallel:  1
Per-GPU total:    53.48 GB
DGX Spark fit:    YES

GPU Memory Budget:
gpu_memory_utilization: 80%
Usable GPU memory:     96.8 GB (121 GB x 80%)
Available for KV:      63.3 GB
Max context tokens:    829,886
Context multiplier:    3.2x (vs max_model_len=262,144)

Hosts:     default cluster ‘mylab’
Target:  127.0.0.1

[1/6] Preparing
done (0.0s)
[2/6] Building image — skipped (no builder)
[3/6] Distributing resources
SSH script ← 127.0.0.1 FAILED rc=255 (0.1s): dalsp@127.0.0.1: Permission denied (publickey,password).
Checking container image on 1 host(s)
SSH cmd ← 127.0.0.1 FAILED rc=255 (0.1s): dalsp@127.0.0.1: Permission denied (publickey,password).
SSH script ← 127.0.0.1 FAILED rc=255 (0.1s): dalsp@127.0.0.1: Permission denied (publickey,password).
Failed to ensure Image 'scitrera/dgx-spark-sglang:0.5.9-d

I get this with what you suggested too now

I think I managed to break your program :P
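As an aside, the VRAM figures in the log above can be reproduced with simple arithmetic. The sketch below rests on two assumptions that are not stated in the log and are not taken from sparkrun's source: that "GB" means 2^30 bytes, and that the KV cache stores one key and one value vector per layer, per KV head, per token. The function name estimate_vram is hypothetical.

```shell
#!/bin/sh
# Recompute the VRAM figures from the log.
# Assumptions (not from sparkrun source): GB = 2^30 bytes; KV cache holds
# one K and one V vector per layer, per KV head, per token.
estimate_vram() {
  awk 'BEGIN {
    gib = 1024 * 1024 * 1024
    params = 35953925552; weight_bytes = 1        # fp8: 1 byte per parameter
    layers = 40; kv_heads = 2; head_dim = 256
    kv_bytes = 2                                  # bfloat16: 2 bytes per element
    max_model_len = 262144
    total_gb = 121; utilization = 0.80

    kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # bytes, K + V
    weights = params * weight_bytes / gib
    kv_cache = kv_per_token * max_model_len / gib
    usable = total_gb * utilization
    avail_kv = usable - weights
    max_tokens = int(avail_kv * gib / kv_per_token)

    printf "weights=%.2f GB\n", weights
    printf "kv_cache=%.2f GB\n", kv_cache
    printf "per_gpu_total=%.2f GB\n", weights + kv_cache
    printf "usable=%.1f GB\n", usable
    printf "available_for_kv=%.1f GB\n", avail_kv
    printf "max_context_tokens=%d\n", max_tokens
    printf "context_multiplier=%.1fx\n", max_tokens / max_model_len
  }'
}
```

Under those assumptions, estimate_vram gives 33.48 GB of weights, a 20.00 GB KV cache, 53.48 GB per GPU, 63.3 GB available for KV, 829886 max context tokens, and a 3.2x multiplier, matching the log. So the memory estimate itself is fine and the failure is the SSH step.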

Is your OS username different than the “cluster” username? If so, you should rerun sparkrun setup ssh to enable SSH to self. I know that sounds ridiculous, but basically you need to authenticate if the username is different.
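A quick way to confirm whether key-based SSH to self works for the cluster user is sketched below. check_ssh_to_self is a hypothetical helper, not a sparkrun command, and the SSH_BIN variable exists only so the ssh binary can be swapped out for testing. The username comes from the log above; substitute your own cluster user.

```shell
#!/bin/sh
# check_ssh_to_self (hypothetical helper).
# $1 = cluster username, $2 = host (defaults to 127.0.0.1 as in the log).
check_ssh_to_self() {
  user="$1"
  host="${2:-127.0.0.1}"
  # BatchMode=yes makes ssh fail instead of prompting for a password,
  # so success here means key-based login works.
  if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true; then
    echo "key-based SSH to $user@$host works"
  else
    echo "key-based SSH to $user@$host failed; try rerunning: sparkrun setup ssh"
    return 1
  fi
}
```

For example, check_ssh_to_self dalsp should succeed without a password prompt once SSH to self is configured.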

I’ll keep you posted, but I think I might have figured it out: if I create another cluster with the setup wizard, it seems to run. The previous profile probably got messed up during the update; perhaps using “uv tool update” wasn’t the best idea.