Sparkrun - central command with tab completion for launching inference on Spark Clusters

I do patch a few other steps in there, including a registry update, when you run sparkrun update.

Most of the time uv tool upgrade should be OK, but occasionally, if there are migration steps, etc., it might be better to use the official update.

The wizard also patches SSH to self to work (if necessary) and things like that, so good if that works. (Most people don’t need SSH to self… sparkrun will try to avoid SSH to self under many circumstances, but it’s actually needed for “cross-user” scenarios, so it is used in some cases.)

Note that the wizard is designed to be approximately idempotent – meaning in this case that you can run it repeatedly. So if something isn’t working or you just want to verify your configuration is good or whatever, you should be able to safely run it again – and it’ll change things if change is needed but otherwise… it won’t.
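To illustrate what "approximately idempotent" means here, this is a minimal sketch of an idempotent configuration step. It is not sparkrun's code; the helper name ensure_line and the idea of appending a line to a config file are assumptions chosen only for illustration.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed.

    Hypothetical helper for illustration, not part of sparkrun.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already configured: leave the file untouched
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Calling it twice with the same arguments changes the file the first time and does nothing the second time, which is the behavior described above for re-running the wizard.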

When configuring a cluster with the wizard, can we pass mDNS names as hostnames, or should they be IPs?

I used sparkrun yesterday for the first time when I set up my 2-node cluster. Brilliant piece of software so thanks for developing it.

I would love to see two diagnostic features added to it:

It should work with any address that would work for SSH, so if the DNS resolves on your spark then it should be fine to use names – but DNS stuff, especially mDNS, is actually kind of brittle, so I would recommend IP addresses. (And then even better if static IPs or DHCP reservations are in place so that the IP won’t change.)
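If you want a quick check of whether a name resolves on a given machine before putting it in the cluster config, something like the sketch below may help. It only uses Python's standard resolver, so whether .local names resolve depends on how mDNS is set up on that host. The function name resolve_host is made up for this example.

```python
import socket


def resolve_host(name: str) -> list[str]:
    """Return the sorted, de-duplicated addresses that `name` resolves to.

    An empty list means the local resolver could not resolve the name.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address string.
    return sorted({info[4][0] for info in infos})
```

Run it on the machine that will launch sparkrun, for example resolve_host("spark-1.local") (a made-up name). An empty list suggests that name would not work for SSH from there either.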

Glad it’s helpful to you. Those are good suggestions! I’ll put them on the list.

Love sparkrun. I used it directly on my cluster and noticed that the first node began to swap as it handled the extra work. I decided to migrate to a non-Spark node to handle the management, and now I have an odd bug that I cannot get enough verbosity on to understand myself.

This is a model/setup I have used a lot already; the only difference is that I now kick it off from my NAS.

Distribution mode: delegated (image=sparkrun-eugr-vllm, model=cyankiwi/MiniMax-M2.7-AWQ-4bit, hosts=2)
Checking container image on 2 host(s)
Container image up-to-date on all 2 host(s)
Syncing model to 2 host(s)
SSH script ← 172.30.30.25 FAILED rc=1 (1.6s): downloading uv 0.11.7 aarch64-unknown-linux-gnu
Ignored error while writing commit hash to /tank/apollo11/.cache/huggingface/hub/models--cyankiwi--MiniMax-M2.7-AWQ-4bit/refs/main: [Errno 13] Permissio
Failed to ensure Model 'cyankiwi/MiniMax-M2.7-AWQ-4bit' on head 172.30.30.25
Error: Model distribution failed on: 172.30.30.25, 172.30.30.26

I am technically re-using the existing HF cache, which contains the files, but the error is on my NAS, which I assume is due to the calculation of what is present locally versus on the Sparks. And yes, this is ZFS, though nothing special in terms of ownership, same username, etc.

Any ideas on how to get better clarity to diagnose this further?

You can get a lot more verbosity with sparkrun -vvv run <recipe> <options> (each v increases the verbosity level). Three is the max level and it’s REALLY verbose. One or two v’s is more reasonable in general. The default is fairly sparse on details since, if all works according to plan, it’s just noise.

You can use sparkrun run <recipe> <options> --collect-diagnostics "diag.log" and that’ll collect all of the debug logs plus other data. (It basically writes all of the extra verbose data to a file instead of stdout, plus additional data that is typically useful for diagnosis.) If you want, you can submit the diagnostics log as an attachment to a GitHub issue (Issues · spark-arena/sparkrun · GitHub) and I can try to review and guide you specifically (instead of general forum chat).

I assume the issue is that you probably need to configure one or both of ssh user or cache directory as part of the cluster configuration.

sparkrun cluster update --help will give you more details on the options.
sparkrun cluster inspect <clusterName> will show you the effective configuration of a cluster.

As long as we’re effectively using the same user and cache directory, it should work out the same way, but as of now, it might not auto-determine the cache directory properly for the cluster – you might need to explicitly specify that as part of the cluster configuration. Once properly configured, it should just work from then on.

Note: you can also configure swappiness (e.g. see: SwapFaq - Community Help Wiki) and set it to 1. That should also help reduce eagerness to swap. I’ve been considering baking swappiness configuration into sparkrun but wasn’t sure if I should or not. Obviously if there is a lot of memory pressure, then swapping can occur, but the default swappiness value of 60 means that the system is much more likely to swap even without excessive pressure.
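For anyone who wants to check the current value first, here is a small sketch. It assumes a Linux host, where the kernel exposes the setting at /proc/sys/vm/swappiness. The function names are made up for this example, and the hint text only mentions the standard sysctl way of changing the value; nothing here is sparkrun functionality.

```python
from pathlib import Path

# Standard Linux location of the setting; kept as a parameter so the
# function can be pointed at another file when testing.
SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path: str = SWAPPINESS_PATH) -> int:
    """Read the current vm.swappiness value (a plain integer in the file)."""
    return int(Path(path).read_text().strip())


def swappiness_hint(value: int, target: int = 1) -> str:
    """Describe whether the value is already at the target, and how to lower it."""
    if value <= target:
        return f"vm.swappiness={value}: already at or below {target}"
    return (
        f"vm.swappiness={value}: to lower it, run "
        f"'sudo sysctl vm.swappiness={target}' and add "
        f"'vm.swappiness={target}' to a file under /etc/sysctl.d/ to persist it"
    )
```

On a Spark node, print(swappiness_hint(read_swappiness())) shows where you stand. The sysctl command takes effect immediately, while the /etc/sysctl.d/ entry keeps it across reboots.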

I also recommend running sparkrun setup clear-cache --save-sudo to give sparkrun sudo rights to clear the page cache, which can help sometimes. The page cache sometimes grows, especially on the node doing a lot of work (like the head node / local sparkrun node) from working on models/containers/etc. If sparkrun has permission, it’ll clear the page cache for you. If you configured via the wizard, then it should’ve prompted you to do that already – but figured I’d mention it just in case.

And I guess I forgot to mention sparkrun setup fix-permissions, which can specifically be used to reset the owner of cache files; however, that has been less necessary since the v0.2.x line of sparkrun, which switched to not using root as the user within containers by default. The key thing to look at there is which user/UID owns the files on the NAS, and to compare that to the SSH user that is trying to access the cache files.
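A rough way to do that comparison by hand is sketched below. It builds the model folder in the same layout that appears in the error log above (hub/models--org--name under the HF cache directory) and reports who owns it versus who is running the script. The function names are made up for this example. Run it as the same user that sparkrun uses for SSH on the machine that reports the permission error.

```python
import os
from pathlib import Path


def hf_model_dir(hf_cache: str, repo_id: str) -> Path:
    """Build the hub folder for a model: <cache>/hub/models--org--name."""
    return Path(hf_cache) / "hub" / ("models--" + repo_id.replace("/", "--"))


def ownership_report(path: Path) -> dict:
    """Compare the owner of `path` with the user running this script."""
    st = path.stat()
    return {
        "path": str(path),
        "owner_uid": st.st_uid,
        "my_uid": os.getuid(),
        "same_owner": st.st_uid == os.getuid(),
        # os.access checks the real UID/GID against the file's permissions.
        "writable": os.access(path, os.W_OK),
    }
```

For the case in this thread, that would be ownership_report(hf_model_dir("/tank/apollo11/.cache/huggingface", "cyankiwi/MiniMax-M2.7-AWQ-4bit")). If same_owner or writable comes back False, the [Errno 13] in the log is the expected result.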

Bingo! sparkrun wants to map the NAS’s naming scheme for the HF cache. Thanks for this command, I didn’t know it existed.

Yeah it was added as sort of a “beta” test, but I think it’s super useful and going to stick around, so I should promote it to be fully visible in docs, etc.

FYI to others – this is what the output looks like. It basically shows you effective configuration information as well as disk space details for model caches.

drew@spark-2918:~$ sparkrun cluster inspect
Cluster Configuration:
  cluster:            sparks25
  ssh_user:           (default)
  transfer_mode:      auto (resolved to: local)
  transfer_interface: auto (resolved to: cx7)
  topology:           switch
  hosts:              10.24.11.13, 10.24.11.14, 10.24.11.16, 10.24.11.17

NCCL Environment (head: 10.24.11.13):
  GLOO_SOCKET_IFNAME=enP7s7
  MN_IF_NAME=enP7s7
  NCCL_CROSS_NIC=1
  NCCL_IB_DISABLE=0
  NCCL_IB_GID_INDEX=3
  NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
  NCCL_IGNORE_CPU_AFFINITY=1
  NCCL_NET=IB
  NCCL_SOCKET_IFNAME=enP7s7,enp1s0f0np0,enP2p1s0f0np0
  NODE_IP=10.24.11.13
  OMPI_MCA_btl_tcp_if_include=enP7s7
  TP_SOCKET_IFNAME=enP7s7
  UCX_NET_DEVICES=rocep1s0f0:1,roceP2p1s0f0:1

Cache Paths:
  sparkrun (local):   /home/drew/.cache/sparkrun
  HF cache (local):   /home/drew/.cache/huggingface
  HF cache (remote):  /home/drew/.cache/huggingface

Directory Status:
  Host                           SR exists  SR size    HF exists  HF size    Free Space   HF path
  ----------------------------------------------------------------------------------------------------------------
  (local)                        yes        4.7M       yes        1.8T       527G         /home/drew/.cache/huggingface
  10.24.11.13                    yes        12M        yes        2.5T       637G         /home/drew/.cache/huggingface
  10.24.11.14                    yes        13M        yes        2.5T       867G         /home/drew/.cache/huggingface
  10.24.11.16                    yes        13M        yes        2.5T       876G         /home/drew/.cache/huggingface
  10.24.11.17                    yes        12K        yes        2.7T       699G         /home/drew/.cache/huggingface

@dbsci I’m getting this error. How can I fix it?

vllm serve: error: argument --default-chat-template-kwargs: invalid loads value: '{{"preserve_thinking":true}}'

This error occurred after executing a recipe (rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm).

I am doing this off the top of my head / unverified, but I suspect that’s a quoting artifact.

Fix the {{ and }} (they probably should be { and }). The value within the single quotes ('') should be a JSON string loadable via json.loads(...).
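You can check a candidate value yourself before putting it in the recipe. This is just a small sketch using Python's standard json module; check_kwargs is a made-up helper name, and the two strings are the value from the error message and the same value with single braces.

```python
import json


def check_kwargs(value: str):
    """Return the parsed object if `value` is valid JSON, otherwise None."""
    try:
        return json.loads(value)
    except json.JSONDecodeError as exc:
        print(f"invalid JSON {value!r}: {exc}")
        return None


broken = '{{"preserve_thinking":true}}'  # value from the error message
fixed = '{"preserve_thinking":true}'     # same value with single braces

print(check_kwargs(broken))
print(check_kwargs(fixed))
```

The doubled-brace version is rejected and returns None, while the single-brace version parses to {'preserve_thinking': True}.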

As I am getting creative, I realized that I have no idea where mods in the style of eugr’s recipes go. Which directory do I put such things in for my local YAML-based configs?

Wonderful question. That’s an open discussion (on its way to being implemented)…
Currently: mods are only supported specifically for eugr’s repo!
Options:

  1. Available relative to recipe path (so you can have mods/… alongside my-awesome-recipe.yaml)
    [One benefit: works the same for git and local recipe files]
  2. Allow recipe registries to declare a mods directory, and mods come from the mods directory from the same registry as the recipe file [downside: confusing if using local files for recipes]
  3. Allow mods to come from arbitrary (but hopefully not that arbitrary) git repos [using the sparkrun registry syntax would make sense – @my-registry/mods/some-mod, cool part there would be the ability to reference other people’s mods without needing to duplicate them – but bad part is security-wise, that seems less than ideal if they make changes after you’ve last checked…]
  4. All of the above – by default, we search adjacent to recipe, then fallback to same-registry mods path, and if given with explicit syntax, then we search a given repo’s mod path.

Thoughts? Other ideas? I guess I’m leaning towards all of the above because then it supports full range of intuitive and power-user friendly options.
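To make the "all of the above" lookup order concrete, here is a rough sketch of how option 4 could resolve a mod. Everything in it is hypothetical: find_mod, its parameters, and the mapping of registry names to local checkouts are illustration only and do not exist in sparkrun today.

```python
from pathlib import Path
from typing import Optional


def find_mod(
    mod_ref: str,
    recipe_path: Path,
    registry_mods_dir: Optional[Path] = None,
    registries: Optional[dict[str, Path]] = None,
) -> Optional[Path]:
    """Resolve a mod using the option 4 search order (hypothetical helper)."""
    registries = registries or {}

    # Explicit syntax "@registry/mods/name": only look in that registry.
    if mod_ref.startswith("@"):
        registry, _, rel = mod_ref[1:].partition("/")
        root = registries.get(registry)
        candidate = root / rel if root is not None else None
        return candidate if candidate is not None and candidate.exists() else None

    # 1. A mods/ directory next to the recipe file.
    adjacent = recipe_path.parent / "mods" / mod_ref
    if adjacent.exists():
        return adjacent

    # 2. Fall back to the mods directory of the recipe's own registry.
    if registry_mods_dir is not None:
        fallback = registry_mods_dir / mod_ref
        if fallback.exists():
            return fallback

    return None
```

With this order, a mods/ folder next to my-awesome-recipe.yaml wins over the registry copy, and a reference like @my-registry/mods/some-mod never falls back to local paths, which keeps the explicit form predictable.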

I just converted eugr’s “mods” to equivalent scripts/command in “pre_exec” fields in my own local recipes.

That works. I mean that’s what pre_exec is for… (technically eugr mods are converted to pre_exec statements on the fly).

However, I think that promoting “mods” to be a generic recipe component instead of eugr-specific makes sense for the long-run (pre_exec isn’t going anywhere…). People are used to them / the standard has been set by usage. So, a generic pass that converts mods to pre_exec statements compatible with eugr recipes and other recipe sources is definitely on the roadmap.

I think the personal recipe registry is perhaps ideal for how sparkrun behaves. Maybe not everyone has git, but the layout on disk could still mirror that and make it easy when folks run their own Gitea or equivalent. So 3 is my primary pick.

I had to symlink in a self-made mod this morning, so I’d be in favor of option 1 at the bare minimum. I also think that would be optimal for iterating on a mod with a coding agent.

Hi, I tried running a model that is on an NFS share mounted on all the nodes, and I put the local path of the model instead of the HF identifier. It fails because it can’t find the local path in the HF cache… Is there a way to do a sparkrun run on a YAML that has a local path to a model?

The correct sparkrun way to do that is to configure the NFS path as your cache directory path as part of the cluster configuration.

Example:
sparkrun cluster update my-cluster --cache-dir /mnt/nfs-path

So the idea is that you shouldn’t need to change how you refer to models in the recipe YAML; instead, you configure the cluster to use the shared NFS path as the cache directory. Where models are stored is considered part of the cluster configuration. Recipes should be able to stay stable when moved across different clusters/scenarios. If you do it that way, then running a new model would also download the model to NFS and then operate from there.

Thank you, I will try it that way.