sparkrun - a central command, with tab completion, for launching inference on DGX Spark clusters

Experimental but hopefully useful release: sparkrun!

Repo: GitHub - scitrera/sparkrun: sparkrun - launch, manage, and stop LLM inference workloads on NVIDIA DGX Spark systems

Run everything (vllm + sglang + llama.cpp); solo or cluster; get VRAM estimates; easily distribute and share recipes!

Installation

# uv is the preferred mechanism for managing Python environments
# To install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh

# automatic installation via uvx (manages the virtual environment,
# creates an alias in your shell, and sets up autocomplete too!)
uvx sparkrun setup install

WITH TAB COMPLETION SUPPORT FOR RECIPES AND OPTIONS ;-)

Create a Cluster

# Save your hosts once; you can have multiple named clusters (a single machine works too: use 127.0.0.1)
sparkrun cluster create mylab --hosts 192.168.11.13,192.168.11.14 -d "My DGX Spark lab"
sparkrun cluster set-default mylab

Run a model

# Run Qwen3-1.7b-sglang
sparkrun run qwen3-1.7b-sglang 

# Run Qwen3-1.7b-vllm
sparkrun run qwen3-1.7b-vllm

Models come from recipes, which are compatible with the recipes in GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks. In fact, part of the idea is that this is a generic launcher, aligned with @eugr and @raphael.amorim's spark-area.com direction: when used with recipes from eugr's repo, sparkrun runs eugr's scripts directly. Otherwise, you can use vllm or sglang with other images (mine, NVIDIA's, your own, etc.)
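
Purely as an illustration of the recipe idea, a minimal recipe file might look roughly like the sketch below. The field names are guesses inferred from the `sparkrun show` output later in this post, not the authoritative spec - see RECIPES.md in the sparkrun repo for the real format:

```yaml
# Hypothetical recipe sketch -- field names inferred from `sparkrun show`
# output, not guaranteed to match the actual RECIPES.md spec.
name: qwen3-1.7b-sglang
description: Qwen3 1.7B -- small test model, solo or cluster (SGLang)
runtime: sglang
model: Qwen/Qwen3-1.7B
container: scitrera/dgx-spark-sglang:0.5.8-t5
defaults:
  served_model_name: qwen3-1.7b
  tensor_parallel: 1
  host: 0.0.0.0
  port: 8000
command: |
  python3 -m sglang.launch_server \
    --model-path {model} \
    --served-model-name {served_model_name} \
    --tp-size {tensor_parallel} \
    --host {host} \
    --port {port}
```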

VRAM Estimates!

$ sparkrun show qwen3-1.7b-sglang 
Name:         qwen3-1.7b-sglang
Description:  Qwen3 1.7B -- small test model, solo or cluster (SGLang)
Maintainer:   scitrera.ai <open-source-team@scitrera.com>
Runtime:      sglang
Model:        Qwen/Qwen3-1.7B
Container:    scitrera/dgx-spark-sglang:0.5.8-t5
Nodes:        1 - unlimited
Repository:   Local
File Path:    /home/drew/oss-sparkrun/recipes/qwen3-1.7b-sglang.yaml

Defaults:
  gpu_memory_utilization: 0.3
  host: 0.0.0.0
  port: 8000
  served_model_name: qwen3-1.7b
  tensor_parallel: 1

Command:
  python3 -m sglang.launch_server \
    --model-path {model} \
    --served-model-name {served_model_name} \
    --mem-fraction-static {gpu_memory_utilization} \
    --tp-size {tensor_parallel} \
    --host {host} \
    --port {port} \
    --reasoning-parser deepseek-r1 \
    --trust-remote-code

VRAM Estimation:
  Model dtype:      bf16
  Model params:     1,700,000,000
  KV cache dtype:   bfloat16
  Architecture:     28 layers, 8 KV heads, 128 head_dim
  Model weights:    3.17 GB
  Tensor parallel:  1
  Per-GPU total:    3.17 GB
  DGX Spark fit:    YES

  GPU Memory Budget:
    gpu_memory_utilization: 30%
    Usable GPU memory:     36.3 GB (121 GB x 30%)
    Available for KV:      33.1 GB
    Max context tokens:    310,205
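
The numbers in that report can be reproduced with straightforward arithmetic. Here is a sketch of the calculation (my reconstruction from the printed values, assuming GiB units and bf16 = 2 bytes per element; this is not sparkrun's actual code):

```python
GIB = 1024 ** 3

def estimate_vram(params, dtype_bytes, layers, kv_heads, head_dim,
                  total_gib=121, gpu_mem_util=0.30, kv_cache_bytes=2):
    """Back-of-the-envelope VRAM estimate for a dense transformer."""
    weights_gib = params * dtype_bytes / GIB
    usable_gib = total_gib * gpu_mem_util
    # K and V caches: 2 tensors per layer, kv_heads * head_dim elements per token
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * kv_cache_bytes
    available_kv_gib = usable_gib - weights_gib
    max_context_tokens = int(available_kv_gib * GIB / kv_bytes_per_token)
    return weights_gib, usable_gib, kv_bytes_per_token, max_context_tokens

# Qwen3-1.7B in bf16: 28 layers, 8 KV heads, head_dim 128
weights, usable, per_token, max_tokens = estimate_vram(
    1_700_000_000, 2, 28, 8, 128)
print(f"{weights:.2f} GB weights, {usable:.1f} GB usable, ~{max_tokens:,} tokens")
```

This matches the report above: 3.17 GB of weights, 36.3 GB usable at 30% utilization, and roughly 310k tokens of KV headroom.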

Supports multiple registries – you can add your own or use local recipes

drew@spark-840b:~$ sparkrun list
Name                                  Runtime     Registry            File
---------------------------------------------------------------------------------------------------------
nemotron3-nano-30b-nvfp4-vllm         vllm        sparkrun-official   nemotron3-nano-30b-nvfp4-vllm
nemotron3-nano-30b-vllm               vllm        sparkrun-official   nemotron3-nano-30b-vllm
qwen3-1.7b-sglang                     sglang      sparkrun-official   qwen3-1.7b-sglang
qwen3-1.7b-vllm                       vllm        sparkrun-official   qwen3-1.7b-vllm
qwen3-coder-next-fp8-sglang-cluster   sglang      sparkrun-official   qwen3-coder-next-fp8-sglang-cluster
GLM-4.7-Flash-AWQ                     eugr-vllm   eugr-vllm           glm-4.7-flash-awq
MiniMax-M2-AWQ                        eugr-vllm   eugr-vllm           minimax-m2-awq
MiniMax-M2.5-AWQ                      eugr-vllm   eugr-vllm           minimax-m2.5-awq
Nemotron-3-Nano-NVFP4                 eugr-vllm   eugr-vllm           nemotron-3-nano-nvfp4
OpenAI GPT-OSS 120B                   eugr-vllm   eugr-vllm           openai-gpt-oss-120b
Qwen3-Coder-Next-FP8                  eugr-vllm   eugr-vllm           qwen3-coder-next-fp8

# eugr-vllm gets recipes from https://github.com/eugr/spark-vllm-docker and basically
# passes through to the scripts there; sparkrun aims to be a unifying interface for
# running jobs, and it would be woefully incomplete without eugr's repo
drew@spark-840b:~$ sparkrun recipe --help
Usage: sparkrun recipe [OPTIONS] COMMAND [ARGS]...

  Manage recipe registries and search for recipes.

Options:
  --help  Show this message and exit.

Commands:
  add-registry     Add a new recipe registry.
  list             List available recipes from all registries.
  registries       List configured recipe registries.
  remove-registry  Remove a recipe registry.
  search           Search for recipes by name, model, or description.
  show             Show detailed recipe information.
  update           Update recipe registries from git.
  validate         Validate a recipe file.
  vram             Estimate VRAM usage for a recipe on DGX Spark.

8 Likes

Follow-up to myself: added a llama.cpp runtime. Also uploaded a llama.cpp Docker image: scitrera/dgx-spark-llama-cpp:b8076-cu131 and added an example recipe @ oss-spark-run/recipes/qwen3-1.7b-llama-cpp.yaml at main · scitrera/oss-spark-run · GitHub

Technically I (at least partially) have RPC-based cluster mode for llama.cpp as well, but it’s super experimental / not officially supported yet.

drew@spark-840b:~$ sparkrun search qwen3
Name                                  Runtime     Model                       Registry
-----------------------------------------------------------------------------------------------
qwen3-1.7b-llama-cpp                  llama-cpp   Qwen/Qwen3-1.7B-GGUF:Q8_0   sparkrun-official
qwen3-1.7b-sglang                     sglang      Qwen/Qwen3-1.7B             sparkrun-official
qwen3-1.7b-vllm                       vllm        Qwen/Qwen3-1.7B             sparkrun-official
qwen3-coder-next-fp8-sglang-cluster   sglang      Qwen/Qwen3-Coder-Next-FP8   sparkrun-official
Qwen3-Coder-Next-FP8                  eugr-vllm   Qwen/Qwen3-Coder-Next-FP8   eugr-vllm


drew@spark-840b:~$ sparkrun run qwen3-1.7b-llama-cpp

Ensuring container image is available locally...
Image already available: scitrera/dgx-spark-llama-cpp:b8076-cu131
Ensuring model Qwen/Qwen3-1.7B-GGUF:Q8_0 is available locally...
GGUF model Qwen/Qwen3-1.7B-GGUF:Q8_0 already cached
GGUF model pre-synced, container path: /root/.cache/huggingface/hub/models--Qwen--Qwen3-1.7B-GGUF/snapshots/90862c4b9d2787eaed51d12237eafdfe7c5f6077/Qwen3-1.7B-Q8_0.gguf
Runtime:   llama-cpp
Image:     scitrera/dgx-spark-llama-cpp:b8076-cu131
Model:     Qwen/Qwen3-1.7B-GGUF:Q8_0
Cluster:   sparkrun_5b4a9ab3c4a9
Mode:      solo

VRAM Estimation:
  Model dtype:      q8_0
  Model params:     1,700,000,000
  KV cache dtype:   bfloat16
  Model weights:    1.58 GB
  Tensor parallel:  1
  Per-GPU total:    1.58 GB
  DGX Spark fit:    YES
  Warning: Missing architecture info (num_layers, num_kv_heads, head_dim); KV cache estimate unavailable

Hosts:     default cluster 'DGXSolo'
  Target:  127.0.0.1

Serve command:
  llama-server \
      -m /root/.cache/huggingface/hub/models--Qwen--Qwen3-1.7B-GGUF/snapshots/90862c4b9d2787eaed51d12237eafdfe7c5f6077/Qwen3-1.7B-Q8_0.gguf \
      --host 0.0.0.0 \
      --port 8000 \
      --n-gpu-layers 99 \
      --ctx-size 8192 \
      --flash-attn on \
      --jinja \
      --no-webui

Step 1/3: Detecting InfiniBand on 127.0.0.1...
  InfiniBand detected locally, NCCL configured
Step 1/3: IB detection done (0.3s)
Step 2/3: Launching container sparkrun_5b4a9ab3c4a9_solo on 127.0.0.1 (image: scitrera/dgx-spark-llama-cpp:b8076-cu131)...
Step 2/3: Container launched (0.3s)
Step 3/3: Executing serve command in sparkrun_5b4a9ab3c4a9_solo...
Step 3/3: Serve command dispatched (3.1s)
Following serve logs in container 'sparkrun_5b4a9ab3c4a9_solo' on 127.0.0.1 (Ctrl-C to stop)...
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 40960
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 1.7B
print_info: model params          = 1.72 B
print_info: general.name          = Qwen3 1.7B Instruct
print_info: vocab type            = BPE
print_info: n_vocab               = 151936
print_info: n_merges              = 151387
print_info: BOS token             = 151643 '<|endoftext|>'
print_info: EOS token             = 151645 '<|im_end|>'
print_info: EOT token             = 151645 '<|im_end|>'
print_info: PAD token             = 151643 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151659 '<|fim_prefix|>'
print_info: FIM SUF token         = 151661 '<|fim_suffix|>'
print_info: FIM MID token         = 151660 '<|fim_middle|>'
print_info: FIM PAD token         = 151662 '<|fim_pad|>'
print_info: FIM REP token         = 151663 '<|repo_name|>'
print_info: FIM SEP token         = 151664 '<|file_sep|>'
print_info: EOG token             = 151643 '<|endoftext|>'
print_info: EOG token             = 151645 '<|im_end|>'
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 27 repeating layers to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   315.30 MiB
load_tensors:        CUDA0 model buffer size =  1743.77 MiB
........................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.32 MiB
llama_kv_cache:      CUDA0 KV buffer size =   896.00 MiB
llama_kv_cache: size =  896.00 MiB (  8192 cells,  28 layers,  4/1 seqs), K (f16):  448.00 MiB, V (f16):  448.00 MiB
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size =   324.76 MiB
sched_reserve:  CUDA_Host compute buffer size =    24.01 MiB
sched_reserve: graph nodes  = 987
sched_reserve: graph splits = 2
sched_reserve: reserve took 21.46 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv    load_model: initializing slots, n_slots = 4
no implementations specified for speculative decoding
slot   load_model: id  0 | task -1 | speculative decoding context not initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 8192
no implementations specified for speculative decoding
slot   load_model: id  1 | task -1 | speculative decoding context not initialized
slot   load_model: id  1 | task -1 | new slot, n_ctx = 8192
no implementations specified for speculative decoding
slot   load_model: id  2 | task -1 | speculative decoding context not initialized
slot   load_model: id  2 | task -1 | new slot, n_ctx = 8192
no implementations specified for speculative decoding
slot   load_model: id  3 | task -1 | speculative decoding context not initialized
slot   load_model: id  3 | task -1 | new slot, n_ctx = 8192
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8000
main: starting the main loop...
srv  update_slots: all slots are idle

Awesome work, just in time for the arrival of my 2nd GX10! :)

1 Like

recipe spec: oss-spark-run/RECIPES.md at main · scitrera/oss-spark-run · GitHub

1 Like

Added sparkrun setup ssh to help you configure SSH meshing ;-)

see README for details or sparkrun setup ssh --help

drew@spark-840b:~$ sparkrun setup ssh --help
Usage: sparkrun setup ssh [OPTIONS]

  Set up passwordless SSH mesh across cluster hosts.

  Ensures every host can SSH to every other host without password prompts.
  Creates ed25519 keys if missing and distributes public keys.

  By default, the machine running sparkrun is included in the mesh (--include-
  self). Use --no-include-self to exclude it.

  You will be prompted for passwords on first connection to each host.

  Examples:

    sparkrun setup ssh --hosts 192.168.11.13,192.168.11.14

    sparkrun setup ssh --cluster mylab --user ubuntu

    sparkrun setup ssh --cluster mylab --extra-hosts 10.0.0.1

Options:
  -H, --hosts TEXT                Comma-separated host list
  --hosts-file TEXT               File with hosts (one per line, # comments)
  --cluster CLUSTER               Use a saved cluster by name
  --extra-hosts TEXT              Additional comma-separated hosts to include
                                  (e.g. control machine)
  --include-self / --no-include-self
                                  Include this machine's hostname in the mesh
                                  [default: include-self]
  -u, --user TEXT                 SSH username (default: current user)
  -n, --dry-run                   Show what would be done
  --help                          Show this message and exit.
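
For intuition, the "mesh" part is just full-graph key distribution: with n hosts you need n keypairs and n·(n−1) directed trust relationships. A toy sketch of that bookkeeping (illustrative only, not sparkrun's implementation):

```python
def mesh_plan(hosts):
    """Full SSH mesh: every host holds every other host's public key."""
    keygen = list(hosts)  # one ed25519 keypair per host (created if missing)
    copies = [(src, dst) for src in hosts for dst in hosts if src != dst]
    return keygen, copies

keygen, copies = mesh_plan(["192.168.11.13", "192.168.11.14", "10.0.0.1"])
print(len(keygen), len(copies))  # 3 keypairs, 6 directed key copies
```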

I've been wondering when someone would simplify the vllm command structure and cluster setup. I started using your script, but I get a hang when the cluster is starting:
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=511, ip=192.168.177.12) WARNING 02-17 17:39:32 [worker_base.py:297] Missing shared_worker_lock argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=1595) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=1595) Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=1595) Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=1595) (8.0) - (12.0)
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=1595)
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=1595) warnings.warn(
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=511, ip=192.168.177.12) INFO 02-17 17:39:33 [parallel_state.py:1307] world_size=2 rank=1 local_rank=0 distributed_init_method=tcp://192.168.177.11:34063 backend=nccl
(EngineCore_DP0 pid=1486) (RayWorkerWrapper pid=1595) INFO 02-17 17:39:34 [pynccl.py:111] vLLM is using nccl==2.27.7
[Stops here]

What recipe are you starting? It’s also possible that it’s giving no feedback while downloading models. There is a bug in model data synchronization that I am about to release a fix for. (vllm can be bad about not giving feedback on downloads).

Is there a lot of network (Internet) activity?

1 Like

I released the next version of sparkrun (v0.0.11). It fixes some bugs in the model cache checks and adds model revision pinning to the recipe spec. If you installed via the uvx route recommended in the README, you can run:

drew@spark-840b:~$ sparkrun setup update
Updating sparkrun...
sparkrun, version 0.0.11

drew@spark-840b:~$ sparkrun setup update
Checking for updates (current: 0.0.11)...
sparkrun 0.0.11 is already the latest version.

(technically you don’t need to run it again but I wanted to see the change in response since I also improved the version checking/handling in v0.0.11).

2 Likes

Added experimental SGLang GGUF support, a bunch of fixes, AND… an experimental Claude Code plugin that lets Claude Code manage starting and stopping inference jobs on your Spark(s).

I’ll probably iterate on the claude code plugin fairly quickly because I plan to use it to really accelerate some work when I need to switch/try different models!!

sparkrun also has a website with docs now:

1 Like

Thank you for your contribution dbsci! I’ve moved this thread to GB 10 Projects.

1 Like

v0.0.23 Released – lots of new features and fixes since I last posted something to forums.
sparkrun setup update

  • Added experimental sparkrun tune vllm and sparkrun tune sglang commands to help create Triton MoE tunings. They are saved to sparkrun's local cache directory and automatically mounted when you launch a container. Ideally we'd upstream these configs, and/or perhaps they could live in registries in the meantime. This feature is still experimental (i.e. expect bugs), but if it makes Triton MoE kernel auto-tuning easy, maybe we'll all get a slight performance bump on popular models from GB10-specific configurations. Docs: sparkrun tune | sparkrun

  • Significant fixes to VRAM estimation – VRAM estimates and KV parameters should now come through without manual effort on most HF models

  • Added a Qwen3.5-35B-A3B recipe for sglang to sparkrun's default registry; run sparkrun recipe update to download the latest recipes from all registries

  • Heuristics to make runtime selection automatic in most cases (trying to reduce recipe boilerplate where possible) – updated spec at: Recipe Format | sparkrun

  • vllm now uses torch distributed by default instead of Ray (Ray is still available, and is still used by default with recipes from eugr's repo or recipes that otherwise use eugr's build mods system). While it probably doesn't make much difference, we weren't really taking advantage of Ray's benefits, so it was just overhead. That said, the vllm Ray runtime isn't going away.

  • Improvements to support multiple simultaneous deployments on the same cluster/nodes (e.g. autoincrementing ports to avoid collisions)

  • Lots of miscellaneous fixes and improvements throughout (e.g. hanging spaces after backslashes in recipe commands are automatically scrubbed instead of causing hard to diagnose CLI issues)
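
On the port-collision point: a second deployment on the same node can't reuse 8000. A minimal sketch of how auto-incrementing port selection can work (hypothetical helper, not sparkrun's actual logic):

```python
import socket

def next_free_port(start=8000, attempts=100, host="127.0.0.1"):
    """Return the first TCP port >= start that can be bound on this host."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                continue  # already taken by another deployment
            return port
    raise RuntimeError(f"no free port in [{start}, {start + attempts})")
```

For a multi-node launch you'd presumably need to run a check like this per host and agree on a port that is free everywhere.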

2 Likes

Cool project! I recently tested the new Qwen 3.5 397b model. How do I keep the model cached? I closed it and reopened it, and it started to redownload the model again.

It should download the model to your local user's huggingface cache and, in the case of a cluster, distribute the model to the other nodes to avoid extra redownloads. (So it should already be caching it.)

It might be failing due to a permissions issue on your huggingface cache directory, which would prevent things from working properly and force it to redownload a lot.

try sparkrun setup fix-permissions to have it try to fix the permissions on your huggingface cache (it will prompt you for your sudo/user password)

and if that worked, you can also run sparkrun setup fix-permissions --save-sudo, which adds a sudoers entry that allows your user to fix permissions on the huggingface cache without entering a password. (Once that's done, sparkrun will fix permissions on its own before downloading model files, so this doesn't happen anymore.)

Also make sure you’re using the latest version sparkrun setup update!

Hope that helps. Check out: HuggingFace cache owned by root and Fixing Cache Permissions on the sparkrun.dev docs website if you want more details about those commands.

Let me know how it turns out.

1 Like

I think that did it! It’s at 5G/190G right now. 😅

Thank you.

1 Like

Awesome. That’s why I added that. I tried to anticipate all the things that typically go wrong and try to automate them all! ;-)

1 Like

sparkrun is magic, I really love it. I just noticed it does not make use of the HF_HOME env var. I am using an external disk for model caching; for now I ended up using a symbolic link from my .cache dir to the external disk, but it would be nice if HF_HOME support or a CLI switch could be added :) btw great job man!

2 Likes

There is a --cache-dir CLI option to set the cache path.

And since v0.1.1, the cache-dir option can be saved as a cluster option, so you don't have to type it repeatedly once your default cluster includes it.

I will add HF_HOME support in the next version. It makes sense that we’d follow HF defaults if not specified via CLI or explicit cluster config.
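
The precedence being discussed would look something like this (hypothetical sketch of the lookup order, not sparkrun's actual code; the Hugging Face default cache lives under ~/.cache/huggingface unless HF_HOME overrides it):

```python
import os
from pathlib import Path

def resolve_cache_dir(cli_cache_dir=None, cluster_cache_dir=None, env=None):
    """Most-specific wins: CLI flag > cluster config > HF_HOME > HF default."""
    env = os.environ if env is None else env
    if cli_cache_dir:
        return Path(cli_cache_dir)
    if cluster_cache_dir:
        return Path(cluster_cache_dir)
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"])
    return Path.home() / ".cache" / "huggingface"
```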

Thank you for the feedback. I’m really glad it’s useful to you!

Cool I will test it !

2 Likes

I really like your program, but I’m not very familiar with how to use it. I have a MacBook and a DGX Spark, and the usernames on these two devices are different. After connecting to the DGX Spark terminal via NVIDIA Sync from my MacBook, I installed the program and ran:


gt-spark@GT-Spark:~/sparkrun$ sparkrun setup ssh --hosts 192.168.31.128,192.168.31.46 --user gt-spark
Setting up SSH mesh for user 'gt-spark' across 2 hosts...
Cluster Hosts: 192.168.31.128, 192.168.31.46

=== Phase 1: Connectivity check ===
[*] Checking SSH connectivity to gt-spark@192.168.31.128 ...
Warning: Permanently added '192.168.31.128' (ED25519) to the list of known hosts.
gt-spark@192.168.31.128's password: 
[*] Checking SSH connectivity to gt-spark@192.168.31.46 ...
ssh: connect to host 192.168.31.46 port 22: Connection refused

I just want something simpler, like running sparkrun directly on my single DGX Spark. When I try:


gt-spark@GT-Spark:~/sparkrun$ sparkrun run recipe.yaml 
Error: No hosts specified. Use --hosts or configure defaults.

After adding --hosts 192.168.31.128:


gt-spark@GT-Spark:~/sparkrun$ sparkrun run recipe.yaml --hosts 192.168.31.128
Detecting InfiniBand on 1 host(s)...
  Running script in parallel on 1 hosts: 192.168.31.128
  SSH script <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).
  Parallel execution done: 0/1 OK (0.1s total)
  No InfiniBand detected, using default networking
  No IB IPs found, transfers will use management network
Distributing image 'scitrera/dgx-spark-sglang:0.5.9-t5' from local to 1 host(s)
Pulling image: scitrera/dgx-spark-sglang:0.5.9-t5...
Failed to pull image scitrera/dgx-spark-sglang:0.5.9-t5: permission denied while trying to connect to the docker API at unix:///var/run/docker.sock

Failed to ensure local image 'scitrera/dgx-spark-sglang:0.5.9-t5' — aborting distribution
Image distribution failed on: 192.168.31.128
Distributing model 'Qwen/Qwen3.5-35B-A3B' from local to 1 host(s)
Model Qwen/Qwen3.5-35B-A3B appears cached — verifying completeness...
Fetching 27 files: 100%|█████████████████████████████████████████████████████████████| 27/27 [00:00<00:00, 41850.04it/s]
Download complete. Model downloaded successfully: Qwen/Qwen3.5-35B-A3B
  Running script in parallel on 1 hosts: 192.168.31.128
  SSH script <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).
  Parallel execution done: 0/1 OK (0.1s total)
Could not fix cache ownership on 1 host(s) — rsync may fail if Docker left root-owned files.  Run 'sparkrun setup fix-permissions --save-sudo' to enable passwordless chown for future runs.
  Running rsync in parallel to 1 hosts: 192.168.31.128
  Rsync -> 192.168.31.128
  Rsync <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(232) [sender
  Parallel rsync done: 0/1 OK (0.1s total)
Model distribution failed on hosts: ['192.168.31.128']
Model distribution failed on: 192.168.31.128
Distribution complete.
Runtime:   sglang
Image:     scitrera/dgx-spark-sglang:0.5.9-t5
Model:     Qwen/Qwen3.5-35B-A3B
Cluster:   sparkrun_c6b99162b027
Mode:      solo

VRAM Estimation:
  Model dtype:      bfloat16
  Model params:     35,951,827,504
  KV cache dtype:   bfloat16
  Architecture:     40 layers, 2 KV heads, 256 head_dim
  Model weights:    66.97 GB
  KV cache:         20.00 GB (max_model_len=262,144)
  Tensor parallel:  1
  Per-GPU total:    86.97 GB
  DGX Spark fit:    YES

  GPU Memory Budget:
    gpu_memory_utilization: 80%
    Usable GPU memory:     96.8 GB (121 GB x 80%)
    Available for KV:      29.8 GB
    Max context tokens:    391,046
    Context multiplier:    1.5x (vs max_model_len=262,144)

Hosts:     --hosts
  Target:  192.168.31.128

Serve command:
  python3 -m sglang.launch_server \
      --model-path Qwen/Qwen3.5-35B-A3B \
      --served-model-name qwen3.5-35b \
      --context-length 262144 \
      --mem-fraction-static 0.8 \
      --tp-size 1 \
      --host 0.0.0.0 \
      --port 8000 \
      --attention-backend triton \
      --reasoning-parser {reasoning_parser} \
      --tool-call-parser qwen3_coder \
      --fp8-gemm-backend cutlass \
      --speculative-algo {speculative_algo} \
      --speculative-num-steps {speculative_num_steps} \
      --speculative-eagle-topk {speculative_eagle_topk} \
      --speculative-num-draft-tokens {speculative_num_draft_tokens} \
      --trust-remote-code

  Running script in parallel on 1 hosts: 192.168.31.128
  SSH script <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).
  Parallel execution done: 0/1 OK (0.1s total)
Could not clear page cache on 1 host(s) — run 'sparkrun setup clear-cache --save-sudo' to enable passwordless cache clearing for future runs.
Step 1/3: Using pre-detected NCCL env (0 vars)
Step 2/3: Launching container sparkrun_c6b99162b027_solo on 192.168.31.128 (image: scitrera/dgx-spark-sglang:0.5.9-t5)...
  SSH script <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).
Failed to launch container: gt-spark@192.168.31.128: Permission denied (publickey,password).

Download complete: 0.00B [00:02, ?B/s]

Based on the prompt, I ran sparkrun setup clear-cache --save-sudo --hosts 192.168.31.128, but even though I entered the correct password, it failed:


gt-spark@GT-Spark:~/sparkrun$ sparkrun setup clear-cache --save-sudo --hosts 192.168.31.128
Clearing page cache on 1 host(s)...

Installing sudoers entry for passwordless cache clearing...
[sudo] password for gt-spark: 
  SSH sudo script <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).
  [FAIL] 192.168.31.128: gt-spark@192.168.31.128: Permission denied (publickey,password).
Sudoers install: 0 OK, 1 failed.

  Running script in parallel on 1 hosts: 192.168.31.128
  SSH script <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).
  Parallel execution done: 0/1 OK (0.1s total)
  SSH sudo script <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).

Sudo authentication failed on 1 host(s). Retrying individually...
[sudo] password for gt-spark @ 192.168.31.128: 
  SSH sudo script <- 192.168.31.128 FAILED rc=255 (0.1s): gt-spark@192.168.31.128: Permission denied (publickey,password).
  [FAIL] 192.168.31.128: gt-spark@192.168.31.128: Permission denied (publickey,password).

Results: 1 failed.

I just want to be able to run LLM inference more simply on my single DGX Spark. Could you help me with this? I would be very grateful.

If you're running these commands from bash on the Spark itself (e.g. over SSH) and you only have a single one, you can try the following:

your user needs to be a member of the docker group – that isn't sparkrun-specific; you need it for any containerized workloads with docker (log out and back in afterwards for the group change to take effect)
sudo usermod -aG docker "gt-spark"

creating a default "cluster" (gt in this case) is an easy way to preconfigure defaults for sparkrun commands
sparkrun cluster create gt --hosts 127.0.0.1
sparkrun cluster set-default gt

then try
sparkrun run recipe.yaml

(the usermod and sparkrun cluster commands are one-time setup commands, so you shouldn't need to repeat them; just use sparkrun run <recipe> in the future).

Let me know if that works for you.