Time for experiments! sparkrun: a central command, with tab completion, for launching inference on DGX Spark clusters.
Centralized tool for running inference recipes drawn from multiple registries, with the ability to add your own; it supports eugr's vLLM builds, my container images, and others (NVIDIA's too). Tab autocompletion for recipe lookups, commands, etc. Named clusters and a saved default make working with multiple clusters easy, in both solo and cluster modes. Built-in VRAM estimation. Supports vLLM and SGLang, and it's easy to add more runtimes (pass-through delegation to the recipe scripts in eugr's repo is implemented as just another runtime). The recipe design was adjusted to be extremely similar to yours and @eugr's recipes, so that (1) they stay compatible for working together and (2) we can iterate on some details.
I'm also interested in collaborating and contributing it to a community org that manages such resources.
drew@spark-840b:~$ sparkrun search qwen3
Name                                 Runtime    Model                      Registry
-----------------------------------------------------------------------------------
qwen3-1.7b-sglang                    sglang     Qwen/Qwen3-1.7B            sparkrun-official
qwen3-1.7b-vllm                      vllm       Qwen/Qwen3-1.7B            sparkrun-official
qwen3-coder-next-fp8-sglang-cluster  sglang     Qwen/Qwen3-Coder-Next-FP8  sparkrun-official
Qwen3-Coder-Next-FP8                 eugr-vllm  Qwen/Qwen3-Coder-Next-FP8  eugr-vllm
This is extracted from what I was working on before: basically, I was making what I thought NVIDIA Sync should have been, which also included way more UI, a ConnectX-7 setup wizard, etc. It's way faster to dump stuff into a CLI.
And making sure tab completion worked was like the best decision ever… it really does make life better…
drew@spark-840b:~$ sparkrun cluster create DGXSolo --hosts 127.0.0.1
Created cluster 'DGXSolo' with 1 hosts
drew@spark-840b:~$ sparkrun cluster set-default DGXSolo
Set default cluster to 'DGXSolo'
drew@spark-840b:~$ sparkrun run qwen3-
qwen3-1.7b-sglang qwen3-1.7b-vllm qwen3-coder-next-fp8-sglang-cluster qwen3-coder-next-fp8
drew@spark-840b:~$ sparkrun run qwen3-1.7b-
qwen3-1.7b-sglang qwen3-1.7b-vllm
drew@spark-840b:~$ sparkrun run qwen3-1.7b-sglang
Ensuring container image is available locally...
Image already available: scitrera/dgx-spark-sglang:0.5.8-t5
Ensuring model Qwen/Qwen3-1.7B is available locally...
Downloading model: Qwen/Qwen3-1.7B...
Fetching 12 files: 100%|████████████████████| 12/12 [00:24<00:00, 2.07s/it]
Download complete: 100%|████████████████████| 4.06G/4.06G [00:24<00:00, 335MB/s]
Model downloaded successfully: Qwen/Qwen3-1.7B
Runtime: sglang
Image: scitrera/dgx-spark-sglang:0.5.8-t5
Model: Qwen/Qwen3-1.7B
Cluster: sparkrun_80efe3c1ea32
Mode: solo
config.json: 100%|████████████████████| 726/726 [00:00<00:00, 14.1MB/s]
VRAM Estimation:
Model dtype: bf16
Model params: 1,700,000,000
KV cache dtype: bfloat16
Architecture: 28 layers, 8 KV heads, 128 head_dim
Model weights: 3.17 GB
Tensor parallel: 1
Per-GPU total: 3.17 GB
DGX Spark fit: YES
GPU Memory Budget:
gpu_memory_utilization: 30%
Usable GPU memory: 36.3 GB (121 GB x 30%)
Available for KV: 33.1 GB
Max context tokens: 310,205
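For the curious, the numbers in the estimate above check out. Here's a back-of-envelope sketch of the same arithmetic (my own reconstruction, not sparkrun's actual code; the 121 GB budget, 30% utilization, and model shape are taken straight from the output above):

```python
# Reconstructing sparkrun's VRAM estimate for Qwen3-1.7B (bf16 weights,
# bfloat16 KV cache, 28 layers, 8 KV heads, head_dim 128).
GIB = 1024 ** 3

params = 1_700_000_000            # model parameters
bytes_per_param = 2               # bf16 = 2 bytes/param
layers, kv_heads, head_dim = 28, 8, 128
kv_elem_bytes = 2                 # bfloat16 KV cache

# Model weights: 1.7B params x 2 bytes ≈ 3.17 GiB
weights_gib = params * bytes_per_param / GIB

# GPU budget: 121 GiB x 30% utilization = 36.3 GiB usable
usable_gib = 121 * 0.30

# Whatever isn't weights is available for the KV cache (~33.1 GiB)
kv_budget_gib = usable_gib - weights_gib

# Per token, the cache stores K and V for every layer:
# 2 x layers x kv_heads x head_dim x 2 bytes = 114,688 bytes/token
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * kv_elem_bytes

max_context = int(kv_budget_gib * GIB / kv_bytes_per_token)

print(f"Model weights:      {weights_gib:.2f} GiB")
print(f"Available for KV:   {kv_budget_gib:.1f} GiB")
print(f"Max context tokens: {max_context:,}")
```

This reproduces the 3.17 GB / 33.1 GB / 310,205-token figures from the transcript, so the estimator appears to be doing straightforward weights-plus-KV accounting.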
Hosts: default cluster 'DGXSolo'
Target: 127.0.0.1
Serve command:
python3 -m sglang.launch_server \
--model-path Qwen/Qwen3-1.7B \
--served-model-name qwen3-1.7b \
--mem-fraction-static 0.3 \
--tp-size 1 \
--host 0.0.0.0 \
--port 8000 \
--reasoning-parser deepseek-r1 \
--trust-remote-code
Step 1/3: Detecting InfiniBand on 127.0.0.1...
InfiniBand detected locally, NCCL configured
Step 1/3: IB detection done (0.3s)
Step 2/3: Launching container sparkrun_80efe3c1ea32_solo on 127.0.0.1 (image: scitrera/dgx-spark-sglang:0.5.8-t5)...
Step 2/3: Container launched (0.8s)
Step 3/3: Executing serve command in sparkrun_80efe3c1ea32_solo...
Step 3/3: Serve command dispatched (3.1s)
Following serve logs in container 'sparkrun_80efe3c1ea32_solo' on 127.0.0.1 (Ctrl-C to stop)...
...logs...
Ctrl+C stops following the logs but does not terminate inference; you can easily reconnect to the logs:
drew@spark-840b:~$ sparkrun logs qwen3-1.7b-sglang
And you can easily stop the inference job (it can obviously also be stopped via docker):
drew@spark-840b:~$ sparkrun stop qwen3-1.7b-sglang
Workload stopped on 1 host(s).