Spark: one script CLI for setup, remote access, and LLM serving on DGX Spark

angelini92 · May 18, 2026, 2:55pm

Hey everyone,

This is my first time posting here, and hope I keep doing so :)

I recently acquired an Asus Ascent GX10 and had to face its setup and configuration so it could serve LLM models for me. I work with a MacBook Pro, so I needed remote access.

Even though it’s a straightforward but repetitive process, it must be the same for anyone acquiring one of these magic boxes, so I decided to create spark.

spark is a single Bash script that handles the full DGX Spark lifecycle: initial setup, remote access, and model serving with vLLM — everything from your laptop.

GitHub: https://github.com/massimo92/spark

Install & setup

curl -fsSL https://raw.githubusercontent.com/massimo92/spark/main/install.sh | bash

spark setup # Guided wizard — configures your laptop AND the DGX over SSH

The setup wizard runs entirely from your Mac/Linux machine. It connects to the DGX via SSH and configures everything in one pass: system updates, GPU check, Docker, NGC auth, HuggingFace CLI, Tailscale for remote access, SSH keys, and the vLLM container.

Serve a model

spark pull RedHatAI/Qwen3.6-35B-A3B-NVFP4

spark run RedHatAI/Qwen3.6-35B-A3B-NVFP4

curl localhost:8000/v1/models

What it does differently

Auto-profiler — reads config.json from the model and generates optimal vLLM flags automatically: reasoning parser (Qwen3, DeepSeek R1), tool-call parser, context length, multimodal detection, MoE architecture, and GPU memory utilization based on actual VRAM.
Zero dependencies — single Bash script, no Python, no package manager. Works on any system with bash and curl.
Remote-first — Tailscale integration so you can reach your DGX from anywhere. Setup disables password SSH after keys are configured.
Auto-update — checks for new spark CLI and NGC container versions once per day.

Available commands

spark setup          # Guided wizard (runs from laptop, configures DGX over SSH)

spark run <model> # Serve a model with vLLM (auto-detects optimal flags)

spark stop # Stop the running model

spark pull <model> # Download from HuggingFace

spark list # List downloaded models with sizes

spark status # What's running

spark doctor # Check all prerequisites

spark update # Update NGC vLLM container

spark run supports --tools (tool calling), --text-only (skip vision encoder), --dry-run, --force, --tail, and manual overrides for memory/context/port.

Would love feedback from other DGX Spark owners. Did it take too long for you to configure it? What’s missing? Don’t hesitate to open a PR :)

puiu.adrian · May 20, 2026, 8:17pm

hey hi ,i just tried it but it failes to load the vllm model . i installed only on the dgx .. so i m intending to launch straight from the spark via tailscale .

agp@gx10-8828:~$ spark logs -f 

==========
== vLLM ==
==========

NVIDIA Release 26.04 (build 299333414)
vLLM Version 0.19.0+6bc3197f
Container image Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 13.2 driver version 595.58.03 with kernel driver version 580.159.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299] 
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0+6bc3197f.nv26.04.48680843
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299]   █▄█▀ █     █     █     █  model   RedHatAI/Qwen3.6-35B-A3B-NVFP4
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299] 
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/Qwen3.6-35B-A3B-NVFP4', 'model': 'RedHatAI/Qwen3.6-35B-A3B-NVFP4', 'max_model_len': 32768, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.5}
(APIServer pid=1) WARNING 05-20 20:16:36 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_VERSION
(APIServer pid=1) WARNING 05-20 20:16:36 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_FLASH_ATTN_SRC_DIR
(APIServer pid=1) INFO 05-20 20:16:41 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1) INFO 05-20 20:16:41 [model.py:1678] Using max model len 32768
(APIServer pid=1) INFO 05-20 20:16:41 [config.py:281] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 05-20 20:16:41 [config.py:312] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 05-20 20:16:41 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-20 20:16:41 [compilation.py:290] Enabled custom fusions: act_quant
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=1)     tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=1)     raise ValueError(
(APIServer pid=1) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
(APIServer pid=1) 
(APIServer pid=1) The above exception was the direct cause of the following exception:
(APIServer pid=1) 
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 135, in __init__
(APIServer pid=1)     self.renderer = renderer = renderer_from_config(self.vllm_config)
(APIServer pid=1)                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/registry.py", line 83, in renderer_from_config
(APIServer pid=1)     tokenizer = cached_tokenizer_from_config(model_config, **kwargs)
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 227, in cached_tokenizer_from_config
(APIServer pid=1)     return cached_get_tokenizer(
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=1)     tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 108, in from_pretrained
(APIServer pid=1)     raise RuntimeError(err_msg) from e
(APIServer pid=1) RuntimeError: Failed to load the tokenizer. If the tokenizer is a custom tokenizer not yet available in the HuggingFace transformers library, consider setting `trust_remote_code=True` in LLM or using the `--trust-remote-code` flag in the CLI.

angelini92 · May 21, 2026, 5:19am

Hey @puiu.adrian! It should be already fixed. The issue was that some models use specific tokenizers that require —trust-remote-code=true when using HF. Thanks for giving feedback :)

angelini92 · May 21, 2026, 7:24am

puiu.adrian:

hey hi ,i just tried it but it failes to load the vllm model . i installed only on the dgx .. so i m intending to launch straight from the spark via tailscale .

agp@gx10-8828:~$ spark logs -f 

==========
== vLLM ==
==========

NVIDIA Release 26.04 (build 299333414)
vLLM Version 0.19.0+6bc3197f
Container image Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 13.2 driver version 595.58.03 with kernel driver version 580.159.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299] 
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0+6bc3197f.nv26.04.48680843
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299]   █▄█▀ █     █     █     █  model   RedHatAI/Qwen3.6-35B-A3B-NVFP4
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:299] 
(APIServer pid=1) INFO 05-20 20:16:36 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/Qwen3.6-35B-A3B-NVFP4', 'model': 'RedHatAI/Qwen3.6-35B-A3B-NVFP4', 'max_model_len': 32768, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.5}
(APIServer pid=1) WARNING 05-20 20:16:36 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_VERSION
(APIServer pid=1) WARNING 05-20 20:16:36 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_FLASH_ATTN_SRC_DIR
(APIServer pid=1) INFO 05-20 20:16:41 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1) INFO 05-20 20:16:41 [model.py:1678] Using max model len 32768
(APIServer pid=1) INFO 05-20 20:16:41 [config.py:281] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 05-20 20:16:41 [config.py:312] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 05-20 20:16:41 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-20 20:16:41 [compilation.py:290] Enabled custom fusions: act_quant
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 85, in from_pretrained
(APIServer pid=1)     tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1153, in from_pretrained
(APIServer pid=1)     raise ValueError(
(APIServer pid=1) ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
(APIServer pid=1) 
(APIServer pid=1) The above exception was the direct cause of the following exception:
(APIServer pid=1) 
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 135, in __init__
(APIServer pid=1)     self.renderer = renderer = renderer_from_config(self.vllm_config)
(APIServer pid=1)                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/registry.py", line 83, in renderer_from_config
(APIServer pid=1)     tokenizer = cached_tokenizer_from_config(model_config, **kwargs)
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 227, in cached_tokenizer_from_config
(APIServer pid=1)     return cached_get_tokenizer(
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/registry.py", line 210, in get_tokenizer
(APIServer pid=1)     tokenizer = tokenizer_cls_.from_pretrained(tokenizer_name, *args, **kwargs)
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/hf.py", line 108, in from_pretrained
(APIServer pid=1)     raise RuntimeError(err_msg) from e
(APIServer pid=1) RuntimeError: Failed to load the tokenizer. If the tokenizer is a custom tokenizer not yet available in the HuggingFace transformers library, consider setting `trust_remote_code=True` in LLM or using the `--trust-remote-code` flag in the CLI.

@puiu.adrian

RedHatAI/Qwen3.6-35B-A3B-NVFP4 was saved with transformers v5, which introduces a new tokenizer class (TokenizersBackend). The NGC 26.04 container ships transformers v4 and doesn’t recognize it. This will be fixed in NGC 26.05. Until then:

Edit the tokenizer config:

nano ~/.cache/huggingface/hub/models--RedHatAI--Qwen3.6-35B-A3B-NVFP4/snapshots/*/tokenizer_config.json

Find "tokenizer_class": "TokenizersBackend" and change it to:

"tokenizer_class": "PreTrainedTokenizerFast"

Run normally:

spark run RedHatAI/Qwen3.6-35B-A3B-NVFP4

Topic		Replies	Views
Spark-inference: Run 3 specialized models simultaneously on your DGX Spark — cybersecurity + coding + orchestration, 30-min setup DGX Spark / GB10 Projects jetson , llama , deepseek , nemotron	3	1364	May 11, 2026
DGX Spark + Qwen3-Next-80B: Proven Performance, But Missing Clear Path to NIM, TensorRT-LLM & Web UIs DGX Spark / GB10 cuda , nim , llama	16	4956	March 6, 2026
HOW-TO: setup-dgx-spark docker inference - A "Sane" Inference Stack for GB10 (Need Contributors!) DGX Spark / GB10 Projects docker , llama , dgx	39	2852	June 21, 2026
Managing Local LLM Orchestration DGX Spark / GB10 Projects	12	3102	April 23, 2026
Can someone please just help me set the DGX Spark up for optimal LLM use? DGX Spark / GB10 llama	11	1421	June 20, 2026
DGX Spark performance DGX Spark / GB10	49	6487	February 13, 2026
Vibe Coding with NVIDIA DGX Spark DGX Spark / GB10	39	5826	May 10, 2026
How are you planning on using your DGX spark? DGX Spark / GB10 Projects	22	3340	February 24, 2026
New pre-built vLLM Docker Images for NVIDIA DGX Spark DGX Spark / GB10	73	9461	March 27, 2026
Running a Full LLM Stack on DGX Spark GB10 (Your Application -> LiteLLM -> llama-swap -> vLLM / llama.cpp / Ollama) DGX Spark / GB10 Projects spark , jetson , llama , nemotron , openclaw	19	3981	May 28, 2026

Spark: one script CLI for setup, remote access, and LLM serving on DGX Spark

Install & setup

Serve a model

What it does differently

Available commands

Related topics