The latest eugr/spark-vllm-docker has a lot of issues.
The default ./build-and-copy.sh does not work; the tf5 build always fails.
./build-and-copy.sh -t vllm-node-tf5 also fails.
I found a matching issue in the git repo and tried ./build-and-copy.sh -t vllm-node-tf5 --rebuild-vllm, which finally built. But when I launched the recipe for Intel’s 122B AutoRound, I got the error “Tokenizer class TokenizersBackend does not exist or is not currently imported”.
I finally figured it out: appending --tokenizer Qwen/Qwen3.5-122B-A10B to my recipe launch script fixed it.
I modified Albond’s bench script and tested the default Intel AutoRound 122B-A10B with MCP=2:
— Run 1 —
Testing Q&A… 256 tokens in 7.408480869s = 34.5 tok/s
Testing Code… 491 tokens in 12.491602249s = 39.3 tok/s
Testing JSON… 252 tokens in 6.342291768s = 39.7 tok/s
— Run 2 —
Testing Q&A… 256 tokens in 7.260207619s = 35.2 tok/s
Testing Code… 512 tokens in 13.180478058s = 38.8 tok/s
Testing JSON… 800 tokens in 20.527533992s = 38.9 tok/s
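For anyone comparing against their own runs: the tok/s column is just tokens divided by wall-clock seconds. Here is a minimal sketch of that arithmetic. It assumes the bench script truncates to one decimal rather than rounding, which is what the figures above suggest; `tok_per_sec` is a made-up helper name, not something from Albond’s script.

```shell
#!/bin/sh
# tok_per_sec TOKENS SECONDS
# Prints tokens/second truncated (not rounded) to one decimal place.
# Hypothetical helper for illustration; not part of the original bench script.
tok_per_sec() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", int(t / s * 10) / 10 }'
}

tok_per_sec 256 7.408480869    # Run 1, Q&A
tok_per_sec 800 20.527533992   # Run 2, JSON
```

With plain rounding the first Q&A line would come out one tenth higher, so small differences between tools are not necessarily real.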
Sorry @Albond for being so annoying… Would you run this on the Spark directly, in a Docker container, or some other way? What’s the current best practice for running LLMs on the Spark?
Thanks for already answering two of my questions today. I am really trying to learn, but there are so many different opinions, and my Spark arrives in 2 days, so I haven’t had a chance to try anything out. I just want to be prepared to run it on Day 1 :)
Once I have it set up, if you ever need a test dummy for something, I’d definitely be willing to help wherever I can.
No worries at all.
Docker, always. The SM121 build chain (patched vLLM + torch nightly + FlashInfer + NCCL) is too fragile to maintain directly on your host. Docker isolates it, and with --net=host --ipc=host --gpus all there’s no networking overhead — effectively native speed.
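For reference, a bare-bones launch with those three flags might look like the sketch below. Treat everything except the flags as an assumption: the image name is only a placeholder borrowed from the build target mentioned earlier in the thread, the container name is made up, and whether your image accepts a `vllm serve` command line depends on how it was built. The function only prints the command so you can review it before running anything.

```shell
#!/bin/sh
# Prints (does not run) a docker command using the flags from the post above.
# IMAGE and MODEL are placeholders; substitute whatever you actually built.
IMAGE="${IMAGE:-vllm-node-tf5}"
MODEL="${MODEL:-Intel/Qwen3.5-122B-A10B-int4-AutoRound}"

spark_docker_cmd() {
    # --net=host : container shares the host network stack (no port mapping)
    # --ipc=host : container shares host shared memory
    # --gpus all : expose the GPU to the container
    echo "docker run -d --name vllm-test" \
         "--net=host --ipc=host --gpus all" \
         "$IMAGE vllm serve $MODEL --port 8000"
}

spark_docker_cmd
```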
You can start with something simple and work up: Ollama → llama.cpp → and eventually vLLM (the most complex). Ollama is great for learning with Llama 3 / Mistral / Qwen2.5, llama.cpp is the next level, and vLLM is the advanced option. Install Docker + Open WebUI for a ChatGPT-like UI; in Admin there’s a section to add a custom LLM, e.g. http://<host>:<port>/v1.
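The URL you paste into Open WebUI is the server’s OpenAI-compatible base URL. A small sketch for building and checking it, assuming the server listens on port 8000 (the port the launcher later in this thread polls); `openai_base_url` is a hypothetical helper name:

```shell
#!/bin/sh
# openai_base_url HOST PORT -> prints the OpenAI-compatible base URL
# Hypothetical helper for illustration.
openai_base_url() {
    echo "http://$1:$2/v1"
}

BASE_URL="$(openai_base_url 127.0.0.1 8000)"
echo "$BASE_URL"

# With a server running, listing the served models is a quick sanity check:
#   curl -s "$BASE_URL/models"
```

If Open WebUI itself runs in a container without host networking, 127.0.0.1 points at that container, so use the Spark’s LAN address instead.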
Or try ./install.sh from the DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 repo — it’s the fastest way to a working Qwen3.5-122B setup.
PS: DGX Spark is still a “dev-on-desk” tool. Don’t expect 1 PFLOPs like in the marketing 🙂
I did this. It built fine, but when I launch the model with vLLM, I’m getting this error:
Launching vllm-qwen35…
5af107798767cd10a3ff1f87520386daff648a5f37f415e30b2e0207bcb61feb
[ ok ] container started in background as ‘vllm-qwen35’
model loading takes ~13m22s on first run (cached re-launch: ~5-7 min)
polling http://127.0.0.1:8000/health every 5 sec — Ctrl-C to detach (container keeps running)
[ 4%] ░░░░░░░░░░░░░░░░░░░░░░░░ 35s / ~13m22s initializing (entrypoint, vLLM CLI)
[err ] container ‘vllm-qwen35’ has died — last 30 log lines:
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File “/usr/lib/python3.12/contextlib.py”, line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Seems that it’s a memory error:
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py”, line 413, in request_memory
(EngineCore pid=152) raise ValueError(
(EngineCore pid=152) ValueError: Free memory on device cuda:0 (104.37/119.7 GiB) on startup is less than desired GPU memory utilization (0.9, 107.73 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
However, when I ran ./install.sh once more, it started all over again… 2 hours remaining again.
I’ve updated install.sh with diagnostics that include memory information. But yes, your DGX Spark should not be running any other LLM; this 122B model needs nearly all of the RAM.
Please pull the latest version (v2.2) from GitHub and run:
./install.sh
The error is exactly a GPU memory shortage: another process is holding ~15 GiB on your Spark, so vLLM can’t get its requested 107.73 GiB (0.9 × 119.7 total). On DGX Spark the GPU and CPU share the same 128 GB, so a running desktop/browser/IDE is enough to eat this.
If you can’t free the 15 GiB, stop the current vllm-qwen35 container and re-launch with --gpu-memory-utilization 0.80 instead of 0.90. That drops vLLM’s request to ~96 GiB (0.80 × 119.7), which should fit.
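The numbers are easy to check by hand. A minimal sketch of the arithmetic, using the total and free figures from the ValueError above; `vllm_mem_check` is a hypothetical helper, not part of vLLM or install.sh:

```shell
#!/bin/sh
# vllm_mem_check UTIL TOTAL_GIB FREE_GIB
# Prints how much memory vLLM would request (UTIL x TOTAL) and whether that
# fits in the memory that was free at startup.
# Hypothetical helper for illustration.
vllm_mem_check() {
    awk -v u="$1" -v total="$2" -v free="$3" 'BEGIN {
        want = u * total
        printf "%.2f GiB requested, %s\n", want, (want <= free ? "fits" : "does not fit")
    }'
}

vllm_mem_check 0.90 119.7 104.37   # the failing launch
vllm_mem_check 0.80 119.7 104.37   # the suggested setting
```

On paper anything up to about 0.87 would fit into the 104.37 GiB that was free, but 0.80 leaves some headroom in case the other process grows.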
To keep install.sh from creating a new snapshot location, I found the snapshot location from my first run and modified this step:
# ── Step 0: hf download ───────────────────────────────────────────────────────
step_begin "Step 0 — Using specific Intel/Qwen3.5 snapshot"
note "forcing use of known good snapshot bfac534…"
# This is the snapshot folder that the first run saved to.
INTEL_DIR="/home/hsien/.cache/huggingface/hub/models--Intel--Qwen3.5-122B-A10B-int4-AutoRound/snapshots/bfac534d4d8742dd15e46f7efdf73336b0213970"
if [ ! -d "$INTEL_DIR" ]; then
    abort "Specified INTEL_DIR not found. Please ensure the snapshot exists at: $INTEL_DIR"
fi
note "INTEL_DIR=${INTEL_DIR}"
step_end
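If you would rather not hard-code the hash, the snapshot folder can be looked up instead. This is only a sketch: it assumes the default Hugging Face cache layout (`$HF_HOME/hub/models--<org>--<name>/snapshots/<commit>`), and `newest_snapshot` is a made-up helper name, not something install.sh provides.

```shell
#!/bin/sh
# newest_snapshot MODEL_CACHE_DIR
# Prints the most recently modified directory under MODEL_CACHE_DIR/snapshots,
# or returns non-zero if there is none.
# Hypothetical helper for illustration.
newest_snapshot() {
    snap="$(ls -1dt "$1"/snapshots/*/ 2>/dev/null | head -n 1)"
    [ -n "$snap" ] || return 1
    echo "${snap%/}"   # strip the trailing slash left by the glob
}

HF_HUB="${HF_HOME:-$HOME/.cache/huggingface}/hub"
newest_snapshot "$HF_HUB/models--Intel--Qwen3.5-122B-A10B-int4-AutoRound" \
    || echo "no snapshot found yet" >&2
```

Its output could then be assigned to INTEL_DIR in place of the fixed path, with the same existence check kept afterwards.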
I also had to add these 2 lines to LAUNCH_CMD to make it work with Claude Code:
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
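Both flags need to go in together, since vLLM expects a parser to be named whenever automatic tool choice is enabled. A rough sketch of where they sit and a quick way to confirm an edit took effect; the LAUNCH_CMD below is a stand-in I made up for illustration, not the real one from install.sh, and `has_tool_flags` is a hypothetical helper:

```shell
#!/bin/sh
# Sketch only: this LAUNCH_CMD is a stand-in, not the real one from install.sh.
LAUNCH_CMD="vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder"

# has_tool_flags CMD -> succeeds only if both tool-calling flags are present
has_tool_flags() {
    case "$1" in *--enable-auto-tool-choice*) ;; *) return 1 ;; esac
    case "$1" in *"--tool-call-parser "*) ;; *) return 1 ;; esac
}

has_tool_flags "$LAUNCH_CMD" && echo "tool calling flags present"
```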