Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

Albond · April 8, 2026, 4:10pm

Update: the one-command install is live in v2.1 🚀

git clone https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4.git
cd DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
./install.sh

or, if a previous build failed:

git clone https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4.git
cd DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
./install.sh --no-cache

The manual steps in README.md are unchanged for those who prefer to walk through them.

dysect · April 8, 2026, 4:35pm

Looking good:
./bench_qwen35.sh "v2"
╔══════════════════════════════════════════════════════╗
║ Qwen3.5-122B-A10B Benchmark: v2
║ Wed Apr 8 12:33:30 PM EDT 2026
╚══════════════════════════════════════════════════════╝

── Run 1/2 ──────────────────────────────────────
[Q&A] 256 tokens in 5.00s = 51.2 tok/s (prompt: 23)
[Code] 512 tokens in 9.42s = 54.3 tok/s (prompt: 30)
[JSON] 1024 tokens in 19.13s = 53.5 tok/s (prompt: 48)
[Math] 64 tokens in 1.25s = 51.2 tok/s (prompt: 29)
[LongCode] 2048 tokens in 35.68s = 57.3 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 4.84s = 52.8 tok/s (prompt: 23)
[Code] 440 tokens in 8.11s = 54.2 tok/s (prompt: 30)
[JSON] 1024 tokens in 19.21s = 53.3 tok/s (prompt: 48)
[Math] 64 tokens in 1.28s = 50.0 tok/s (prompt: 29)
[LongCode] 2048 tokens in 35.83s = 57.1 tok/s (prompt: 37)

=== Done ===

Albond · April 8, 2026, 5:07pm

I think I got the defective DGX Spark unit: everyone in this thread is about 4 tok/s faster than me 😅 (jk, it probably just needs more coffee to warm up)

XQDev · April 8, 2026, 6:03pm

╔══════════════════════════════════════════════════════╗
║  Qwen3.5-122B-A10B Benchmark: test
║  Wed Apr  8 07:44:55 PM MSK 2026
╚══════════════════════════════════════════════════════╝

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 6.67s = 38.3 tok/s (prompt: 23)
  [Code] 512 tokens in 10.30s = 49.7 tok/s (prompt: 30)
  [JSON] 1024 tokens in 21.70s = 47.1 tok/s (prompt: 48)
  [Math] 64 tokens in 1.48s = 43.2 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 40.03s = 51.1 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 5.41s = 47.3 tok/s (prompt: 23)
  [Code] 443 tokens in 9.00s = 49.2 tok/s (prompt: 30)
  [JSON] 1024 tokens in 21.47s = 47.6 tok/s (prompt: 48)
  [Math] 64 tokens in 1.43s = 44.7 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 40.14s = 51.0 tok/s (prompt: 37)

=== Done ===

XQDev · April 8, 2026, 6:10pm

Albond, you did great work. It's really fast and works without any wrong patterns. Thank you!

xkm121 · April 8, 2026, 6:21pm

The latest eugr/spark-vllm-docker has tons of issues

The default ./build-and-copy.sh does not work; the tf5 build always fails
./build-and-copy.sh -t vllm-node-tf5 also fails
Found an issue on the git repo and tried ./build-and-copy.sh -t vllm-node-tf5 --rebuild-vllm, which finally builds. But when I launch the recipe for Intel's 122B AutoRound, I get the error "Tokenizer class TokenizersBackend does not exist or is not currently imported"
Finally worked around it by appending --tokenizer Qwen/Qwen3.5-122B-A10B to my recipe launch script

I modified Albond's bench script and tested the default Intel AutoRound 122B-A10B with MTP=2

--- Run 1 ---
Testing Q&A... 256 tokens in 7.408480869s = 34.5 tok/s
Testing Code... 491 tokens in 12.491602249s = 39.3 tok/s
Testing JSON... 252 tokens in 6.342291768s = 39.7 tok/s

--- Run 2 ---
Testing Q&A... 256 tokens in 7.260207619s = 35.2 tok/s
Testing Code... 512 tokens in 13.180478058s = 38.8 tok/s
Testing JSON... 800 tokens in 20.527533992s = 38.9 tok/s

So I assume this is working now?

Albond · April 8, 2026, 6:25pm

Did you try:

git clone https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4.git
cd DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
./install.sh --no-cache

This is the new v2.1 quick start.

xkm121 · April 8, 2026, 6:26pm

I will try that, benchmarking mtp=1 now

xkm121 · April 8, 2026, 6:51pm

Full benchmark

No mtp, no flashinfer

--- Run 1 ---
Testing Q&A... 256 tokens in 20.569007465s = 12.4 tok/s
Testing Code... 512 tokens in 18.037481175s = 28.3 tok/s
Testing JSON... 228 tokens in 8.100471390s = 28.1 tok/s

--- Run 2 ---
Testing Q&A... 256 tokens in 9.057125827s = 28.2 tok/s
Testing Code... 512 tokens in 18.021435648s = 28.4 tok/s
Testing JSON... 222 tokens in 7.870208508s = 28.2 tok/s

mtp=1, with flashinfer

--- Run 1 ---
Testing Q&A... 256 tokens in 19.571661206s = 13.0 tok/s
Testing Code... 512 tokens in 13.685070088s = 37.4 tok/s
Testing JSON... 220 tokens in 5.924777515s = 37.1 tok/s

--- Run 2 ---
Testing Q&A... 256 tokens in 7.342920863s = 34.8 tok/s
Testing Code... 512 tokens in 13.807468015s = 37.0 tok/s
Testing JSON... 222 tokens in 5.985807452s = 37.0 tok/s

mtp=2, with flashinfer

--- Run 1 ---
Testing Q&A... 256 tokens in 7.408480869s = 34.5 tok/s
Testing Code... 491 tokens in 12.491602249s = 39.3 tok/s
Testing JSON... 252 tokens in 6.342291768s = 39.7 tok/s

--- Run 2 ---
Testing Q&A... 256 tokens in 7.260207619s = 35.2 tok/s
Testing Code... 512 tokens in 13.180478058s = 38.8 tok/s
Testing JSON... 800 tokens in 20.527533992s = 38.9 tok/s
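
For anyone comparing the configurations above: each figure is just generated tokens divided by wall-clock seconds. Below is a minimal sketch of that arithmetic, assuming `awk` is available. The helper names `tok_per_s` and `speedup` are made up for illustration and are not part of the bench script. It rounds to one decimal, so the last digit can differ from a posted number if the script truncates instead.

```shell
# Hypothetical helpers for checking benchmark arithmetic (not part of bench_qwen35.sh).

# tok_per_s TOKENS SECONDS -> throughput, one decimal place
tok_per_s() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

# speedup BASELINE CANDIDATE -> ratio of two throughputs, two decimal places
speedup() {
  awk -v b="$1" -v c="$2" 'BEGIN { printf "%.2f\n", c / b }'
}

# Examples with figures from this thread:
#   tok_per_s 256 5.00    # a Q&A run: 256 tokens in 5.00 s
#   speedup 28.4 38.8     # Code, run 2: no MTP vs. mtp=2 with FlashInfer
```

With those inputs, the first call prints 51.2 and the second prints 1.37, i.e. roughly a 37% gain on that test from MTP plus FlashInfer.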

Thx for getting this figured out Albond! You definitely earned a git star from me!

djordjestojanovic1992 · April 8, 2026, 6:59pm

Sorry @Albond for being so annoying… Would you run this on the Spark directly, in a Docker container, or something else? What's the best practice right now for running LLMs on the Spark?
Thanks for already answering two questions of mine today - I am really trying to learn but there are so many different opinions and my Spark is coming in 2 days so I had no time to try any out, just want to be prepared to run it on Day 1 :)

When I get it and set it up, if you ever need a test dummy for something I definitely would be willing to help if I am even able to help with anything.

Albond · April 8, 2026, 7:15pm

No worries at all.
Docker, always. The SM121 build chain (patched vLLM + torch nightly + FlashInfer + NCCL) is too fragile to maintain directly on your host. Docker isolates it, and with --net=host --ipc=host --gpus all there’s no networking overhead — effectively native speed.

You can start with something simple and work up: Ollama → llama.cpp → eventually vLLM (the most complex). Ollama is great for learning with Llama 3 / Mistral / Qwen2.5. The next level is llama.cpp, and the final, advanced step is vLLM. Install Docker + Open WebUI for a ChatGPT-like UI; in Admin there's a section to add a custom LLM, e.g. http://:/v1.
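
Before adding a server to Open WebUI, it can help to confirm that the endpoint answers. The following is only a sketch: `base_url` and `list_models` are hypothetical helper names, and 127.0.0.1 with port 8000 is just the address the install script in this thread polls, so substitute your own host and port. `/v1/models` is the model-listing route that OpenAI-compatible servers such as vLLM expose.

```shell
# Hypothetical helpers for probing an OpenAI-compatible endpoint.

# base_url HOST PORT -> base URL to paste into the UI
base_url() {
  printf 'http://%s:%s/v1\n' "$1" "$2"
}

# list_models HOST PORT -> raw JSON from /v1/models; non-zero exit if unreachable
list_models() {
  curl --silent --show-error --fail --max-time 5 "$(base_url "$1" "$2")/models"
}

# Example (assumes a server is already listening there):
#   list_models 127.0.0.1 8000
```

If `list_models` returns JSON that names your model, the same base URL should work as the custom connection in the UI.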

Or try ./install.sh from the DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 repo — it’s the fastest way to a working Qwen3.5-122B setup.

PS: DGX Spark is still a “dev-on-desk” tool. Don’t expect 1 PFLOPs like in the marketing 🙂

stuckwi · April 8, 2026, 7:21pm

I did this. It built fine, but when I launch the model using vLLM, I'm getting this error:

Launching vllm-qwen35…
5af107798767cd10a3ff1f87520386daff648a5f37f415e30b2e0207bcb61feb
[ ok ] container started in background as ‘vllm-qwen35’
model loading takes ~13m22s on first run (cached re-launch: ~5-7 min)
polling http://127.0.0.1:8000/health every 5 sec — Ctrl-C to detach (container keeps running)

[ 4%] ░░░░░░░░░░░░░░░░░░░░░░░░ 35s / ~13m22s initializing (entrypoint, vLLM CLI)
[err ] container ‘vllm-qwen35’ has died — last 30 log lines:
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Albond · April 8, 2026, 7:57pm

Could you share the part of the log with the error:

docker logs vllm-qwen35 2>&1 | grep -B 2 -A 30 'EngineCore.*Error\|EngineCore.*Traceback' | head -80

or save the full container log to a file and look for the error there:

docker logs vllm-qwen35 > /tmp/vllm-crash.log 2>&1 && wc -l /tmp/vllm-crash.log

The cause isn't clear from this truncated log.
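
The APIServer traceback only reports that the engine core failed; the root cause sits on the `(EngineCore ...)` lines earlier in the log. Here is a rough sketch for pulling the exception lines out of a saved log, assuming the log uses an `(EngineCore pid=N)` style prefix. `engine_errors` is a hypothetical helper name, not something the repo ships.

```shell
# Hypothetical helper: show only exception lines from the engine core process.

# engine_errors LOGFILE -> lines like "(EngineCore pid=N) SomeError: message"
engine_errors() {
  grep -E '^\(EngineCore[^)]*\).*(Error|Exception):' "$1"
}

# Example:
#   docker logs vllm-qwen35 > /tmp/vllm-crash.log 2>&1
#   engine_errors /tmp/vllm-crash.log
```

The trailing colon in the pattern skips `raise ValueError(` source lines and keeps only the lines that carry the actual message.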

stuckwi · April 8, 2026, 7:59pm

Thank you for responding. I re-ran ./install.sh; it went directly to loading the model and is serving it now. Thank you

carlos.albarran.mx · April 8, 2026, 8:19pm

Got this error… when loading the image.
Any suggestions??

[ 4%] ░░░░░░░░░░░░░░░░░░░░░░░░ 35s / ~13m22s initializing (entrypoint, vLLM CLI)
[err ] container ‘vllm-qwen35’ has died — last 30 log lines:
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

stuckwi · April 8, 2026, 8:21pm

Looks just like my log above. When I re-ran ./install.sh it loaded up.

carlos.albarran.mx · April 8, 2026, 8:25pm

Seems that it's a memory error:

(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 413, in request_memory
(EngineCore pid=152) raise ValueError(
(EngineCore pid=152) ValueError: Free memory on device cuda:0 (104.37/119.7 GiB) on startup is less than desired GPU memory utilization (0.9, 107.73 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.

However, when I ran ./install.sh once more, it started over from scratch… 2 hours left again

dysect · April 8, 2026, 8:26pm

It looks like you are running more than needed. What is consuming ~20 GB on your device?

Albond · April 8, 2026, 8:35pm

I've updated install.sh with diagnostics that include memory information. But yes, your DGX Spark should not be running any other LLM; this is a 122B model that needs nearly all the RAM.

Please update the project from GitHub and run v2.2:

./install.sh

The error is exactly a GPU memory shortage: something else is holding ~15 GiB on your Spark, so vLLM can't get its requested 107.73 GiB (0.9 × 119.7 GiB total). On DGX Spark the GPU and CPU share the same 128 GB, so a running desktop/browser/IDE is enough to eat this.

If you can't free the 15 GiB, stop the current vllm-qwen35 container and re-launch with --gpu-memory-utilization 0.80 instead of 0.90. That drops vLLM's request to ~96 GiB, which should fit in the 104.37 GiB that is free:

docker rm -f vllm-qwen35

docker run -d --name vllm-qwen35 \
   --gpus all --net=host --ipc=host \
   -v ~/models:/models \
   vllm-qwen35-v2 \
   serve /models/qwen35-122b-hybrid-int4fp8 \
   --served-model-name qwen --port 8000 \
   --max-model-len 262144 \
   --gpu-memory-utilization 0.80 \
   --reasoning-parser qwen3 \
   --attention-backend FLASHINFER \
   --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

I haven't tested this, though; freeing the memory is the better option. With less memory available, the long context may take the container down with an OOM error.
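
The numbers in the error message quoted earlier in the thread can be checked by hand: vLLM asks for utilization × total device memory and refuses to start when less than that is free. Here is a small sketch of the arithmetic, assuming `awk` is available. `requested_gib` and `max_utilization` are hypothetical helper names, not vLLM or install.sh commands.

```shell
# Hypothetical helpers for the --gpu-memory-utilization arithmetic.

# requested_gib UTILIZATION TOTAL_GIB -> memory vLLM will try to reserve
requested_gib() {
  awk -v u="$1" -v t="$2" 'BEGIN { printf "%.2f\n", u * t }'
}

# max_utilization FREE_GIB TOTAL_GIB -> largest two-decimal fraction that still fits
max_utilization() {
  awk -v f="$1" -v t="$2" 'BEGIN { printf "%.2f\n", int(f / t * 100) / 100 }'
}

# Figures from the error message (104.37 GiB free of 119.7 GiB):
#   requested_gib 0.90 119.7       # what the default setting asks for
#   requested_gib 0.80 119.7       # what the lowered setting asks for
#   max_utilization 104.37 119.7   # highest setting that fits right now
```

With those inputs the three calls print 107.73, 95.76 and 0.87, so 0.80 leaves some headroom below what is actually free.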

stuckwi · April 8, 2026, 8:43pm

To stop install.sh from creating a new snapshot location, I found the snapshot location from my first run and modified this step:
# ── Step 0: hf download ───────────────────────────────────────────────────────

step_begin "Step 0 — Using specific Intel/Qwen3.5 snapshot"
note "forcing use of known good snapshot bfac534…"

# this is the snapshot folder that the first run saved to
INTEL_DIR="/home/hsien/.cache/huggingface/hub/models--Intel--Qwen3.5-122B-A10B-int4-AutoRound/snapshots/bfac534d4d8742dd15e46f7efdf73336b0213970"

if [ ! -d "$INTEL_DIR" ]; then
    abort "Specified INTEL_DIR not found. Please ensure the snapshot exists at: $INTEL_DIR"
fi

note "INTEL_DIR=${INTEL_DIR}"
step_end
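
To find the snapshot folder to pin, the Hugging Face cache can be listed directly: a repo `org/name` is stored under `models--org--name`, with one subfolder per revision in `snapshots/`. Here is a minimal sketch. `snapshot_dirs` is a hypothetical helper, not something install.sh provides, and the hub path in the example is the usual default, which may differ on your machine.

```shell
# Hypothetical helper: list cached snapshot folders for a Hugging Face repo.

# snapshot_dirs HUB_DIR REPO_ID -> one line per snapshot directory, sorted
snapshot_dirs() (
  hub_dir="$1"
  # "Intel/Some-Model" is cached as "models--Intel--Some-Model"
  repo_dir="models--$(printf '%s' "$2" | sed 's|/|--|g')"
  find "$hub_dir/$repo_dir/snapshots" -mindepth 1 -maxdepth 1 -type d | sort
)

# Example:
#   snapshot_dirs "$HOME/.cache/huggingface/hub" Intel/Qwen3.5-122B-A10B-int4-AutoRound
```

If more than one folder is listed, pick the one that holds the complete download from the run you want to reuse.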

I also had to add these 2 lines to LAUNCH_CMD to make it work with Claude Code:
--enable-auto-tool-choice \\
--tool-call-parser qwen3_coder \\
