Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

Update: one-command install is live v2.1 🚀

git clone https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4.git
cd DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
./install.sh

or, if a previous build failed:

git clone https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4.git
cd DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
./install.sh --no-cache

Manual steps in README.md are unchanged for those who prefer to walk through it.

Looking good:
./bench_qwen35.sh “v2”
╔══════════════════════════════════════════════════════╗
║ Qwen3.5-122B-A10B Benchmark: v2
║ Wed Apr 8 12:33:30 PM EDT 2026
╚══════════════════════════════════════════════════════╝

── Run 1/2 ──────────────────────────────────────
[Q&A] 256 tokens in 5.00s = 51.2 tok/s (prompt: 23)

[Code] 512 tokens in 9.42s = 54.3 tok/s (prompt: 30)
[JSON] 1024 tokens in 19.13s = 53.5 tok/s (prompt: 48)
[Math] 64 tokens in 1.25s = 51.2 tok/s (prompt: 29)
[LongCode] 2048 tokens in 35.68s = 57.3 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 4.84s = 52.8 tok/s (prompt: 23)

[Code] 440 tokens in 8.11s = 54.2 tok/s (prompt: 30)
[JSON] 1024 tokens in 19.21s = 53.3 tok/s (prompt: 48)
[Math] 64 tokens in 1.28s = 50.0 tok/s (prompt: 29)
[LongCode] 2048 tokens in 35.83s = 57.1 tok/s (prompt: 37)

=== Done ===

I think I got the defective DGX Spark unit — everyone in this thread is faster +4 tok/s than me 😅 (jk, probably just needs more coffee to warm up)

╔══════════════════════════════════════════════════════╗
║  Qwen3.5-122B-A10B Benchmark: test
║  Wed Apr  8 07:44:55 PM MSK 2026
╚══════════════════════════════════════════════════════╝

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 6.67s = 38.3 tok/s (prompt: 23)
  [Code] 512 tokens in 10.30s = 49.7 tok/s (prompt: 30)
  [JSON] 1024 tokens in 21.70s = 47.1 tok/s (prompt: 48)
  [Math] 64 tokens in 1.48s = 43.2 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 40.03s = 51.1 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 5.41s = 47.3 tok/s (prompt: 23)
  [Code] 443 tokens in 9.00s = 49.2 tok/s (prompt: 30)
  [JSON] 1024 tokens in 21.47s = 47.6 tok/s (prompt: 48)
  [Math] 64 tokens in 1.43s = 44.7 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 40.14s = 51.0 tok/s (prompt: 37)

=== Done ===

Albond you made a great work. it’s realy fast and works without any wring patterns. thank you!

The latest eugr/spark-vllm-docker has tons of issues

  1. Default ./build-and-copy.sh does not work, the tf5 always fails

  2. ./build-and-copy.sh -t vllm-node-tf5 also fails

  3. Found issue from the git repo, tried ./build-and-copy.sh -t vllm-node-tf5 --rebuild-vllm, it finally builds. But when I launch the receipt for the Intel’s 122B autoround, I got error “Tokenizer class TokenizersBackend does not exist or is not currently imported”

  4. Finally got around to figure it out by appending --tokenizer Qwen/Qwen3.5-122B-A10B to my receipt launch script

I modified Albond’s bench script and tested default Intel AutoRound 122BA10B with MCP=2

— Run 1 —
Testing Q&A… 256 tokens in 7.408480869s = 34.5 tok/s
Testing Code… 491 tokens in 12.491602249s = 39.3 tok/s
Testing JSON… 252 tokens in 6.342291768s = 39.7 tok/s

— Run 2 —
Testing Q&A… 256 tokens in 7.260207619s = 35.2 tok/s
Testing Code… 512 tokens in 13.180478058s = 38.8 tok/s
Testing JSON… 800 tokens in 20.527533992s = 38.9 tok/

So I assume this is working now?

Did you try:

git clone https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4.git
cd DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
./install.sh --no-cache

This is new quick start v2.1.

I will try that, benchmarking mtp=1 now

Full benchmark

No mtp, no flashinfer

--- Run 1 ---
Testing Q&A... 256 tokens in 20.569007465s = 12.4 tok/s
Testing Code... 512 tokens in 18.037481175s = 28.3 tok/s
Testing JSON... 228 tokens in 8.100471390s = 28.1 tok/s

--- Run 2 ---
Testing Q&A... 256 tokens in 9.057125827s = 28.2 tok/s
Testing Code... 512 tokens in 18.021435648s = 28.4 tok/s
Testing JSON... 222 tokens in 7.870208508s = 28.2 tok/s

mtp=1, with flashinfer

--- Run 1 ---
Testing Q&A... 256 tokens in 19.571661206s = 13.0 tok/s
Testing Code... 512 tokens in 13.685070088s = 37.4 tok/s
Testing JSON... 220 tokens in 5.924777515s = 37.1 tok/s

--- Run 2 ---
Testing Q&A... 256 tokens in 7.342920863s = 34.8 tok/s
Testing Code... 512 tokens in 13.807468015s = 37.0 tok/s
Testing JSON... 222 tokens in 5.985807452s = 37.0 tok/s

mtp=2, with flashinfer

--- Run 1 ---
Testing Q&A... 256 tokens in 7.408480869s = 34.5 tok/s
Testing Code... 491 tokens in 12.491602249s = 39.3 tok/s
Testing JSON... 252 tokens in 6.342291768s = 39.7 tok/s

--- Run 2 ---
Testing Q&A... 256 tokens in 7.260207619s = 35.2 tok/s
Testing Code... 512 tokens in 13.180478058s = 38.8 tok/s
Testing JSON... 800 tokens in 20.527533992s = 38.9 tok/s

Thx for getting this figured out Albond! You definitely earned a git star from me!

Sorry @Albond for being so annoying… Would you run this on the Spark directly or in a Docker Container or something else? Whats the best practice as of now for running LLMs on the Spark?
Thanks for already answering two questions of mine today - I am really trying to learn but there are so many different opinions and my Spark is coming in 2 days so I had no time to try any out, just want to be prepared to run it on Day 1 :)

When I get it and set it up, if you ever need a test dummy for something I definitely would be willing to help if I am even able to help with anything.

No worries at all.
Docker, always. The SM121 build chain (patched vLLM + torch nightly + FlashInfer + NCCL) is too fragile to maintain directly on your host. Docker isolates it, and with --net=host --ipc=host --gpus all there’s no networking overhead — effectively native speed.

You can start from something simple: Ollama → llama.cpp → and eventually vLLM (most complex). Ollama’s great for learning with Llama 3 / Mistral / Qwen2.5. Next level is llama.cpp. And in final Advanced - vLLM. Install Docker + Open WebUI for a ChatGPT-like UI — in Admin there’s a section to add a custom LLM, e.g. http://:/v1.

Or try ./install.sh from the DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 repo — it’s the fastest way to a working Qwen3.5-122B setup.

PS: DGX Spark is still a “dev-on-desk” tool. Don’t expect 1 PFLOPs like in the marketing 🙂

I did this, It built fine, but when I launch the model using vllm, I"m getting this error:

Launching vllm-qwen35…
5af107798767cd10a3ff1f87520386daff648a5f37f415e30b2e0207bcb61feb
[ ok ] container started in background as ‘vllm-qwen35’
model loading takes ~13m22s on first run (cached re-launch: ~5-7 min)
polling http://127.0.0.1:8000/health every 5 sec — Ctrl-C to detach (container keeps running)

[ 4%] ░░░░░░░░░░░░░░░░░░░░░░░░ 35s / ~13m22s initializing (entrypoint, vLLM CLI)
[err ] container ‘vllm-qwen35’ has died — last 30 log lines:
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 154, in init
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 887, in init
(APIServer pid=1) super().init(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 535, in init
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File “/usr/lib/python3.12/contextlib.py”, line 144, in exit
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Could you share log part with error:

docker logs vllm-qwen35 2>&1 | grep -B 2 -A 30 'EngineCore.*Error\|EngineCore.*Traceback' | head -80

or find error in all docker file:

docker logs vllm-qwen35 2>&1 > /tmp/vllm-crash.log && wc -l /tmp/vllm-crash.log

Not clear from this cut log.

Thank you for responding. I re-ran the ./install, it went directly to loading the model, and is severing it now. Thank you

Got this error… when loading the image.
Any suggestions??

[ 4%] ░░░░░░░░░░░░░░░░░░░░░░░░ 35s / ~13m22s initializing (entrypoint, vLLM CLI)
[err ] container ‘vllm-qwen35’ has died — last 30 log lines:
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py”, line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 154, in init
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 887, in init
(APIServer pid=1) super().init(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 535, in init
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File “/usr/lib/python3.12/contextlib.py”, line 144, in exit
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py”, line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Looks just like my log above. When I re-ran ./install.sh it loaded up.

Seems that its a memory error: (EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py”, line 413, in request_memory
(EngineCore pid=152) raise ValueError(
(EngineCore pid=152) ValueError: Free memory on device cuda:0 (104.37/119.7 GiB) on startup is less than desired GPU memory utilization (0.9, 107.73 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.

However when I ran once again with ./install.sh it start over again… 2 hours left again

this seems like you are running more then needed. like what i consuming ~20GB on your device?

I’ve updated install.sh with diagnostic include memory information. But yes, your DGX Spark should be free from other LLM running, this is 122b model for all RAM.

Please update project from github and run with v2.2:

./install.sh

The error is exactly a GPU memory shortage — another process is holding ~15 GB on your Spark, so vLLM can’t get its requested 107.73 GB (0.9 × 119.7 total). On DGX Spark GPU and CPU share the same 128 GB, so a running desktop/browser/IDE is enough to eat this.

If you can’t free the 15 GB, stop the current vllm-qwen35 container and re-launch with --gpu-memory-utilization 0.80 instead of 0.90. That drops vLLM’s request to ~96 GB which should fit:

docker rm -f vllm-qwen35

docker run -d --name vllm-qwen35 \
   --gpus all --net=host --ipc=host \
   -v ~/models:/models \
   vllm-qwen35-v2 \
   serve /models/qwen35-122b-hybrid-int4fp8 \
   --served-model-name qwen --port 8000 \
   --max-model-len 262144 \
   --gpu-memory-utilization 0.80 \
   --reasoning-parser qwen3 \
   --attention-backend FLASHINFER \
   --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

But not tested this … better free memory. Context can be drop down docker with OOM error.

To avoid the install.sh from creating a new snapshot location, I found my snapshot location from the first run and modified the this step:
── Step 0: hf download ───────────────────────────────────────────────────────

step_begin “Step 0 — Using specific Intel/Qwen3.5 snapshot”
“forcing use of known good snapshot bfac534…”

INTEL_DIR=“/home/hsien/.cache/huggingface/hub/models–Intel–Qwen3.5-122B-A10B-int4-AutoRound/snapshots/bfac534d4d8742dd15e46f7efdf73336b0213970” ←this is the snapshot folder that the first run saved to…

if [ ! -d “$INTEL_DIR” ]; then
abort “Specified INTEL_DIR not found. Please ensure the snapshot exists at: $INTEL_DIR”
fi

note “INTEL_DIR=${INTEL_DIR}”
step_end

Also had to add to the LAUNCH_CMD these 2 lines to work with claude code
–enable-auto-tool-choice \\
–tool-call-parser qwen3_coder \\