Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

is there any vllm flags to start in multimodal mode.
I am using Open WebUI for testing the model and it doesn’t allow to upload the videos \ images for communication with this hybrid model.
can’t find any setting in open webui interface to switch on sending images. and deepseek and chat.qwen.ai tell me that I should switch on image support with vllm.
can you help with this issue please?

Thanks a lot! Your install script works perfectly!

What would we do without eugr and Albond? :-)

Qwen3.5 122b multimodal and work only on multimodal mode (I tried to run in text mode only and get error) by default.
I tested right now (curl and Open WebUI), send image and get correct describe for picture. So probably something wrong in Open WebUI version or params.
Qwen3.5-VL technically supports video input, but Open WebUI’s chat UI doesn’t have a video upload widget yet … I guess.

UI-flag in Open WebUI → Admin Panel → Settings → Models → click on “qwen” → and in Capabilities section → Vision ☑.

Doing some other testing, and figured out why not put this on into the loop for reference data

Qwen3.5-35B-A3B on single DGX Spark (GB10, SM121) — Full optimization ladder

Using albond’s GitHub - albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4: Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%) · GitHub pipeline adapted for the 35B-A3B model. vllm-sm121 image pinned to v0.19.0+tf5.

Model: Intel/Qwen3.5-35B-A3B-int4-AutoRound (21 GB)

  ┌────────┬─────────────────────────────────┬────────────────┬──────────┬───────────────────────────────────────────────────────────┐
  │  Step  │             Config              │ tok/s (decode) │ vs BF16  │                           Notes                           │
  ├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ —      │ BF16 --enforce-eager            │      27.7      │   -10%   │ No CUDA Graphs, profiling mode                            │
  ├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ —      │ BF16 + CUDA Graphs + FlashInfer │      30.7      │ baseline │ --kv-cache-dtype fp8                                      │
  ├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ 0      │ INT4 AutoRound baseline         │      65.4      │  +113%   │ Marlin GPTQ, auto attention                               │
  ├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ 0      │ INT4 + FlashInfer               │      66.8      │  +118%   │ ~2% edge from FlashInfer                                  │
  ├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ 1      │ + Hybrid INT4+FP8 dense         │      66.3      │  +116%   │ Marginal on 35B (dense layers tiny in MoE)                │
  ├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ 2+3    │ + MTP-2 + INT8 LM Head v2       │    113–127     │  +310%   │ Code/JSON: 127, Trading: 107                              │
  ├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
  │ 2+3+TQ │ + TurboQuant35 KV cache         │     56–83      │  +130%   │ Slower — Triton attn overhead, model fits fine without TQ │
  └────────┴─────────────────────────────────┴────────────────┴──────────┴───────────────────────────────────────────────────────────┘

Peak: 127 tok/s (code generation, run 2) — 4.1× BF16 baseline on a single Spark.

Benchmark script runs 5 tasks × 2 runs: Q&A (256 tok), Code (512), Trading analysis (512), Math (64), Long code (2048).

I run a small model on my laptop, so no need to use the DGX Spark for 35B. I mentioned ~122 tok/s because when I’m reviewing patches I do many restarts and debug cycles, so a small fast model is much better for each iteration. But in reality I don’t like 35B models — in my experience I see wrong answers more often than with 122B.
But overall it might be useful for someone with simpler tasks, a smaller model, or higher parallel request load.
Good review 👍!

Just wanted to say I was able to use these instructions to get 122b up and running at 50tps+ with no issues so far in my testing. I also added fastsafetensors, auto tool choice, and qwen3 tool parser to the serve command. It has been running in my harness with no noticeable issues so far.

Thanks @Albond ! Great job!

This is working great for me, too, thank you @Albond . I think next on my list is to try to use a later spark-vllm-docker that hopefully supports DFlash. Reference: Qwen3.5 27B optimisation thread starting at 30+ t/s TP=1

Now we all wait and hope that DFlash soon finally releases the drafter model for qwen3.5-122B :D

I have time right now to look into Block Diffusion … let me check.

any reason youre not using qwenxml instead of qwen coder?

Not sure what QwenXML is — got a link?

I went with Qwen3.5-122B-A10B because it’s currently the strongest model that fits on a single DGX Spark. The patches aren’t specific to it though — they should work with any Qwen MoE model, and everything is on GitHub if you want to port them.

I think i misread @mangosq comment about the tool parser he was using. qwen3xml is working a lot better then qwen3 coder, at least for claude code.

yeah. it was switched on.
I understood the problem
the problem was that Open WebUI don’t understand PNG files and wants JPG files.
your hybrid model works with images very fast! I am suprised

Ok, I see reason of long term release for DFlash Qwen3.5-122B-A10B - DeltaNet hybrid and I see some other troubles. But it can be best replace for MTP-2, not sure about x16, or x8, but we can see ~130-150 tok/s on DFlash Qwen3.5-122B-A10B.

In your opinion, which model is better to use for writing code: Qwen3.5-122B-A10B or Qwen3-coder-next?

@XQDev - Qwen3.5-122B-A10B-int4-AutoRound with bf16 KV cache has been better for research / planning / implementing / validating new features. ~ 28t/s

Qwen3-Coder-Next-int4-AutoRound with bf16 KV was faster and good at debugging, documenting, writing tests but not as capable at the other stuff.

Still testing this new qwen35-122b-hybrid-int4fp8, it looks promising – but I won’t be trading in my existing setup just yet TBD.


@Albond – Nice job! Ran the ./install.sh script and everything mostly worked.

I live in a bit of an internet backwater so to get Step 3 to work I had to add the following to the @eugr Docker file

# Increase timeout for large package downloads (default is 60s)
ENV UV_HTTP_TIMEOUT=300
ENV UV_RETRIES=10

Very Snappy!

╔══════════════════════════════════════════════════════╗
║  Qwen3.5-122B-A10B Benchmark: test 
║  Thu Apr  9 08:49:53 PM AEST 2026
╚══════════════════════════════════════════════════════╝

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 4.96s = 51.6 tok/s (prompt: 23)
  [Code] 512 tokens in 9.97s = 51.3 tok/s (prompt: 30)
  [JSON] 1024 tokens in 20.32s = 50.3 tok/s (prompt: 48)
  [Math] 64 tokens in 1.39s = 46.0 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 37.55s = 54.5 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 5.07s = 50.4 tok/s (prompt: 23)
  [Code] 512 tokens in 9.69s = 52.8 tok/s (prompt: 30)
  [JSON] 1024 tokens in 19.75s = 51.8 tok/s (prompt: 48)
  [Math] 64 tokens in 1.35s = 47.4 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 37.71s = 54.3 tok/s (prompt: 37)

=== Done ===

Hope it codes as well ;)

Qwen3.5-122B-A10B 100%. IFBench for Qwen3-coder-next is too small for real tasks, just 35%.

Coding quality feels like Sonnet 4+ — solid. However, Qwen3.5 shares a common trait with Gemini: both tend to be opinionated. Once they settle on an approach, they resist course-correction. With my design and engineering background, I can usually guide a model toward what I need — but with Qwen3.5 and Gemini, it becomes a constant battle rather than a collaboration.

Yesterday, while testing image capabilities for XQDev, I sent Qwen3.5 122b photos from the Artemis 2 mission. It described the visuals correctly — space and a planet — but got the source wrong every time: one photo it attributed to Artemis 1, and another it identified as Pluto rather than the Moon. When I explicitly pointed out that today is April 9, 2026 and Artemis 2 has already happened — the model doubled down, insisting I was wrong because “the mission dates haven’t been set yet.” A total facepalm moment. And that’s exactly how it behaves when coding — the code itself is fine, but it poorly follows the design guidance and defaults to its own vision of the solution instead.