is there any vllm flags to start in multimodal mode.
I am using Open WebUI for testing the model and it doesn’t allow to upload the videos \ images for communication with this hybrid model.
can’t find any setting in open webui interface to switch on sending images. and deepseek and chat.qwen.ai tell me that I should switch on image support with vllm.
can you help with this issue please?
Thanks a lot! Your install script works perfectly!
What would we do without eugr and Albond? :-)
Qwen3.5 122b multimodal and work only on multimodal mode (I tried to run in text mode only and get error) by default.
I tested right now (curl and Open WebUI), send image and get correct describe for picture. So probably something wrong in Open WebUI version or params.
Qwen3.5-VL technically supports video input, but Open WebUI’s chat UI doesn’t have a video upload widget yet … I guess.
UI-flag in Open WebUI → Admin Panel → Settings → Models → click on “qwen” → and in Capabilities section → Vision ☑.
Doing some other testing, and figured out why not put this on into the loop for reference data
Qwen3.5-35B-A3B on single DGX Spark (GB10, SM121) — Full optimization ladder
Using albond’s GitHub - albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4: Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%) · GitHub pipeline adapted for the 35B-A3B model. vllm-sm121 image pinned to v0.19.0+tf5.
Model: Intel/Qwen3.5-35B-A3B-int4-AutoRound (21 GB)
┌────────┬─────────────────────────────────┬────────────────┬──────────┬───────────────────────────────────────────────────────────┐
│ Step │ Config │ tok/s (decode) │ vs BF16 │ Notes │
├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
│ — │ BF16 --enforce-eager │ 27.7 │ -10% │ No CUDA Graphs, profiling mode │
├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
│ — │ BF16 + CUDA Graphs + FlashInfer │ 30.7 │ baseline │ --kv-cache-dtype fp8 │
├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
│ 0 │ INT4 AutoRound baseline │ 65.4 │ +113% │ Marlin GPTQ, auto attention │
├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
│ 0 │ INT4 + FlashInfer │ 66.8 │ +118% │ ~2% edge from FlashInfer │
├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
│ 1 │ + Hybrid INT4+FP8 dense │ 66.3 │ +116% │ Marginal on 35B (dense layers tiny in MoE) │
├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
│ 2+3 │ + MTP-2 + INT8 LM Head v2 │ 113–127 │ +310% │ Code/JSON: 127, Trading: 107 │
├────────┼─────────────────────────────────┼────────────────┼──────────┼───────────────────────────────────────────────────────────┤
│ 2+3+TQ │ + TurboQuant35 KV cache │ 56–83 │ +130% │ Slower — Triton attn overhead, model fits fine without TQ │
└────────┴─────────────────────────────────┴────────────────┴──────────┴───────────────────────────────────────────────────────────┘
Peak: 127 tok/s (code generation, run 2) — 4.1× BF16 baseline on a single Spark.
Benchmark script runs 5 tasks × 2 runs: Q&A (256 tok), Code (512), Trading analysis (512), Math (64), Long code (2048).
I run a small model on my laptop, so no need to use the DGX Spark for 35B. I mentioned ~122 tok/s because when I’m reviewing patches I do many restarts and debug cycles, so a small fast model is much better for each iteration. But in reality I don’t like 35B models — in my experience I see wrong answers more often than with 122B.
But overall it might be useful for someone with simpler tasks, a smaller model, or higher parallel request load.
Good review 👍!
Just wanted to say I was able to use these instructions to get 122b up and running at 50tps+ with no issues so far in my testing. I also added fastsafetensors, auto tool choice, and qwen3 tool parser to the serve command. It has been running in my harness with no noticeable issues so far.
Thanks @Albond ! Great job!
This is working great for me, too, thank you @Albond . I think next on my list is to try to use a later spark-vllm-docker that hopefully supports DFlash. Reference: Qwen3.5 27B optimisation thread starting at 30+ t/s TP=1
Now we all wait and hope that DFlash soon finally releases the drafter model for qwen3.5-122B :D
I have time right now to look into Block Diffusion … let me check.
any reason youre not using qwenxml instead of qwen coder?
Not sure what QwenXML is — got a link?
I went with Qwen3.5-122B-A10B because it’s currently the strongest model that fits on a single DGX Spark. The patches aren’t specific to it though — they should work with any Qwen MoE model, and everything is on GitHub if you want to port them.
I think i misread @mangosq comment about the tool parser he was using. qwen3xml is working a lot better then qwen3 coder, at least for claude code.
yeah. it was switched on.
I understood the problem
the problem was that Open WebUI don’t understand PNG files and wants JPG files.
your hybrid model works with images very fast! I am suprised
Ok, I see reason of long term release for DFlash Qwen3.5-122B-A10B - DeltaNet hybrid and I see some other troubles. But it can be best replace for MTP-2, not sure about x16, or x8, but we can see ~130-150 tok/s on DFlash Qwen3.5-122B-A10B.
In your opinion, which model is better to use for writing code: Qwen3.5-122B-A10B or Qwen3-coder-next?
@XQDev - Qwen3.5-122B-A10B-int4-AutoRound with bf16 KV cache has been better for research / planning / implementing / validating new features. ~ 28t/s
Qwen3-Coder-Next-int4-AutoRound with bf16 KV was faster and good at debugging, documenting, writing tests but not as capable at the other stuff.
Still testing this new qwen35-122b-hybrid-int4fp8, it looks promising – but I won’t be trading in my existing setup just yet TBD.
@Albond – Nice job! Ran the ./install.sh script and everything mostly worked.
I live in a bit of an internet backwater so to get Step 3 to work I had to add the following to the @eugr Docker file
# Increase timeout for large package downloads (default is 60s)
ENV UV_HTTP_TIMEOUT=300
ENV UV_RETRIES=10
Very Snappy!
╔══════════════════════════════════════════════════════╗
║ Qwen3.5-122B-A10B Benchmark: test
║ Thu Apr 9 08:49:53 PM AEST 2026
╚══════════════════════════════════════════════════════╝
── Run 1/2 ──────────────────────────────────────
[Q&A] 256 tokens in 4.96s = 51.6 tok/s (prompt: 23)
[Code] 512 tokens in 9.97s = 51.3 tok/s (prompt: 30)
[JSON] 1024 tokens in 20.32s = 50.3 tok/s (prompt: 48)
[Math] 64 tokens in 1.39s = 46.0 tok/s (prompt: 29)
[LongCode] 2048 tokens in 37.55s = 54.5 tok/s (prompt: 37)
── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 5.07s = 50.4 tok/s (prompt: 23)
[Code] 512 tokens in 9.69s = 52.8 tok/s (prompt: 30)
[JSON] 1024 tokens in 19.75s = 51.8 tok/s (prompt: 48)
[Math] 64 tokens in 1.35s = 47.4 tok/s (prompt: 29)
[LongCode] 2048 tokens in 37.71s = 54.3 tok/s (prompt: 37)
=== Done ===
Hope it codes as well ;)
Qwen3.5-122B-A10B 100%. IFBench for Qwen3-coder-next is too small for real tasks, just 35%.
Coding quality feels like Sonnet 4+ — solid. However, Qwen3.5 shares a common trait with Gemini: both tend to be opinionated. Once they settle on an approach, they resist course-correction. With my design and engineering background, I can usually guide a model toward what I need — but with Qwen3.5 and Gemini, it becomes a constant battle rather than a collaboration.
Yesterday, while testing image capabilities for XQDev, I sent Qwen3.5 122b photos from the Artemis 2 mission. It described the visuals correctly — space and a planet — but got the source wrong every time: one photo it attributed to Artemis 1, and another it identified as Pluto rather than the Moon. When I explicitly pointed out that today is April 9, 2026 and Artemis 2 has already happened — the model doubled down, insisting I was wrong because “the mission dates haven’t been set yet.” A total facepalm moment. And that’s exactly how it behaves when coding — the code itself is fine, but it poorly follows the design guidance and defaults to its own vision of the solution instead.