Missing vision reasoning with Qwen3.5-122B Q4 on vLLM (works on llama.cpp)

Hi everyone,

I am currently getting excellent results using Unsloth’s Qwen3.5-122B_Q6 with llama.cpp. Here is my working configuration:

    ./llama.cpp/build/bin/llama-server \
  -m /workspace/AIEngine/Qwen3.5-122B_Q6/Qwen3.5-122B-A10B-UD-Q6_K_XL-00001-of-00004.gguf \
  --mmproj /workspace/AIEngine/Qwen3.5-122B_Q6/mmproj-BF16.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  -c 60000 \
  -a qwen3.5_122B \
  -ngl 999 \
  --cache-ram 0 \
  -np 1 \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.00

However, due to the slow inference speed (~18 t/s), I am trying to switch to vLLM with a Q4 quantization (Qwen3.5-122B-A10B-int4-AutoRound).

I am launching it with the following command (using the eugr/spark-vllm-docker repo on GitHub: Docker configuration for running vLLM on dual DGX Sparks):

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /home/ollam3/Desktop/AIEngine:/models" \
./launch-cluster.sh --solo -t vllm-node-tf5 --apply-mod mods/fix-qwen3.5-autoround \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 exec vllm serve /models/Qwen3.5-122B-A10B-int4-AutoRound \
  --max-model-len 40000 \
  --gpu-memory-utilization 0.90 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm

While the model loads and serves successfully, I am not getting any reasoning output when evaluating vision inputs.

Am I missing a specific parameter in my vLLM configuration? Or could it be that vision reasoning isn’t fully supported yet in this vLLM version (especially considering the custom mod-fix)?
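For completeness, here is a minimal sketch of the kind of vision request being sent to the OpenAI-compatible /v1/chat/completions endpoint (the served model name and image bytes below are placeholders, not taken from my actual pipeline):

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-compatible chat payload with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        # Placeholder: vLLM serves the model under the path/name passed to `vllm serve`
        "model": "/models/Qwen3.5-122B-A10B-int4-AutoRound",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0.6,
    }

payload = build_vision_request(b"<png bytes here>", "Describe this image.")
print(json.dumps(payload, indent=2)[:120])
```

With llama.cpp this style of request returns the thought process as expected; against vLLM the answer comes back but without any reasoning.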

Many thanks in advance for your help!

Kind regards,
P

You are missing the reasoning parser in your vLLM arguments. Just use the recipe from the repo. Reasoning absolutely works with vision inputs; I tested it yesterday.


Thank you! I hope to find time on the weekend to successfully switch my pipeline to vLLM and may report back the speed increases for my use case (generation of medical reports that take hours).

Hi eugr,

Thanks for the previous help; it works perfectly now (~28 t/s vs. ~18 t/s for the Q6 quant in llama.cpp). I had to add:

    --reasoning-parser qwen3

One quick follow-up: the qwen3 parser puts the thought process in a “reasoning” field. My other scripts were not picking it up because they look for “reasoning_content”, so I had to update some code.

Cheers,
Peter

Yeah, some models output in “reasoning”, some in “reasoning_content”. I believe “reasoning” is now the standard way to do it.