Missing vision reasoning with Qwen3.5-122B Q4 on vLLM (works on llama.cpp)

Hi everyone,

I am currently getting excellent results using Unsloth’s Qwen3.5-122B_Q6 with llama.cpp. Here is my working configuration:

    ./llama.cpp/build/bin/llama-server \
  -m /workspace/AIEngine/Qwen3.5-122B_Q6/Qwen3.5-122B-A10B-UD-Q6_K_XL-00001-of-00004.gguf \
  --mmproj /workspace/AIEngine/Qwen3.5-122B_Q6/mmproj-BF16.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  -c 60000 \
  -a qwen3.5_122B \
  -ngl 999 \
  --cache-ram 0 \
  -np 1 \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.00

However, due to the slow inference speed (~18 t/s), I am trying to switch to vLLM with a Q4 quantization (Qwen3.5-122B-A10B-int4-AutoRound).

I am launching it with the following command (using the eugr/spark-vllm-docker repo on GitHub: Docker configuration for running vLLM on dual DGX Sparks):

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /home/ollam3/Desktop/AIEngine:/models" \
./launch-cluster.sh --solo -t vllm-node-tf5 --apply-mod mods/fix-qwen3.5-autoround \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 exec vllm serve /models/Qwen3.5-122B-A10B-int4-AutoRound \
  --max-model-len 40000 \
  --gpu-memory-utilization 0.90 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm

While the model loads and serves successfully, I am not getting any reasoning output when evaluating vision inputs.

Am I missing a specific parameter in my vLLM configuration? Or could it be that vision reasoning isn’t fully supported yet in this vLLM version (especially considering the custom mod-fix)?
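For completeness, here is a minimal sketch of the kind of vision request being sent to the OpenAI-compatible /v1/chat/completions endpoint (the served model name and image bytes below are placeholders, not taken from my actual pipeline):

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-compatible chat payload with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        # Placeholder: vLLM serves the model under the path/name passed to `vllm serve`
        "model": "/models/Qwen3.5-122B-A10B-int4-AutoRound",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0.6,
    }

payload = build_vision_request(b"<png bytes here>", "Describe this image.")
print(json.dumps(payload, indent=2)[:120])
```

With llama.cpp this style of request returns the thought process as expected; against vLLM the answer comes back but without any reasoning.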

Many thanks in advance for your help!

Kind regards,
P

You are missing the reasoning parser in your vLLM arguments. Just use the recipe from the repo. Reasoning absolutely works with vision inputs; I tested it yesterday.


Thank you! I hope to find time on the weekend to successfully switch my pipeline to vLLM and may report back the speed increases for my use case (generation of medical reports that take hours).

Hi eugr,

Thanks for the previous help; it works perfectly now (~28 t/s vs. ~18 t/s for the Q6 quant in llama.cpp). I had to add:

    --reasoning-parser qwen3

One quick follow-up: the qwen3 parser puts the thought process in a “reasoning” field. My other scripts were not picking it up because they look for “reasoning_content”, so I had to update some code.

Cheers,
Peter

Yeah, some models output in “reasoning”, some in “reasoning_content”. I believe “reasoning” is now the standard way to do it.