Qwen3.5 Tool Calling finally fixed (possibly)

So I’ve been having running Qwen3.5 122b intel autoround on my spark and tool calling has always been a problem even with the unsloth fix. Short tasks is fine but recently I’ve been running Hermes Agent and with long tasks, tool calling silently fails.

I stumbled upon a chat template fix on reddit that could very fix this. I’m still testing now but it seems promising:

In addition to the chat template, the author suggested using

--tool-call-parser qwen3_xml

instead of --tool-call-parser qwen3_coder

A very odd workaround i found for it to work in opencode was to change the config to use Anthropic instead of openapi-comatable.

    "spark-anthropic": {
      "npm": "@ai-sdk/anthropic",
      "name": "Spark Anthropic",
      "options": {
        "baseURL": "http://10.1.1.60:8081/v1",
        "chunkTimeout": 18000000
      },
      "models": {
        "qwen3.5-thinking": {
          "id": "Qwen3.5",
          "name": "Qwen3.5 (Thinking)",
          "limit": {
            "context": 262144,
            "output": 65536
          },
          "modalities": { "input": ["text", "image"], "output": ["text"] }
        }
      }
    }

I have no idea why this would make a change, but I’ve been using it that way for about a week, and no issues. I’ve even had session without a mistake run for multiple hours.

With OpenCode I tried qwen3_xml but kept getting the following error

Expected ‘function.name’ to be a string

With --tool-call-parser qwen3_coder and --chat-template /models/qwen3.5-enhanced.jinja tool calls with OpenCode have been flawless

I will test this out on my setup. Sound interesting.

Ok I’m reporting back after a full 12 hours of testing. I was running hermes-agent with llm-wiki skill and had the agent populate my wiki, doing research non-stop for 4 - 6 hours per session.

Using the old --tool-call-parser qwen3_coder with the new chat template resulted in a silent tool call failure after 2 hours. Which is still much better than before where tool calls will fail after a handful of turns.

Using --tool-call-parser qwen3_xml along with the new chat template was the real winner. The session lasted 6 hours and agent finished the task.

I will continue testing this and considered this fixed for the time being.

@eugr how to use qwen3.5-enhanced.jinja with spark-vllm-docker?

Sorry for bothering but for opencode Qwen3_XML seemed to make problems with tool calling for me, Qwen3_Coder works better there in my Experience, did anybody else had the same experience? Is Qwen3_XML better than Qwen3_Coder?

I am running Albonds Qwen3.5 122B Hybrid Autoround with Qwen3_coder parser and Qwens default offial tempalte. Does the jinja Template make it better and more reliable overall? I am running it like this:

docker run -d --name vllm-qwen35
–gpus all --net=host --ipc=host
-e TZ=Europe/Vienna
-v /etc/localtime:/etc/localtime:ro
-v /etc/timezone:/etc/timezone:ro
-v ~/models:/models
vllm-qwen35-v2
serve /models/qwen35-122b-hybrid-int4fp8
–served-model-name qwen
–port 8000
–max-model-len 262144
–gpu-memory-utilization 0.90
–max-num-seqs 4
–load-format fastsafetensors
–reasoning-parser qwen3
–attention-backend FLASHINFER
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:2}’
–enable-auto-tool-choice
–tool-call-parser qwen3_coder

Would you recommend changing it?

With eugr’s solution you just make a mod - in this case a script that copies the template into the container before starting.

And add it to your recipe.

Yes, like the poster above said, you can just make a mod.
Does it really solve the tool calling issues compared to Unsloth chat template I’m currently using in Qwen 3.5 recipes? If I get enough positive feedback, I may just use this template instead.

It does seem to work better. I started running it today, and haven’t had any failed tool-calls yet, with longer running flows in open code. this is with the qwen3_xml tool parser and the enhanced tool calling on a modified version of your qwen3.5-122b-int4-autoround recipe. With the Unsloth one over the openai-compatable api it felt like it broke quite frequently.

I can confirm from my tests that the new template + XML combo makes tool use much more stable. On a 35B model, I was getting tool call failures every single time without fixes. Switching to XML + Unsloth got me to a 50% success rate, but with the new template + XML, all four initial runs were successful. I did hit one failure during further testing, but that’s roughly a 10% failure rate compared to 50% with Unsloth.

It would be great to add this as a separate mode. Instead of replacing Unsloth, we could just add the new template so people have a choice.

The irony is that just as I stabilized tool use for 35B 3.5, version 3.6 dropped. And it looks like 3.6 is as stable out of the box as 3.5 was with all the fixes — I’ve only had one failure in 8 runs so far.

Try without MTP and see if you see better results.

Just want to confirm 3.6 seems fixed the issue and I see consistent tool calls in long agentic sessions, also new model do a lot of parallel tool calls when needed without any errors. So don’t override default model chat template. Moreover, you should use new ‘preserve_thinking’ kwarg, it helps a lot for agentic workflows: prefix caching works and agent avoids repeatable thinking.

Agreed. I also noticed that preserve_thinking makes a massive difference.

Yeah. Same. I get 20 or so t/s with MTP and 77 with a dual node cluster without.

I tried it and backed out because for qwen3.5-122B it was performing worse. The custom jinja alone was the best solution for me.

I use:

docker run -it --name vllm-qwen35 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  --gpus all --net=host --ipc=host \
  -v ~/models:/models \
  -v "$SCRIPT_DIR/chat-template:/chat-template:ro" \
  vllm-qwen35-v2 \
  serve /models/qwen35-35b-hybrid-int4fp8 \
  --served-model-name qwen/qwen3.5 \
  --chat-template /chat-template/qwen3.5-enhanced.jinja \
  --max-model-len 196608 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.88 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --generation-config auto \
  --override-generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'

The custom template can be found hee: vLLM-Qwen3.5-27B/qwen3.5-enhanced.jinja at main · allanchan339/vLLM-Qwen3.5-27B

I can attest that this has completely fixed my issues of having tool calls leak into the reasoning block. Before it would happen to me fairly regularly then the model would stop as if it were done doing whatever it was doing.

Thanks, I’ll test too and incorporate in Qwen3.5 recipes.