Hello,
I can’t understand why nobody is pointing out that Qwen3.6 on vLLM is not stable at all. I’ve tried all the vLLM versions and almost all the Qwen3.6 35B variants (FP8, INT4, NVFP4, with and without distillation), and they all have the same problem: endless repetition during thinking. The model starts creating a large file, then once it’s finished it decides, “Oh, actually, no, I don’t like it, I’ll do it differently,” and this can repeat several times. The same thing happens when it edits a file: it will edit it over and over. Sometimes it will even think, say something, call a tool, think again, say the same thing again, call the same tool with the same parameters, and so on, dozens of times or even indefinitely.
I’ve tried every possible setting, starting with the recommended ones. The last one that seemed stable was `{"repetition_penalty": 1.1, "temperature": 0.4}`, but eventually, once the context reaches around 30k tokens, the repetition comes back.
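For context, this is roughly how I’m passing those settings through vLLM’s OpenAI-compatible endpoint (a minimal sketch; the base URL and model name are placeholders for my setup, not anything official):

```python
from openai import OpenAI

# Placeholder endpoint/model for a local vLLM server; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3.6-35B",  # placeholder model name
    messages=[{"role": "user", "content": "Refactor this file..."}],
    temperature=0.4,
    # repetition_penalty is not part of the OpenAI schema, so vLLM takes it via extra_body.
    extra_body={"repetition_penalty": 1.1},
)
print(response.choices[0].message.content)
```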
This happens with or without `preserve_thinking`, whether in Claude Code, VS Code, or even custom-built assistants.
My second ongoing issue, whether with vLLM nightly, the latest stable release vLLM 0.20, or vLLM eugr (which is based on the nightly), is tool calls. The tool calls end up as raw XML inside the `thinking_content` output, forcing me to patch them everywhere. Qwen has been releasing chat templates for two years, and nobody has managed to get vLLM working out of the box with them. I don’t understand…
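To give an idea, this is the kind of patch I keep having to bolt on everywhere (a minimal sketch, assuming the leaked calls follow the Hermes/Qwen-style `<tool_call>{"name": ..., "arguments": ...}</tool_call>` convention; the helper name and tag format are my own assumptions, so adjust the regex to whatever actually shows up in your output):

```python
import json
import re

# Assumed tag format: <tool_call>{ JSON payload }</tool_call> leaked into the text.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str):
    """Split leaked XML tool calls out of a thinking/content string.

    Returns (cleaned_text, tool_calls), where tool_calls is a list of dicts
    with "name" and "arguments" keys, mirroring an OpenAI-style tool call.
    """
    tool_calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
            tool_calls.append(
                {"name": payload.get("name"), "arguments": payload.get("arguments", {})}
            )
        except json.JSONDecodeError:
            # Leave malformed blocks alone so nothing gets silently dropped.
            continue
    cleaned = TOOL_CALL_RE.sub("", text).strip()
    return cleaned, tool_calls
```

Having to run every response through something like this, in every client, is exactly what I’d like to stop doing.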
Could someone please create a ZIP file containing all the patches or commands needed to make it work properly? I’m starting to despair :(