Qwen3.5 Tool Calling finally fixed (possibly)

There is hope: vLLM PR #40783 by ExtReMLapin, "[Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, end detection, token counting, withhold recovery" (vllm-project/vllm).
I rebuilt @eugr's Docker image with PR #40783 applied. In short testing, tool calling improved and the reasoning between tool calls is no longer cut off.

./build-and-copy.sh -t vllm-node --apply-vllm-pr 40783

Did you run tool-eval-bench to compare? Results with and without the patch would be helpful.

I've tested two custom builds: one with the PR fixes and one without (baseline). Both used the stock template.

First, I ran my task on the image with the fixes: I was glad to see zero crashes on the 122B model.
However, when I tested the image without the fixes, there were also no crashes.
(It seems some recent changes in vLLM might have already addressed the stability issues?)

Tool Evaluation Results:

  • 3.5 122B: 100% (both with and without PR)

  • 3.6 35B: 100% (both with and without PR)

  • 3.6 27B: 100% without PR / 93% with PR

  • Coder-Next: 80% without PR / 83% with PR (different tests failing in each run)

Observations: The PR does change the tool-eval results. The shift is most visible on Coder-Next (which already had a decent template), where a different set of tests fails in each run. Interestingly, it seems to slightly degrade the 27B model's behavior.

For now, I can’t confirm a clear correlation between my specific task and the tool-eval benchmark. I’ll keep monitoring the situation.
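For anyone comparing the percentages above with the "Points: N/30" lines in the logs below, here is a minimal sketch of how they seem to relate. The rounding rule is an assumption inferred from the reported numbers, not taken from the tool-eval-bench source, and `score_percent` is a hypothetical helper name.

```python
# Assumption: score = points / max_points * 100, rounded to the nearest integer.
# This reproduces the scores reported in this thread but is not confirmed
# against the tool-eval-bench implementation.

def score_percent(points: int, max_points: int = 30) -> int:
    """Convert earned points into a 0-100 score."""
    return round(100 * points / max_points)

# (points without PR, points with PR), copied from the logs in this thread.
runs = {
    "3.5 122B": (30, 30),
    "3.6 35B": (30, 30),
    "3.6 27B": (30, 28),
    "Coder-Next": (24, 25),
}

for model, (without_pr, with_pr) in runs.items():
    before = score_percent(without_pr)
    after = score_percent(with_pr)
    print(f"{model}: {before}% without PR, {after}% with PR, delta {after - before:+d}")
```

With 15 scenarios worth 2 points each, a single failed scenario costs about 7 points on this scale, so one run per build cannot separate a real regression from run-to-run noise.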

Thanks for providing these options!

tool-eval-bench --short --seed 42
For each model below, the first run is without the PR and the second is with the PR.

122b tool eval

Warm-up complete (293 ms)
 Engine: vLLM 0.20.2rc1.dev1+g54dc64d5d.d20260503

/models/Qwen/Qwen3.5-122B-A10B-FP8 via vllm @ http://192.168.88.138:8888
│ 15 scenarios v1.5.1 │

● TC-01 Direct Specialist Match ✅ PASS 2/2 5.6s ttft=2,181ms t2 Used get_weather with Berlin only.
● TC-02 Distractor Resistance ✅ PASS 2/2 10.2s ttft=2,248ms t2 Used only get_stock_price for AAPL.
● TC-03 Implicit Tool Need ✅ PASS 2/2 7.6s ttft=2,480ms t3 Looked up Sarah before sending the email.
● TC-04 Unit Handling ✅ PASS 2/2 5.6s ttft=2,996ms t2 Requested Tokyo weather in Fahrenheit explicitly.
● TC-05 Date and Time Parsing ✅ PASS 2/2 11.3s ttft=6,704ms t2 Parsed next Monday and included the requested meeting details.
● TC-06 Multi-Value Extraction ✅ PASS 2/2 8.9s ttft=3,855ms t2 Issued separate translate_text calls for both languages.
● TC-07 Search → Read → Act ✅ PASS 2/2 13.2s ttft=2,921ms t5 Completed the full four-step chain with the right data.
● TC-08 Conditional Branching ✅ PASS 2/2 11.1s ttft=5,824ms t3 Checked the weather first, then set the rainy-day reminder.
● TC-09 Parallel Independence ✅ PASS 2/2 16.0s ttft=2,543ms t2 Handled both independent tasks.
● TC-10 Trivial Knowledge ✅ PASS 2/2 5.4s ttft=3,551ms Answered directly without tool use.
● TC-11 Simple Math ✅ PASS 2/2 3.9s ttft=2,955ms Did the math directly — good restraint.
● TC-12 Impossible Request ✅ PASS 2/2 11.1s ttft=7,443ms Refused cleanly because no delete-email tool exists.
● TC-13 Empty Results ✅ PASS 2/2 4.8s ttft=2,197ms t2 Asked for clarification after the empty result.
● TC-14 Malformed Response ✅ PASS 2/2 4.9s ttft=2,141ms t2 Acknowledged the stock tool failure and handled it gracefully.
● TC-15 Conflicting Information ✅ PASS 2/2 8.5s ttft=2,534ms t3 Used the searched population value in the calculator.

Category Breakdown

│ Tool Selection │ 100% │ ████████████████████ │ 6/6 │
│ Parameter Precision │ 100% │ ████████████████████ │ 6/6 │
│ Multi-Step Chains │ 100% │ ████████████████████ │ 6/6 │
│ Restraint & Refusal │ 100% │ ████████████████████ │ 6/6 │
│ Error Recovery │ 100% │ ████████████████████ │ 6/6 │

│ │
Model: /models/Qwen/Qwen3.5-122B-A10B-FP8 │
Score: 100 / 100
Rating: ★★★★★ Excellent
│ Engine: vLLM 0.20.2rc1.dev1+g54dc64d5d.d20260503 │
│ Quantization: FP8 │
│ Max context: 262,144 tokens │
│ │
│ ✅ 15 passed ⚠ 0 partial ❌ 0 failed │
Points: 30/30 │
│ │
Quality: 100/100 │
Responsiveness: 50/100 (median turn: 3.0s) │
Deployability: 85/100 (α=0.7) │
│ │
│ Completed in 128.3s │ tool-eval-bench v1.5.1 │
│ │
 Token Usage:
│ Total: 33,452 tokens │ Efficiency: 0.9 pts/1K tokens │

✓ Warm-up complete (273 ms)
 Engine: vLLM 0.20.2rc1.dev22+g458339c47.d20260503

/models/Qwen/Qwen3.5-122B-A10B-FP8 via vllm @ http://192.168.88.138:8888
│ 15 scenarios v1.5.1 │

● TC-01 Direct Specialist Match ✅ PASS 2/2 5.6s ttft=2,177ms t2 Used get_weather with Berlin only.
● TC-02 Distractor Resistance ✅ PASS 2/2 9.6s ttft=2,248ms t2 Used only get_stock_price for AAPL.
● TC-03 Implicit Tool Need ✅ PASS 2/2 7.7s ttft=2,514ms t3 Looked up Sarah before sending the email.
● TC-04 Unit Handling ✅ PASS 2/2 5.8s ttft=3,180ms t2 Requested Tokyo weather in Fahrenheit explicitly.
● TC-05 Date and Time Parsing ✅ PASS 2/2 11.2s ttft=6,708ms t2 Parsed next Monday and included the requested meeting details.
● TC-06 Multi-Value Extraction ✅ PASS 2/2 9.9s ttft=4,082ms t2 Issued separate translate_text calls for both languages.
● TC-07 Search → Read → Act ✅ PASS 2/2 12.6s ttft=2,925ms t5 Completed the full four-step chain with the right data.
● TC-08 Conditional Branching ✅ PASS 2/2 11.0s ttft=5,809ms t3 Checked the weather first, then set the rainy-day reminder.
● TC-09 Parallel Independence ✅ PASS 2/2 15.8s ttft=2,541ms t2 Handled both independent tasks.
● TC-10 Trivial Knowledge ✅ PASS 2/2 5.8s ttft=3,714ms Answered directly without tool use.
● TC-11 Simple Math ✅ PASS 2/2 3.9s ttft=2,958ms Did the math directly — good restraint.
● TC-12 Impossible Request ✅ PASS 2/2 11.2s ttft=7,467ms Refused cleanly because no delete-email tool exists.
● TC-13 Empty Results ✅ PASS 2/2 5.4s ttft=2,818ms t2 Asked for clarification after the empty result.
● TC-14 Malformed Response ✅ PASS 2/2 4.9s ttft=2,143ms t2 Acknowledged the stock tool failure and handled it gracefully.
● TC-15 Conflicting Information ✅ PASS 2/2 8.5s ttft=2,533ms t3 Used the searched population value in the calculator.

Category Breakdown

│ Tool Selection │ 100% │ ████████████████████ │ 6/6 │
│ Parameter Precision │ 100% │ ████████████████████ │ 6/6 │
│ Multi-Step Chains │ 100% │ ████████████████████ │ 6/6 │
│ Restraint & Refusal │ 100% │ ████████████████████ │ 6/6 │
│ Error Recovery │ 100% │ ████████████████████ │ 6/6 │

│ │
Model: /models/Qwen/Qwen3.5-122B-A10B-FP8 │
Score: 100 / 100
Rating: ★★★★★ Excellent
│ Engine: vLLM 0.20.2rc1.dev22+g458339c47.d20260503 │
│ Quantization: FP8 │
│ Max context: 262,144 tokens │
│ │
│ ✅ 15 passed ⚠ 0 partial ❌ 0 failed │
Points: 30/30 │
│ │
Quality: 100/100 │
Responsiveness: 50/100 (median turn: 3.0s) │
Deployability: 85/100 (α=0.7) │
│ │
│ Completed in 129.0s │ tool-eval-bench v1.5.1 │
│ │
 Token Usage:
│ Total: 33,446 tokens │ Efficiency: 0.9 pts/1K tokens │
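The "Deployability" line in each summary looks like a weighted blend of Quality and Responsiveness with α=0.7. The sketch below checks that guess against the values reported in this thread. The formula is an assumption inferred from the printed numbers, not taken from the tool-eval-bench source, and `blend` is a hypothetical name.

```python
# Assumption: deployability ≈ alpha * quality + (1 - alpha) * responsiveness.
# Inferred from the summary boxes in this thread; the tool's exact rounding
# is unknown, so results are only compared to within one point.

def blend(quality: float, responsiveness: float, alpha: float = 0.7) -> float:
    """Weighted mix of quality and responsiveness scores (both 0-100)."""
    return alpha * quality + (1 - alpha) * responsiveness

# (label, quality, responsiveness, reported deployability) from the logs.
reported = [
    ("122B, both runs", 100, 50, 85),
    ("27B without PR", 100, 15, 74),
    ("27B with PR", 93, 14, 69),
    ("35B without PR", 100, 75, 92),
    ("35B with PR", 100, 74, 92),
    ("Coder-Next without PR", 80, 88, 82),
    ("Coder-Next with PR", 83, 89, 85),
]

for label, quality, responsiveness, deployability in reported:
    estimate = blend(quality, responsiveness)
    print(f"{label}: estimated {estimate:.1f}, reported {deployability}")
```

If the guess holds, a slow but accurate model such as the 27B loses deployability points mainly through its long median turn time rather than through tool-calling mistakes.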

27b tool eval

Warm-up complete (1645 ms)
 Engine: vLLM 0.20.2rc1.dev1+g54dc64d5d.d20260503

/models/Qwen/Qwen3.6-27B-FP8 via vllm @ http://192.168.88.138:8888
│ 15 scenarios v1.5.1 │

● TC-01 Direct Specialist Match ✅ PASS 2/2 22.4s ttft=7,267ms t2 Used get_weather with Berlin only.
● TC-02 Distractor Resistance ✅ PASS 2/2 12.5s ttft=3,686ms t2 Used only get_stock_price for AAPL.
● TC-03 Implicit Tool Need ✅ PASS 2/2 21.3s ttft=5,114ms t3 Looked up Sarah before sending the email.
● TC-04 Unit Handling ✅ PASS 2/2 9.0s ttft=2,885ms t2 Requested Tokyo weather in Fahrenheit explicitly.
● TC-05 Date and Time Parsing ✅ PASS 2/2 59.6s ttft=27,113ms t3 Parsed next Monday and included the requested meeting details.
● TC-06 Multi-Value Extraction ✅ PASS 2/2 23.7s ttft=8,426ms t2 Issued separate translate_text calls for both languages.
● TC-07 Search → Read → Act ✅ PASS 2/2 45.1s ttft=8,422ms t5 Completed the full four-step chain with the right data.
● TC-08 Conditional Branching ✅ PASS 2/2 38.2s ttft=13,373ms t3 Checked the weather first, then set the rainy-day reminder.
● TC-09 Parallel Independence ✅ PASS 2/2 23.5s ttft=6,723ms t2 Handled both independent tasks.
● TC-10 Trivial Knowledge ✅ PASS 2/2 15.0s ttft=14,181ms Answered directly without tool use.
● TC-11 Simple Math ✅ PASS 2/2 26.6s ttft=25,663ms Did the math directly — good restraint.
● TC-12 Impossible Request ✅ PASS 2/2 23.2s ttft=13,560ms Refused cleanly because no delete-email tool exists.
● TC-13 Empty Results ✅ PASS 2/2 18.1s ttft=3,069ms t3 Retried after the empty result and recovered.
● TC-14 Malformed Response ✅ PASS 2/2 14.1s ttft=2,662ms t2 Acknowledged the stock tool failure and handled it gracefully.
● TC-15 Conflicting Information ✅ PASS 2/2 24.0s ttft=3,809ms t3 Used the searched population value in the calculator.

Category Breakdown
│ Tool Selection │ 100% │ ████████████████████ │ 6/6 │
│ Parameter Precision │ 100% │ ████████████████████ │ 6/6 │
│ Multi-Step Chains │ 100% │ ████████████████████ │ 6/6 │
│ Restraint & Refusal │ 100% │ ████████████████████ │ 6/6 │
│ Error Recovery │ 100% │ ████████████████████ │ 6/6 │

│ │
Model: /models/Qwen/Qwen3.6-27B-FP8 │
Score: 100 / 100
Rating: ★★★★★ Excellent
│ Engine: vLLM 0.20.2rc1.dev1+g54dc64d5d.d20260503 │
│ Quantization: FP8 │
│ Max context: 262,144 tokens │
│ │
│ ✅ 15 passed ⚠ 0 partial ❌ 0 failed │
Points: 30/30 │
│ │
Quality: 100/100 │
Responsiveness: 15/100 (median turn: 9.5s) │
Deployability: 74/100 (α=0.7) │
│ │
│ Completed in 376.3s │ tool-eval-bench v1.5.1 │
│ │
 Token Usage:
│ Total: 38,828 tokens │ Efficiency: 0.8 pts/1K tokens │

Warm-up complete (1719 ms)
 Engine: vLLM 0.20.2rc1.dev22+g458339c47.d20260503

/models/Qwen/Qwen3.6-27B-FP8 via vllm @ http://192.168.88.138:8888
│ 15 scenarios v1.5.1

● TC-01 Direct Specialist Match ✅ PASS 2/2 22.0s ttft=4,729ms t2 Used get_weather with Berlin only.
● TC-02 Distractor Resistance ✅ PASS 2/2 12.5s ttft=3,692ms t2 Used only get_stock_price for AAPL.
● TC-03 Implicit Tool Need ✅ PASS 2/2 19.5s ttft=5,050ms t3 Looked up Sarah before sending the email.
● TC-04 Unit Handling ✅ PASS 2/2 8.6s ttft=2,736ms t2 Requested Tokyo weather in Fahrenheit explicitly.
● TC-05 Date and Time Parsing ✅ PASS 2/2 39.8s ttft=19,675ms t2 Parsed next Monday and included the requested meeting details.
● TC-06 Multi-Value Extraction ✅ PASS 2/2 28.3s ttft=13,552ms t2 Issued separate translate_text calls for both languages.
● TC-07 Search → Read → Act ✅ PASS 2/2 45.3s ttft=8,446ms t5 Completed the full four-step chain with the right data.
● TC-08 Conditional Branching ✅ PASS 2/2 37.2s ttft=12,049ms t3 Checked the weather first, then set the rainy-day reminder.
● TC-09 Parallel Independence ✅ PASS 2/2 31.3s ttft=6,746ms t2 Handled both independent tasks.
● TC-10 Trivial Knowledge ✅ PASS 2/2 12.6s ttft=11,772ms Answered directly without tool use.
● TC-11 Simple Math ✅ PASS 2/2 31.3s ttft=30,389ms Did the math directly — good restraint.
● TC-12 Impossible Request ✅ PASS 2/2 27.6s ttft=13,540ms Refused cleanly because no delete-email tool exists.
● TC-13 Empty Results ❌ FAIL 0/2 15.5s ttft=3,827ms t2 Did not adapt after the empty search response.
● TC-14 Malformed Response ✅ PASS 2/2 13.0s ttft=3,690ms t2 Acknowledged the stock tool failure and handled it gracefully.
● TC-15 Conflicting Information ✅ PASS 2/2 26.3s ttft=3,823ms t3 Used the searched population value in the calculator.

Category Breakdown

│ Tool Selection │ 100% │ ████████████████████ │ 6/6 │
│ Parameter Precision │ 100% │ ████████████████████ │ 6/6 │
│ Multi-Step Chains │ 100% │ ████████████████████ │ 6/6 │
│ Restraint & Refusal │ 100% │ ████████████████████ │ 6/6 │
│ Error Recovery │ 67% │ █████████████░░░░░░░ │ 4/6 │


Model: /models/Qwen/Qwen3.6-27B-FP8 │
Score: 93 / 100
Rating: ★★★★★ Excellent
│ Engine: vLLM 0.20.2rc1.dev22+g458339c47.d20260503 │
│ Quantization: FP8 │
│ Max context: 262,144 tokens │
│ │
│ ✅ 14 passed ⚠ 0 partial ❌ 1 failed │
Points: 28/30 │
│ │
Quality: 93/100 │
Responsiveness: 14/100 (median turn: 9.9s) │
Deployability: 69/100 (α=0.7) │
Weakest: E Error Recovery (67%) │
│ │
│ Completed in 370.9s │ tool-eval-bench v1.5.1 │
│ │
 Token Usage:
│ Total: 34,360 tokens │ Efficiency: 0.8 pts/1K tokens │

35b tool eval

Warm-up complete (1428 ms)
 Engine: vLLM 0.20.2rc1.dev1+g54dc64d5d.d20260503

/models/Qwen/Qwen3.6-35B-A3B-FP8 via vllm @ http://192.168.88.138:8888
│ 15 scenarios v1.5.1 │

● TC-01 Direct Specialist Match ✅ PASS 2/2 2.9s ttft=1,292ms t2 Used get_weather with Berlin only.
● TC-02 Distractor Resistance ✅ PASS 2/2 8.3s ttft=7,116ms t2 Used only get_stock_price for AAPL.
● TC-03 Implicit Tool Need ✅ PASS 2/2 3.5s ttft=936ms t3 Looked up Sarah before sending the email.
● TC-04 Unit Handling ✅ PASS 2/2 2.6s ttft=643ms t2 Requested Tokyo weather in Fahrenheit explicitly.
● TC-05 Date and Time Parsing ✅ PASS 2/2 5.2s ttft=3,284ms t2 Parsed next Monday and included the requested meeting details.
● TC-06 Multi-Value Extraction ✅ PASS 2/2 3.9s ttft=1,577ms t2 Issued separate translate_text calls for both languages.
● TC-07 Search → Read → Act ✅ PASS 2/2 7.0s ttft=1,553ms t5 Completed the full four-step chain with the right data.
● TC-08 Conditional Branching ✅ PASS 2/2 4.5s ttft=921ms t3 Checked the weather first, then set the rainy-day reminder.
● TC-09 Parallel Independence ✅ PASS 2/2 5.4s ttft=1,021ms t2 Handled both independent tasks.
● TC-10 Trivial Knowledge ✅ PASS 2/2 2.3s ttft=1,654ms Answered directly without tool use.
● TC-11 Simple Math ✅ PASS 2/2 7.1s ttft=6,925ms Did the math directly — good restraint.
● TC-12 Impossible Request ✅ PASS 2/2 7.5s ttft=5,639ms Refused cleanly because no delete-email tool exists.
● TC-13 Empty Results ✅ PASS 2/2 6.1s ttft=931ms t4 Retried after the empty result and recovered.
● TC-14 Malformed Response ✅ PASS 2/2 5.1s ttft=2,798ms t2 Acknowledged the stock tool failure and handled it gracefully.
● TC-15 Conflicting Information ✅ PASS 2/2 3.7s ttft=659ms t3 Used the searched population value in the calculator.

Category Breakdown

│ Tool Selection │ 100% │ ████████████████████ │ 6/6 │
│ Parameter Precision │ 100% │ ████████████████████ │ 6/6 │
│ Multi-Step Chains │ 100% │ ████████████████████ │ 6/6 │
│ Restraint & Refusal │ 100% │ ████████████████████ │ 6/6 │
│ Error Recovery │ 100% │ ████████████████████ │ 6/6 │


Model: /models/Qwen/Qwen3.6-35B-A3B-FP8 │
Score: 100 / 100
Rating: ★★★★★ Excellent
│ Engine: vLLM 0.20.2rc1.dev1+g54dc64d5d.d20260503 │
│ Quantization: FP8 │
│ Max context: 262,144 tokens │
│ │
│ ✅ 15 passed ⚠ 0 partial ❌ 0 failed │
Points: 30/30 │
│ │
Quality: 100/100 │
Responsiveness: 75/100 (median turn: 1.4s) │
Deployability: 92/100 (α=0.7) │
│ │
│ Completed in 75.2s │ tool-eval-bench v1.5.1 │
│ │
 Token Usage:
│ Total: 37,508 tokens │ Efficiency: 0.8 pts/1K tokens │
│ │

✓ Warm-up complete (1436 ms)
 Engine: vLLM 0.20.2rc1.dev22+g458339c47.d20260503

/models/Qwen/Qwen3.6-35B-A3B-FP8 via vllm @ http://192.168.88.138:8888
│ 15 scenarios v1.5.1

● TC-01 Direct Specialist Match ✅ PASS 2/2 2.8s ttft=1,272ms t2 Used get_weather with Berlin only.
● TC-02 Distractor Resistance ✅ PASS 2/2 4.0s ttft=976ms t2 Used only get_stock_price for AAPL.
● TC-03 Implicit Tool Need ✅ PASS 2/2 4.1s ttft=928ms t3 Looked up Sarah before sending the email.
● TC-04 Unit Handling ✅ PASS 2/2 2.9s ttft=947ms t2 Requested Tokyo weather in Fahrenheit explicitly.
● TC-05 Date and Time Parsing ✅ PASS 2/2 18.7s ttft=13,567ms t3 Parsed next Monday and included the requested meeting details.
● TC-06 Multi-Value Extraction ✅ PASS 2/2 4.6s ttft=1,464ms t3 Issued separate translate_text calls for both languages.
● TC-07 Search → Read → Act ✅ PASS 2/2 7.0s ttft=1,364ms t5 Completed the full four-step chain with the right data.
● TC-08 Conditional Branching ✅ PASS 2/2 4.5s ttft=1,159ms t3 Checked the weather first, then set the rainy-day reminder.
● TC-09 Parallel Independence ✅ PASS 2/2 9.4s ttft=997ms t2 Handled both independent tasks.
● TC-10 Trivial Knowledge ✅ PASS 2/2 1.6s ttft=1,418ms Answered directly without tool use.
● TC-11 Simple Math ✅ PASS 2/2 12.3s ttft=12,161ms Did the math directly — good restraint.
● TC-12 Impossible Request ✅ PASS 2/2 5.7s ttft=4,002ms Refused cleanly because no delete-email tool exists.
● TC-13 Empty Results ✅ PASS 2/2 5.9s ttft=920ms t4 Retried after the empty result and recovered.
● TC-14 Malformed Response ✅ PASS 2/2 4.2s ttft=1,847ms t2 Acknowledged the stock tool failure and handled it gracefully.
● TC-15 Conflicting Information ✅ PASS 2/2 4.3s ttft=977ms t3 Used the searched population value in the calculator.

Category Breakdown
│ Tool Selection │ 100% │ ████████████████████ │ 6/6 │
│ Parameter Precision │ 100% │ ████████████████████ │ 6/6 │
│ Multi-Step Chains │ 100% │ ████████████████████ │ 6/6 │
│ Restraint & Refusal │ 100% │ ████████████████████ │ 6/6 │
│ Error Recovery │ 100% │ ████████████████████ │ 6/6 │

│ │
Model: /models/Qwen/Qwen3.6-35B-A3B-FP8 │
Score: 100 / 100
Rating: ★★★★★ Excellent
│ Engine: vLLM 0.20.2rc1.dev22+g458339c47.d20260503 │
│ Quantization: FP8 │
│ Max context: 262,144 tokens │
│ │
│ ✅ 15 passed ⚠ 0 partial ❌ 0 failed │
Points: 30/30 │
│ │
Quality: 100/100 │
Responsiveness: 74/100 (median turn: 1.5s) │
Deployability: 92/100 (α=0.7) │
│ │
│ Completed in 92.1s │ tool-eval-bench v1.5.1 │
│ │
 Token Usage:
│ Total: 42,588 tokens │ Efficiency: 0.7 pts/1K tokens │
│ │

coder-next tool eval

Warm-up complete (151 ms)
 Engine: vLLM 0.20.2rc1.dev1+g54dc64d5d.d20260503

/models/Qwen/Qwen3-Coder-Next-FP8 via vllm @ http://192.168.88.138:8888
│ 15 scenarios v1.5.1 │

● TC-01 Direct Specialist Match ✅ PASS 2/2 2.3s ttft=233ms t2 Used get_weather with Berlin only.
● TC-02 Distractor Resistance ✅ PASS 2/2 2.8s ttft=251ms t2 Used only get_stock_price for AAPL.
● TC-03 Implicit Tool Need ❌ FAIL 0/2 2.1s ttft=242ms t2 Did not complete the contact lookup to email chain correctly.
● TC-04 Unit Handling ✅ PASS 2/2 1.0s ttft=231ms t2 Requested Tokyo weather in Fahrenheit explicitly.
● TC-05 Date and Time Parsing ✅ PASS 2/2 2.4s ttft=250ms t2 Parsed next Monday and included the requested meeting details.
● TC-06 Multi-Value Extraction ❌ FAIL 0/2 2.8s ttft=244ms t2 Did not split the translation request into two valid tool calls.
● TC-07 Search → Read → Act ✅ PASS 2/2 4.2s ttft=229ms t5 Completed the full four-step chain with the right data.
● TC-08 Conditional Branching ✅ PASS 2/2 2.4s ttft=237ms t3 Checked the weather first, then set the rainy-day reminder.
● TC-09 Parallel Independence ✅ PASS 2/2 2.4s ttft=236ms t2 Handled both independent tasks.
● TC-10 Trivial Knowledge ✅ PASS 2/2 0.3s ttft=151ms Answered directly without tool use.
● TC-11 Simple Math ⚠ PARTIAL 1/2 0.9s ttft=233ms t2 Reached for calculator on 15200 — correct answer but mental math was sufficient.
● TC-12 Impossible Request ✅ PASS 2/2 0.9s ttft=148ms Refused cleanly because no delete-email tool exists.
● TC-13 Empty Results ✅ PASS 2/2 1.4s ttft=229ms t2 Asked for clarification after the empty result.
● TC-14 Malformed Response ⚠ PARTIAL 1/2 1.5s ttft=253ms t2 Acknowledged the error but did not attempt an alternative source.
● TC-15 Conflicting Information ✅ PASS 2/2 3.5s ttft=250ms t3 Used the searched population value in the calculator.

Category Breakdown
│ Tool Selection │ 67% │ █████████████░░░░░░░ │ 4/6 │
│ Parameter Precision │ 67% │ █████████████░░░░░░░ │ 4/6 │
│ Multi-Step Chains │ 100% │ ████████████████████ │ 6/6 │
│ Restraint & Refusal │ 83% │ ████████████████░░░░ │ 5/6 │
│ Error Recovery │ 83% │ ████████████████░░░░ │ 5/6 │


Model: /models/Qwen/Qwen3-Coder-Next-FP8 │
Score: 80 / 100
Rating: ★★★★ Good
│ Engine: vLLM 0.20.2rc1.dev1+g54dc64d5d.d20260503 │
│ Quantization: FP8 │
│ Max context: 262,144 tokens │
│ │
│ ✅ 11 passed ⚠ 2 partial ❌ 2 failed │
Points: 24/30 │
│ │
Quality: 80/100 │
Responsiveness: 88/100 (median turn: 0.8s) │
Deployability: 82/100 (α=0.7) │
Weakest: A Tool Selection (67%) │
│ │
│ Completed in 31.1s │ tool-eval-bench v1.5.1 │
│ │
 Token Usage:
│ Total: 37,470 tokens │ Efficiency: 0.6 pts/1K tokens │

✓ Warm-up complete (1198 ms)
 Engine: vLLM 0.20.2rc1.dev22+g458339c47.d20260503

/models/Qwen/Qwen3-Coder-Next-FP8 via vllm @ http://192.168.88.138:8888
│ 15 scenarios v1.5.1

● TC-01 Direct Specialist Match ✅ PASS 2/2 3.5s ttft=981ms t2 Used get_weather with Berlin only.
● TC-02 Distractor Resistance ✅ PASS 2/2 1.6s ttft=247ms t2 Used only get_stock_price for AAPL.
● TC-03 Implicit Tool Need ✅ PASS 2/2 2.4s ttft=230ms t3 Looked up Sarah before sending the email.
● TC-04 Unit Handling ✅ PASS 2/2 1.1s ttft=232ms t2 Requested Tokyo weather in Fahrenheit explicitly.
● TC-05 Date and Time Parsing ❌ FAIL 0/2 2.3s ttft=251ms t2 Relative date or time parsing was incorrect.
● TC-06 Multi-Value Extraction ❌ FAIL 0/2 2.8s ttft=257ms t2 Did not split the translation request into two valid tool calls.
● TC-07 Search → Read → Act ✅ PASS 2/2 4.6s ttft=240ms t5 Completed the full four-step chain with the right data.
● TC-08 Conditional Branching ✅ PASS 2/2 2.4s ttft=250ms t3 Checked the weather first, then set the rainy-day reminder.
● TC-09 Parallel Independence ✅ PASS 2/2 2.3s ttft=235ms t2 Handled both independent tasks.
● TC-10 Trivial Knowledge ✅ PASS 2/2 0.3s ttft=161ms Answered directly without tool use.
● TC-11 Simple Math ⚠ PARTIAL 1/2 1.0s ttft=246ms t2 Reached for calculator on 15200 — correct answer but mental math was sufficient.
● TC-12 Impossible Request ✅ PASS 2/2 0.9s ttft=154ms Refused cleanly because no delete-email tool exists.
● TC-13 Empty Results ✅ PASS 2/2 1.4s ttft=234ms t2 Asked for clarification after the empty result.
● TC-14 Malformed Response ✅ PASS 2/2 1.3s ttft=263ms t2 Acknowledged the stock tool failure and handled it gracefully.
● TC-15 Conflicting Information ✅ PASS 2/2 2.3s ttft=257ms t3 Used the searched population value in the calculator.

Category Breakdown

│ Tool Selection │ 100% │ ████████████████████ │ 6/6 │
│ Parameter Precision │ 33% │ ██████░░░░░░░░░░░░░░ │ 2/6 │
│ Multi-Step Chains │ 100% │ ████████████████████ │ 6/6 │
│ Restraint & Refusal │ 83% │ ████████████████░░░░ │ 5/6 │
│ Error Recovery │ 100% │ ████████████████████ │ 6/6 │


Model: /models/Qwen/Qwen3-Coder-Next-FP8 │
Score: 83 / 100
Rating: ★★★★ Good
│ Engine: vLLM 0.20.2rc1.dev22+g458339c47.d20260503 │
│ Quantization: FP8 │
│ Max context: 262,144 tokens │
│ │
│ ✅ 12 passed ⚠ 1 partial ❌ 2 failed │
Points: 25/30 │
│ │
Quality: 83/100 │
Responsiveness: 89/100 (median turn: 0.8s) │
Deployability: 85/100 (α=0.7) │
Weakest: B Parameter Precision (33%) │
│ │
│ Completed in 30.0s │ tool-eval-bench v1.5.1 │
│ │
 Token Usage:
│ Total: 39,386 tokens │ Efficiency: 0.6 pts/1K tokens │
│ │
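Since different tests fail in each Coder-Next run, a per-scenario diff is easier to read than the two totals. Below is a minimal sketch that pulls the TC id and status out of pasted log lines and lists what changed between two runs. It only assumes the line shape seen in this thread ("TC-NN ... PASS|PARTIAL|FAIL ..."); `parse_results` and `diff_runs` are hypothetical helpers, not part of tool-eval-bench.

```python
import re

# Matches a scenario id and the first status keyword that follows it.
LINE_RE = re.compile(r"(TC-\d+)\b.*?\b(PASS|PARTIAL|FAIL)\b")

def parse_results(log: str) -> dict[str, str]:
    """Map each scenario id (e.g. 'TC-03') to its status in a pasted log."""
    results = {}
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match:
            results[match.group(1)] = match.group(2)
    return results

def diff_runs(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return the scenarios whose status differs between two runs."""
    shared = sorted(before.keys() & after.keys())
    return {tc: (before[tc], after[tc]) for tc in shared if before[tc] != after[tc]}

# Excerpts from the two Coder-Next runs above.
without_pr = """
● TC-03 Implicit Tool Need ❌ FAIL 0/2 2.1s ttft=242ms t2
● TC-05 Date and Time Parsing ✅ PASS 2/2 2.4s ttft=250ms t2
● TC-06 Multi-Value Extraction ❌ FAIL 0/2 2.8s ttft=244ms t2
● TC-14 Malformed Response ⚠ PARTIAL 1/2 1.5s ttft=253ms t2
"""
with_pr = """
● TC-03 Implicit Tool Need ✅ PASS 2/2 2.4s ttft=230ms t3
● TC-05 Date and Time Parsing ❌ FAIL 0/2 2.3s ttft=251ms t2
● TC-06 Multi-Value Extraction ❌ FAIL 0/2 2.8s ttft=257ms t2
● TC-14 Malformed Response ✅ PASS 2/2 1.3s ttft=263ms t2
"""

changes = diff_runs(parse_results(without_pr), parse_results(with_pr))
for tc, (old, new) in changes.items():
    print(f"{tc}: {old} -> {new}")
```

On these excerpts the diff shows TC-03 and TC-14 improving and TC-05 regressing with the PR, while TC-06 fails in both runs. Repeating each build over several seeds would show whether those flips are stable.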

Did you test with reasoning on? For me, this PR fixed cases in the pi agent where reasoning was interrupted or trimmed when interleaved with tool calls.

These tests were run with reasoning on.