Introducing Tool Eval Bench CLI

What DColt discovered is that switching OpenCode's client SDK from @ai-sdk/openai to @ai-sdk/anthropic changed how the SDK serializes tool definitions and parses responses before/after talking to vLLM.

Just a small clarification: I never tried @ai-sdk/openai; it was @ai-sdk/openai-compatible that I used originally.

I never considered that it could be a bug in the SDK itself, but it did feel very odd that changing the endpoint (/v1/chat/completions to /v1/messages) would make an actual difference in vLLM.

Thank you for solving that mystery for me!
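
For readers who have not compared the two wire formats, here is a minimal Python sketch of how the same tool definition is shaped for an OpenAI-style /v1/chat/completions request versus an Anthropic-style /v1/messages request. It only illustrates the difference in request shape; it is not the code of either AI SDK package, and the `get_weather` tool and the helper name are made up for illustration.

```python
from typing import Any


def openai_tool_to_anthropic(tool: dict[str, Any]) -> dict[str, Any]:
    """Reshape an OpenAI chat-completions tool into the Anthropic Messages shape.

    OpenAI nests the definition under "function" and calls the JSON Schema
    "parameters"; Anthropic keeps it flat and calls the schema "input_schema".
    """
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


# Hypothetical tool, used only to show the two shapes side by side.
openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
```

Because the two request shapes differ, a client-side serialization or parsing difference can change what the server receives even when the model and the vLLM build stay the same.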

Has anyone figured out what happened to prefix caching in the latest few vLLM 0.20 images? With it enabled, most tool calls break. I have not been able to trace this to anything specific; the only workaround I have found is turning it off.

This is a very useful tool! It really helps identify issues before running actual tasks.

I've noticed that in the latest vLLM versions, tool calls break specifically when MTP is enabled for the 27B model.
When running with Prefix Caching but without MTP, both 3.6-35B and 3.6-27B pass the test with 100/100 tool calls.

It seems that the issue might not be with the tools themselves, but rather a regression in the Prefix Caching + MTP combination in the new vLLM builds.

So far, I haven't encountered any tool-related issues on the latest vLLM as long as MTP is disabled.

Awesome, useful benchmark tool.

Would love to have a fusion of
Tool Eval Bench (quality) and llama-benchy (speed).

Ideally together with a nicer interface, maybe a GUI or a charmbracelet/gum TUI like in OpenClaw.

Of course, I would prefer it if eugr and serapis fused their work.

Until then, I have started working on a launcher for both: benchmark-models.sh

This is still a work in progress.

You can run it with all the options offered by llama-benchy and/or tool-eval-bench.

If you do not run the same tests over and over and do not know the available choices, just launch the script and it gives you a menu.

There are so many options that a new user simply does not know how and what to test. Being presented with a menu to choose from helps.

The menu could cover the most common benchmarks, maybe also the time it takes until a model is loaded, and, if desired, sharing the results on spark-arena.com.

If you do not add any options (apart from a model) to the command, it gives you the menu; if you pass settings for llama-benchy and/or tool-eval-bench, it runs the tests without the menu step.

./benchmark-models.sh

============================================================
  DGX Spark Benchmark — Interactive Setup
============================================================
  No flags given — launching the guided setup.
  (Pass any flag, e.g. --quick, or BENCH_NO_WIZARD=1 to skip. --help for CLI.)
  Tip: install gum for an arrow-key TUI:
       https://github.com/charmbracelet/gum#installation

◇ Step 1/4 — Which models? (13 available)

Pick models (space to toggle)
  [ 1] [x] GPT-OSS-120B
  [ 2] [ ] Nemotron-3-Nano-30B-A3B-NVFP4
  [ 3] [x] Nemotron-3-Nano-4B-FP8
  [ 4] [x] Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
  [ 5] [ ] Nemotron-3-Super-120B-A12B-NVFP4
  [ 6] [x] Qwen3-Coder-Next-FP8-Dynamic
  [ 7] [ ] Qwen3-Coder-Next-int4-AutoRound
  [ 8] [x] Qwen3-Omni-30B-A3B-Instruct
  [ 9] [ ] Qwen3-VL-30B-A3B-Instruct-FP8
  [10] [x] Qwen3.5-122B-A10B-int4-AutoRound
  [11] [x] Qwen3.5-35B-A3B-FP8
  [12] [x] Qwen3.6-35B-A3B-FP8
  [13] [x] Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive

  Enter numbers to toggle (e.g. "1 3 5"), "a"=all, "n"=none, ENTER=done
  > 2 5 7 9
  > 
    ┌──────────────────────────────────────────
    │ 9 models selected
  • GPT-OSS-120B
  • Nemotron-3-Nano-4B-FP8
  • Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
  • Qwen3-Coder-Next-FP8-Dynamic
  • Qwen3-Omni-30B-A3B-Instruct
  • Qwen3.5-122B-A10B-int4-AutoRound
  • Qwen3.5-35B-A3B-FP8
  • Qwen3.6-35B-A3B-FP8
  • Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
    └──────────────────────────────────────────

◇ Step 2/4 — Which tests?

Test mode
  [1] Speed only         — llama-benchy (pp/tg/depth)  (default)
  [2] Quality only       — tool-eval-bench (tool-call accuracy)
  [3] Speed AND Quality  — both passes on the same loaded model
  > 3
    ┌──────────────────────────────────────────
    │ llama-benchy (speed)  tool-eval-bench (quality)
    └──────────────────────────────────────────

◇ Step 3/4 — Speed profile (llama-benchy)

Profile
  [1] Medium Log    — pp2048 tg128 depth=0,16384  3 runs   (default, ~5 min/model)  (default)
  [2] Quick smoke   — pp2048 tg128 depth=0,16384  1 run    (fast sanity check)
  [3] Stress        — adds depth=32768  3 runs              (find memory bottleneck)
  [4] Extreme       — adds depth=65535  3 runs              (~200 page corpus)
  [5] Full sweep    — pp512+2048 tg128+256+512 depths 0-32k (broad)
  [6] Arena         — official spark-arena.com profile      (leaderboard submission)
  > 1
    ┌──────────────────────────────────────────
    │ Medium Log (default)
    └──────────────────────────────────────────

◇ Step 3/4 — Quality mode (tool-eval-bench)

Quality mode
  [1] Short    — 15 core scenarios   (~2-5 min/model)  (default)
  [2] Full     — 69 scenarios        (~15-30 min/model)
  [3] Hardmode — full + 5 adversarial scenarios
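
The "menu only when no flags are given" behavior described in this post can be sketched roughly as follows. This is a Python illustration of the decision rule, not the actual benchmark-models.sh (which is a shell script and was not posted); the function name is hypothetical, and treating a bare model name as a positional argument is an assumption based on the "apart from a model" remark.

```python
import os
from typing import Mapping, Sequence


def should_launch_wizard(
    args: Sequence[str], env: Mapping[str, str] = os.environ
) -> bool:
    """Return True when the guided menu should be shown.

    Rule taken from the banner text: any flag (e.g. --quick) or
    BENCH_NO_WIZARD=1 skips the menu. Arguments that do not start
    with "-" (assumed here to be model names) do not count as flags.
    """
    if env.get("BENCH_NO_WIZARD") == "1":
        return False
    return not any(arg.startswith("-") for arg in args)
```

Keeping the rule in one small predicate makes it easy to run the same script both interactively and from automation, where the menu must never block.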

Yes, I figured this out last night too. For me it is very bad when using dflash and a bit better if MTP is used. The issue goes away if there is no speculative decoding.

@eugr / @eugr_nv and I already collaborate! The --perf-integration flag uses llama-benchy under the hood for the performance benchmarks (and visibly says so in the README.md and in the tool itself). llama-benchy really sets the standard for model performance testing in this ecosystem, and I'm glad to support it.

We're also discussing deeper integration into the sparkrun ecosystem with @raphael.amorim. While I am not sure what that will look like in detail, I agree that things work best when they are compatible across the broader tooling ecosystem.

sparkrun maintainer here! Yup. I've discussed this with @raphael.amorim and we're working on integration into sparkrun and spark-arena.

Kind of hits home weirdly to see the dual identity: @eugr / @eugr_nv … will get used to it soon though ;-)

CLI support in sparkrun is coming soon as a precursor to integrated benchmark uploads to spark-arena. I'm also working on a bunch of improvements to the general flow for spark-arena benchmarks. Hoping to get some time this week to get some of that out there for people to test.

While we are on the topic of sparkrun, I would like to ask a question. I am sure this is something I have done wrong, but everything broke for me recently with sparkrun unless I use the ray backend. None of my recipes work anymore. Nothing changed other than an update to the latest sparkrun and a DGX update from NVIDIA. I have had to switch to v1 recipes and am using eugr's scripts, which work perfectly with --no-ray, so I know it's not a hardware issue. Thanks for this great toolset, BTW.

I'll send you a PM so we don't hijack the thread.

@serapis you should thank your lucky stars you're not in the same physical location as me right now - I could kiss you for this …

Running two configurations side by side and having this dashboard is a gift - thank you!

  • Paul

Haha. I'm glad you like this, too. It became my favorite thing to watch while my agents are getting work done. I'm a little weird that way ;-)

This has been exceptionally valuable for taking the guesswork out of fine-tuning parameters.

Could you please help me clarify something regarding the testing process?

When I run tests without specifying the context pressure, many models perform well (either fully or partially). However, once I apply the following parameters:
--short --seed 42 --context-pressure 0.75 --context-size [model-specific size]

Almost all models (Qwen 122B, Qwen 27B, Qwen 397B, MiniMax M2.7) fail every single test (0/15 pass). Interestingly, only Qwen 3.6-35B managed to complete 13 out of 15.

What am I missing here? Are there some nuances to how context pressure affects tool calling that I'm not accounting for, or are my expectations for these models incorrect?

That's exactly what this feature is supposed to do: fill up the context window and see how stable the model's behavior remains. Some models will do better, some worse. Given that there are too many parameters in play, I can't really give you an answer that explains why model A doesn't behave like model B under different circumstances.
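
As a rough worked example of what a pressure setting means in tokens: the sketch below assumes the pressure value is simply the fraction of the context window that gets filled. Whether tool-eval-bench counts the system prompt, tool definitions, or the reserved output length toward that budget is not stated in this thread, so treat the function as an illustration only.

```python
def filler_token_budget(context_size: int, pressure: float) -> int:
    """Tokens of filler needed to occupy `pressure` of the context window.

    Assumption: pressure is a plain fraction of the window (0.0 to 1.0).
    """
    if not 0.0 <= pressure <= 1.0:
        raise ValueError("pressure must be between 0.0 and 1.0")
    return int(context_size * pressure)


# Values from this thread: --context-pressure 0.75 with a 260000-token window.
budget = filler_token_budget(260_000, 0.75)
headroom = 260_000 - budget
```

Under that assumption, 0.75 on a 260,000-token window leaves about 65,000 tokens for the actual scenario, the tool definitions, and the model's reasoning and answer.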

Thanks for this feature! It's really convenient for research.

We all know that as context scales up, the quality of responses tends to degrade for many models (as seen on contextarena.ai). However, I was genuinely surprised by the performance gap within the same model family, specifically how the 27B model fails while the 35B handles it fine.

I'll try to gather more data for a deeper analysis. Thanks again for providing such a great research tool!

Are you using the same params for 27B and 35B?

Yes, I am definitely using identical (and most stable) parameters for both:
Same vLLM version, FP8 model quantization, BF16 KV cache, and MTP disabled.

That's why the discrepancy in their behavior seemed so strange to me. I need to investigate this further. The tool's convenience and capabilities make that much easier; I just need some more time to gather the data.

launch details

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/DATA/hf/models/:/models" ./launch-cluster.sh --no-ray -t vllm-node-201-v201-fi069-tf5:latest --apply-mod mods/drop-caches exec vllm serve -tp 2 --distributed-executor-backend ray --model /models/Qwen/Qwen3.6-35B-A3B-FP8 --max-model-len auto --gpu-memory-utilization 0.8 --port 8888 --host 0.0.0.0 --load-format instanttensor --enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder --trust-remote-code --reasoning-parser qwen3 --served-model-name my-qwen35 --attention-backend flashinfer --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' --max-num-batched-tokens 32768 --default-chat-template-kwargs '{"preserve_thinking": true}'

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/DATA/hf/models/:/models" ./launch-cluster.sh --no-ray -t vllm-node-201-v201-fi069-tf5:latest --apply-mod mods/drop-caches exec vllm serve -tp 2 --distributed-executor-backend ray --model /models/Qwen/Qwen3.6-27B-FP8 --max-model-len auto --gpu-memory-utilization 0.9 --port 8888 --host 0.0.0.0 --load-format instanttensor --enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder --trust-remote-code --reasoning-parser qwen3 --served-model-name my-qwen35 --attention-backend flashinfer --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' --max-num-batched-tokens 32768 --default-chat-template-kwargs '{"preserve_thinking": true}'

Some models/quants have lately been running into issues with prefix caching in vLLM. It might be worth a shot to disable it and see if that helps.

I'm getting some very strange, repetitive results: the tests consistently alternate between passing and failing.
tool-eval-bench --short --seed 42 --context-pressure-sweep 0.3-1.0 --scenarios TC-01 --sweep-steps 14 --context-size 260000

I've tried the following, but the outcome remains the same:

  • Running Qwen 397B both with and without prefix caching.

  • Testing a completely different architecture: MiniMax.

  • Switching from a self-built community image to the official NVIDIA version.

Nothing seems to change the result…

⚡ Sweep 1/14: 30% pressure █░░░░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 2/14: 35% pressure ██░░░░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 3/14: 41% pressure ████░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 4/14: 46% pressure █████░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 5/14: 52% pressure ███████░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 6/14: 57% pressure ████████░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 7/14: 62% pressure ██████████░░░░░░░░░░ ^C
Interrupted.
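
For reference, the percentages in this output are consistent with 14 evenly spaced levels between 0.3 and 1.0. The Python sketch below reproduces them under that assumption; linear spacing is inferred from the numbers shown, not confirmed from the tool-eval-bench source, and the function name is made up.

```python
def sweep_levels(start: float, stop: float, steps: int) -> list[float]:
    """Evenly spaced pressure levels from start to stop, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (stop - start) / (steps - 1)
    return [start + i * step for i in range(steps)]


# --context-pressure-sweep 0.3-1.0 --sweep-steps 14
levels = sweep_levels(0.3, 1.0, 14)
percents = [round(level * 100) for level in levels]

# The first seven values match the sweep output: 30, 35, 41, 46, 52, 57, 62.
print(percents[:7])
```

The step between levels is about 5.4 percentage points, which is roughly 14,000 tokens per step on a 260,000-token window.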

logs

tool-eval-bench --short --seed 42 --context-pressure-sweep 0.3-1.0 --scenarios TC-01 --sweep-steps 14 --context-size 260000 --base-url http://192.168.88.138:8888

🔧 Tool-Call Benchmark
Server: http://192.168.88.138:8888
Querying http://192.168.88.138:8888/v1/models … ✓ /models/qwen (alias: my-qwen35)

✓ Warm-up complete (295 ms)
🔍 Engine: vLLM 0.20.2.dev0+g132765e35.d20260506

⚡ Context Pressure Sweep — /models/qwen
Backend: vllm | Server: http://192.168.88.138:8888
Range: 30% → 100% | 14 levels | 1 scenario

Context window: 260,000 tokens

⚡ Sweep 1/14: 30% pressure █░░░░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 2/14: 35% pressure ██░░░░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 3/14: 41% pressure ████░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 4/14: 46% pressure █████░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 5/14: 52% pressure ███████░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 6/14: 57% pressure ████████░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 7/14: 62% pressure ██████████░░░░░░░░░░ ^C
Interrupted.

╭──────────── ⚡ Context Pressure Sweep Results ────────────╮
│ │
│ TC-01 │
│ 30% ✅ 100% ████████████████████ │
│ 35% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ 41% ✅ 100% ████████████████████ │
│ 46% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ 52% ✅ 100% ████████████████████ │
│ 57% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ │
│ Breaking point: 52% (all scenarios pass) │
│ Degradation: 35% (first partial/fail) │
│ │
╰──────────────────────────────────────────────────────────╯

tool-eval-bench --short --seed 42 --context-pressure-sweep 0.3-1.0 --scenarios TC-01 --sweep-steps 14 --context-size 260000 --base-url http://192.168.88.138:8888

🔧 Tool-Call Benchmark
Server: http://192.168.88.138:8888
Querying http://192.168.88.138:8888/v1/models … ✓ /models/qwen (alias: my-qwen35)

✓ Warm-up complete (1813 ms)
🔍 Engine: vLLM 0.20.2.dev0+g132765e35.d20260506

⚡ Context Pressure Sweep — /models/qwen
Backend: vllm | Server: http://192.168.88.138:8888
Range: 30% → 100% | 14 levels | 1 scenario

Context window: 260,000 tokens

⚡ Sweep 1/14: 30% pressure █░░░░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 2/14: 35% pressure ██░░░░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 3/14: 41% pressure ████░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 4/14: 46% pressure █████░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 5/14: 52% pressure ███████░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 6/14: 57% pressure ████████░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 7/14: 62% pressure ██████████░░░░░░░░░░ ^C
Interrupted.

╭──────────── ⚡ Context Pressure Sweep Results ────────────╮
│ │
│ TC-01 │
│ 30% ✅ 100% ████████████████████ │
│ 35% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ 41% ✅ 100% ████████████████████ │
│ 46% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ 52% ✅ 100% ████████████████████ │
│ 57% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ │
│ Breaking point: 52% (all scenarios pass) │
│ Degradation: 35% (first partial/fail) │
│ │
╰──────────────────────────────────────────────────────────╯

tool-eval-bench --short --seed 42 --context-pressure-sweep 0.3-1.0 --scenarios TC-01 --sweep-steps 14 --context-size 260000 --base-url http://192.168.88.138:8888

🔧 Tool-Call Benchmark
Server: http://192.168.88.138:8888
Querying http://192.168.88.138:8888/v1/models … ✓ /models/qwen (alias: my-minimax)

✓ Warm-up complete (20302 ms — JIT/CUDA graph compilation on first request)
🔍 Engine: vLLM 0.20.2.dev0+g132765e35.d20260506

⚡ Context Pressure Sweep — /models/qwen
Backend: vllm | Server: http://192.168.88.138:8888
Range: 30% → 100% | 14 levels | 1 scenario

Context window: 260,000 tokens

⚡ Sweep 1/14: 30% pressure █░░░░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 2/14: 35% pressure ██░░░░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 3/14: 41% pressure ████░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 4/14: 46% pressure █████░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 5/14: 52% pressure ███████░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 6/14: 57% pressure ████████░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 7/14: 62% pressure ██████████░░░░░░░░░░ ^C
Interrupted.

╭──────────── ⚡ Context Pressure Sweep Results ────────────╮
│ │
│ TC-01 │
│ 30% ✅ 100% ████████████████████ │
│ 35% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ 41% ✅ 100% ████████████████████ │
│ 46% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ 52% ✅ 100% ████████████████████ │
│ 57% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ │
│ Breaking point: 52% (all scenarios pass) │
│ Degradation: 35% (first partial/fail) │
│ │
╰──────────────────────────────────────────────────────────╯

tool-eval-bench --short --seed 42 --context-pressure-sweep 0.3-1.0 --scenarios TC-01 --sweep-steps 14 --context-size 260000 --base-url http://192.168.88.138:8888

🔧 Tool-Call Benchmark
Server: http://192.168.88.138:8888
Querying http://192.168.88.138:8888/v1/models … ✓ /models/Qwen/Qwen3-Coder-Next-FP8 (alias: my-qwen)

✓ Warm-up complete (756 ms)
🔍 Engine: vLLM 0.19.0+6bc3197f.nv26.04.48680843

⚡ Context Pressure Sweep — /models/Qwen/Qwen3-Coder-Next-FP8
Backend: vllm | Server: http://192.168.88.138:8888
Range: 30% → 100% | 14 levels | 1 scenario

Context window: 260,000 tokens

⚡ Sweep 1/14: 30% pressure █░░░░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 2/14: 35% pressure ██░░░░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 3/14: 41% pressure ████░░░░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 4/14: 46% pressure █████░░░░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 5/14: 52% pressure ███████░░░░░░░░░░░░░ ✅ 100%
⚡ Sweep 6/14: 57% pressure ████████░░░░░░░░░░░░ ❌ 0%
⚡ Sweep 7/14: 62% pressure ██████████░░░░░░░░░░ ^C
Interrupted.

╭──────────── ⚡ Context Pressure Sweep Results ────────────╮
│ │
│ TC-01 │
│ 30% ✅ 100% ████████████████████ │
│ 35% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ 41% ✅ 100% ████████████████████ │
│ 46% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ 52% ✅ 100% ████████████████████ │
│ 57% ❌ 0% ░░░░░░░░░░░░░░░░░░░░ │
│ │
│ Breaking point: 52% (all scenarios pass) │
│ Degradation: 35% (first partial/fail) │
│ │
╰──────────────────────────────────────────────────────────╯

launch commands

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/DATA/hf/models/Intel/Qwen3.5-397B-A17B-int4-AutoRound/:/models/qwen" ./launch-cluster.sh --no-ray --apply-mod mods/drop-caches -t vllm-node-201-v201-fi069-tf5:latest -e VLLM_MARLIN_USE_ATOMIC_ADD=1 exec vllm serve --model /models/qwen --max-model-len auto --gpu-memory-utilization 0.926 --port 8888 --host 0.0.0.0 --enable-prefix-caching --max-num-batched-tokens 4176 --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --trust-remote-code --served-model-name my-qwen35 -tp 2 --max-num-seqs 2 --distributed-executor-backend ray --load-format instanttensor --language-model-only --attention-backend flashinfer

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/DATA/hf/models/Intel/Qwen3.5-397B-A17B-int4-AutoRound/:/models/qwen" ./launch-cluster.sh --no-ray --apply-mod mods/drop-caches -t vllm-node-201-v201-fi069-tf5:latest -e VLLM_MARLIN_USE_ATOMIC_ADD=1 exec vllm serve --model /models/qwen --max-model-len auto --gpu-memory-utilization 0.926 --port 8888 --host 0.0.0.0 --max-num-batched-tokens 4176 --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --trust-remote-code --served-model-name my-qwen35 -tp 2 --max-num-seqs 2 --distributed-executor-backend ray --load-format instanttensor --language-model-only --attention-backend flashinfer

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/DATA/hf/models/cyankiwi/MiniMax-M2.7-AWQ-4bit/:/models/qwen" ./launch-cluster.sh -t vllm-node-201-v201-fi069-tf5:latest -e VLLM_MARLIN_USE_ATOMIC_ADD=1 exec vllm serve --model /models/qwen --max-model-len auto --gpu-memory-utilization 0.9 --port 8888 --host 0.0.0.0 --load-format instanttensor --enable-prefix-caching --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2 --served-model-name my-minimax -tp 2 --distributed-executor-backend ray --kv-cache-dtype bfloat16

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/DATA/hf/models/:/models" ./launch-cluster.sh --no-ray -t nvcr.io/nvidia/vllm:26.04-py3 exec vllm serve -tp 2 --distributed-executor-backend ray --model /models/Qwen/Qwen3-Coder-Next-FP8 --max-model-len auto --gpu-memory-utilization 0.8 --port 8888 --host 0.0.0.0 --enable-auto-tool-choice --tool-call-parser qwen3_coder --served-model-name my-qwen --attention-backend flashinfer

It's almost like the test is leaving some residual side effect on the model. If you repeat 35%, does the error recur, or is it a sequence effect? It looks suspiciously like a concurrency or race-condition bug somewhere in the test stack.
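
One way to separate a sequence effect from a genuine per-level failure is to repeat each level and run the levels in shuffled order, then see whether failures follow the pressure value or the position in the run. The Python sketch below only builds such a plan and checks a result list for strict alternation; it does not call tool-eval-bench, and both helper names are hypothetical. Each planned level could then be run on its own with the --context-pressure flag mentioned earlier in the thread.

```python
import random


def alternates_by_position(passed: list[bool]) -> bool:
    """True if every result differs from the one immediately before it."""
    return len(passed) >= 2 and all(a != b for a, b in zip(passed, passed[1:]))


def randomized_plan(levels: list[int], repeats: int, seed: int) -> list[int]:
    """Each level repeated `repeats` times, in a reproducible shuffled order."""
    plan = [level for level in levels for _ in range(repeats)]
    random.Random(seed).shuffle(plan)
    return plan


# Pass/fail per sweep step as posted above (30% through 57%).
observed = [True, False, True, False, True, False]
print(alternates_by_position(observed))  # True: strict pass/fail alternation

plan = randomized_plan([30, 35, 41, 46, 52, 57], repeats=2, seed=42)
```

If results still alternate by run position under a shuffled plan, the cause is more likely state carried between requests (client or server side) than the pressure level itself; if 35% fails on every repeat regardless of position, the level matters.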