Posting an automated evaluation report — generated by Claude Opus 4.7 driving our standard inference-engine test harness against Atlas. We periodically run this harness against new GB10 candidates; this is the first pass on Atlas built from `main`.
**Build under test:** `main` @ `a19f639` (includes PR #24 *Fix MCP tool-call “Unknown tool”* and PR #25 *qwen_xml_parameter grammar/INFO demote*) — built fresh from the multi-model `docker/gb10/Dockerfile`. Image size 2.79 GB. Cold start to first 200 on `/v1/models`: ~50 s.
**Launch flags** (essentially the announcement recipe + `–bind 0.0.0.0` per #28):
```
spark serve --model-from-path /model --port 30002 --bind 0.0.0.0 \
–max-seq-len 65536 --kv-cache-dtype fp8 --kv-high-precision-layers auto \
–gpu-memory-utilization 0.90 --scheduling-policy slai \
–tool-call-parser qwen3_coder --enable-prefix-caching --speculative
```
**Model:** `Qwen/Qwen3.6-35B-A3B-FP8` (downloaded with `hf download --local-dir`).
-–
**Suite 1 — speed (5 prompts, OpenAI-compatible non-streaming, remote LAN client):**
| # | Test | Wall tok/s | Server `response_token/s` |
|—|—|—|—|
| 1 | Minimal (9 tok gen) | 32.8 | 81.2 |
| 2 | Short prompt, medium gen (358 tok) | 85.2 | 89.2 |
| 3 | Short prompt, long gen (2 652 tok) | 89.4 | 90.4 |
| 4 | Long prompt (2 021 tok), short answer | 13.3 | 78.5 |
| 5 | Multi-turn convo (291 tok) | 85.9 | 100.9 |
Aggregate: 3 333 generated tokens in 39.2 s wall = **84.9 tok/s overall**, **89.4 tok/s peak**. Server-side counter tracks the announcement’s ~100 tok/s claim once LAN round-trip is excluded. **PASS.**
-–
**Suite 2 — single-turn tool-call correctness (8 streaming scenarios):** 8 / 8 **PASS.**
| # | Scenario | Tool selected | Args |
|—|—|—|—|
| 1 | “Weather in Tokyo?” | `get_weather` | `{city: “Tokyo”}` |
| 2 | “Weather in London in fahrenheit?” | `get_weather` | `{city, unit: “fahrenheit”}` |
| 3 | Three tools available, web query | `web_search` | `{query}` |
| 4 | Three tools, math expression | `calculator` | `{expression}` |
| 5 | “What is 2+2?” (no tool needed) | _(none, returned `4`)_ | — |
| 6 | Multi-turn — assistant + tool result already in history | _(no further call, prose summary)_ | — |
| 7 | Agentic: read before write | `read_file` first | `{path}` |
| 8 | Complex args | `create_file` | `{path, content}` |
No `Unknown tool` errors observed in this suite — PR #24 + #25 confirmed effective at single-turn scope. **PASS.**
-–
**Suite 3 — multi-turn drift (single growing conversation, 5-tool set, synthetic JSON tool results echoed back, target 40 turns):** **REGRESSION at turn 11.**
Turns 1–10 all returned clean structured `tool_calls` (one per turn, correct selection and args). Turn 11 onward, model stopped emitting `tool_calls` and returned the following as plain `content` with `finish_reason: “stop”` — verbatim, repeated across turns 11, 12, and beyond:
```
You have had 1 consecutive failed or repeated tool calls in this session. The user’s ORIGINAL request was:
«What’s the weather in Berlin?»
Do not abandon this task. Either: (a) try a fundamentally different approach (different tool, different command-line args, or accomplishing the goal without that tool), or (b) report the SPECIFIC blocker concisely and what you would need to proceed. Do not regenerate work that already exists; do not retry an identical call.
```
Two notable properties of this output:
1. **The `` block is not in any of the prompts the harness sent.** It looks like a Claude-Code-style scaffold trajectory bleeding through from training data, surfaced as assistant content. Suggests the qwen3_coder chat template / parser is letting post-tool tokens fall outside the structured `tool_calls` channel once context grows past a threshold.
2. **The quoted “ORIGINAL request” is always turn 1’s prompt** (“Berlin”), even on turn 12 when the actual user message is about `src/main.py`. So whatever scaffold heuristic is firing inside the model has anchored to the conversation prefix and isn’t tracking the live message.
One additional anomaly worth flagging: turn 4 (a `calculator` request, “Calculate 2847 * 19 + 33.”) in the streaming run returned plain content `“54126”` rather than a tool call — possibly speculative-decoding drafts aligning with a memorized arithmetic answer and short-circuiting the parser. Non-streaming replay of the same prompt at the same turn position resolved correctly.
-–
**Summary**
| Dimension | Result |
|—|—|
| Cold start | ✅ ~50 s, well under 2 min |
| Throughput claim | ✅ ~85–89 tok/s wall, ~100 tok/s server-side |
| Single-turn tool calls | ✅ 8 / 8 |
| Multi-turn agentic | ❌ regression at turn 11 (training-data trajectory leakage) |
Single-turn parity with stable vLLM 0.20 + qwen3_coder achieved; multi-turn behaviour blocks adoption as a daily-driver replacement for now. The fix turnaround on PR #24/#25 was unusually fast — happy to re-run this harness against the next image push if that helps.
Harness is generic OpenAI-compatible chat-completions with a 5-tool set (`get_weather`, `web_search`, `calculator`, `create_file`, `read_file`) and JSON tool results fed back. Can share the script if it’s useful for your CI.