Bfloat16 Quality = Speed?

I have been playing around with lots of interesting new models and techniques these past few weeks, and somehow lost sight of an important lesson I learned the last time I saw a bright, shiny new capability and started venturing down rabbit holes after it. Tokens-per-second benchmarks are meaningless. Fast tokens are useless for coding when they are low quality.

Quality = Speed

Spending hours fighting with a fast model setup isn’t coding, it’s just busywork IMO.

Basically: even though bfloat16 is expensive, we end up doing so little of it with each interaction that it ends up being relatively cheap. Where the losses occur also matters. They happen in what the model knows and in its understanding of how to do things, not in the instructions you give it or the code it loads into the context. The benefits of higher quality compound: better reasoning, less wasted effort and less oversight lead to a quicker overall result and a more pleasant day.

This is the recipe I have been using as a daily driver since Intel/Qwen3.5-122B-A10B-int4-AutoRound was released. This is my preferred setup. It is intended for long-running coding tasks. It runs at about 24 t/s all day.

spark-vllm-docker/recipes/qwen3.5-122b.yaml

# Recipe: Qwen3.5-122B-A10B-int4-AutoRound
# Qwen3.5-122B model in Intel INT4 AutoRound quantization

recipe_version: "1"
name: Qwen3.5-122B-A10B-int4-AutoRound
description: vLLM serving Qwen3.5-122B-A10B-int4-AutoRound

# HuggingFace model to download (optional, for --download-model)
model: Intel/Qwen3.5-122B-A10B-int4-AutoRound

solo_only: true

# Container image to use
container: vllm-node-tf5

build_args:
  - --tf5

mods:
  - mods/fix-qwen3.5-enhanced-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.82
  max_num_batched_tokens: 32768
  max-num-seqs: 8
  served_model_name: qwen/qwen3.5
  coding_coding: '{"temperature": 0.7, "min_p": 0.05, "top_p": 1.0, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'
  writing_config: '{"temperature": 0.9, "min_p": 0.05, "top_p": 1.0, "top_k": 20, "presence_penalty": 1.5, "repetition_penalty": 1.1}'

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --dtype bfloat16 \
  --port {port} \
  --host {host} \
  --load-format instanttensor \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template qwen3.5-enhanced.jinja \
  --reasoning-parser qwen3 \
  --generation-config auto \
  --override-generation-config '{coding_coding}'

Keeping quality loss at a minimum is why I think it works so well. Let me explain…

Community Supported

This recipe runs on @eugr’s spark-vllm-docker, a vLLM Docker setup optimized for the DGX Spark.

AutoRound

This is the only source of quality loss in this recipe. This loss is in what the model knows, and its understanding of how to do things – not the context.

Intel’s AutoRound works exceptionally well on the DGX Spark. Mixture-of-Experts models are notoriously sensitive to quantisation. AutoRound uses signed gradient descent to jointly optimise weight rounding and clipping ranges. This preserves the “distribution” of the weights rather than just the individual values, keeping the MoE logic intact even at 4-bit. INT4 weights take roughly a quarter of the space of the bfloat16 originals, so the Blackwell GPU needs far less bandwidth to pull them from the unified memory pool. Once they reach the GPU, the INT4 weights are dequantised to bfloat16 on the fly and the Tensor Cores do the actual math in bfloat16, giving the memory footprint of 4-bit with 16-bit arithmetic.
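To make the rounding idea concrete, here is a minimal sketch of sign gradient descent on rounding offsets. It is not Intel's implementation: it tunes only the rounding (not the clipping range), uses one scale for the whole weight vector, and the names `quantize` and `tune_rounding` are made up for illustration.

```python
import random

def quantize(w, v, scale, zero, qmax):
    """Quantise each weight with rounding offset v, then dequantise back to float."""
    out = []
    for wi, vi in zip(w, v):
        q = min(max(round(wi / scale + vi) + zero, 0), qmax)
        out.append((q - zero) * scale)
    return out

def output_error(x, wq, w):
    """Per-sample difference between the quantised and original layer output."""
    return [sum(xi * (a - b) for xi, a, b in zip(row, wq, w)) for row in x]

def tune_rounding(w, x, bits=4, steps=100):
    """Learn offsets in [-0.5, 0.5] that reduce output error (simplified sketch)."""
    qmax = 2 ** bits - 1
    scale = (max(w) - min(w)) / qmax      # ASSUMPTION: one scale per tensor
    zero = round(-min(w) / scale)
    lr = 1.0 / steps                      # fixed step size, sign of gradient only

    def loss(v):
        err = output_error(x, quantize(w, v, scale, zero, qmax), w)
        return sum(e * e for e in err) / len(err)

    v = [0.0] * len(w)                    # v = 0 is plain round-to-nearest
    rtn_loss = loss(v)
    best_v, best_loss = list(v), rtn_loss
    for _ in range(steps):
        err = output_error(x, quantize(w, v, scale, zero, qmax), w)
        for i in range(len(w)):
            # Straight-through estimator: treat round() as identity when differentiating.
            grad = scale * sum(row[i] * e for row, e in zip(x, err))
            sign = (grad > 0) - (grad < 0)
            v[i] = min(max(v[i] - lr * sign, -0.5), 0.5)
        cur = loss(v)
        if cur < best_loss:               # keep the best offsets seen so far
            best_v, best_loss = list(v), cur
    return best_v, rtn_loss, best_loss

rng = random.Random(0)
w = [rng.gauss(0, 1) for _ in range(16)]                        # toy weights
x = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(64)]   # toy calibration inputs
best_v, rtn_loss, tuned_loss = tune_rounding(w, x)
print(f"round-to-nearest loss {rtn_loss:.5f}, tuned loss {tuned_loss:.5f}")
```

The point of the sketch is that the objective is the layer's output on calibration data, not the distance between each weight and its rounded value, which is why the result can beat plain round-to-nearest.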

Loading the model requires 63GiB of memory, leaving ample room for the KV cache.

cd spark-vllm-docker
./hf-download.sh "Intel/Qwen3.5-122B-A10B-int4-AutoRound"
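As a rough cross-check of the 63GiB figure, the arithmetic below estimates raw weight storage and what is left for the KV cache. The 128 GiB total is an assumption about how much unified memory vLLM sees (the usable amount may be lower), and the estimate ignores quantisation scales and any layers left unquantised, which is one plausible reason the loaded size is larger than the raw INT4 number.

```python
GIB = 1024 ** 3

def weight_gib(params, bits_per_weight):
    """Raw weight storage in GiB, ignoring scales, zero points and unquantised layers."""
    return params * bits_per_weight / 8 / GIB

PARAMS = 122e9                  # total parameters, from the model name
bf16_gib = weight_gib(PARAMS, 16)
int4_gib = weight_gib(PARAMS, 4)

TOTAL_MEMORY_GIB = 128          # ASSUMPTION: unified memory visible to vLLM
GPU_MEMORY_UTILIZATION = 0.82   # from the recipe defaults
LOADED_MODEL_GIB = 63           # load size reported in the post

budget_gib = TOTAL_MEMORY_GIB * GPU_MEMORY_UTILIZATION
kv_room_gib = budget_gib - LOADED_MODEL_GIB

print(f"bf16 weights: {bf16_gib:.1f} GiB, int4 weights: {int4_gib:.1f} GiB")
print(f"vLLM budget: {budget_gib:.1f} GiB, room after loading: {kv_room_gib:.1f} GiB")
```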

dtype bfloat16

We make room with the quantisation so we can maintain quality in the KV cache.

The GB10 chip inside the DGX Spark (sm_121 architecture) has dedicated 5th Gen Tensor Cores designed for bfloat16 and FP4. While the weights are stored as INT4, the Tensor Cores perform the actual matrix multiplications in bfloat16, which maintains the high numerical stability required by the Mixture of Experts (MoE) architecture. Even though the model has 122B parameters, it only activates about 10B of them for any single token, significantly reducing the size of the activations (the temporary math) moving through the chip during the forward pass and leaving “breathing room” for everything else.
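A back-of-envelope way to see why the 10B active parameters matter: single-stream decoding is roughly bounded by how fast the active weights can be read from memory. The 273 GB/s bandwidth below is an assumed figure, and the estimate ignores KV cache reads, attention and kernel overheads, so treat it as a ceiling rather than a prediction.

```python
def decode_ceiling_tok_s(active_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

ACTIVE_PARAMS = 10e9      # active parameters per token, from the model name (A10B)
BANDWIDTH_GB_S = 273      # ASSUMPTION: unified memory bandwidth in GB/s

int4_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, 4, BANDWIDTH_GB_S)
bf16_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, 16, BANDWIDTH_GB_S)

print(f"INT4 weights: <= {int4_ceiling:.1f} tok/s")
print(f"bf16 weights: <= {bf16_ceiling:.2f} tok/s")
```

Under these assumptions the stored weight format sets the ceiling, while the bfloat16 arithmetic and bfloat16 KV cache do not change how many weight bytes have to be moved per token.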

Prefix caching

OpenCode sends mostly the same prompt repeatedly with new context appended to it. vLLM’s prefix caching keeps those tokens in memory and reuses them. It doesn’t create a new KV cache for the context window every time, which saves a massive amount of space. On normal coding tasks the prefix cache hit rate rises to over 90% very quickly. With 32K-token batches, you get the maximum available throughput. Because vLLM is caching nearly the entire context, it isn’t recalculating or duplicating the KV cache for the bulk of your data (your code) every time you interact with it, which reduces memory overhead and time to first token significantly. As the context grows, vLLM dynamically uses the remaining memory pool for the KV cache.
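Here is a toy model of why an append-only prompt gets such a high hit rate. It hashes the prompt in fixed-size blocks, each hash chained to everything before it, so only blocks after the first changed token miss. The block size of 16 and the function names are assumptions for illustration; this is not vLLM's actual code.

```python
import hashlib

BLOCK_SIZE = 16  # ASSUMPTION: tokens per cache block

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with everything before it (a hash chain)."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes

def cached_tokens(cache, tokens, block_size=BLOCK_SIZE):
    """Count leading tokens whose blocks are already cached, then cache this prompt."""
    hashes = block_hashes(tokens, block_size)
    hit_blocks = 0
    for h in hashes:
        if h not in cache:
            break
        hit_blocks += 1
    cache.update(hashes)
    return hit_blocks * block_size

cache = set()
turn1 = list(range(1000))                  # system prompt + tools + first request
turn2 = turn1 + list(range(5000, 5100))    # same prefix with 100 new tokens appended

first = cached_tokens(cache, turn1)
second = cached_tokens(cache, turn2)
hit_rate = second / len(turn2)
print(f"turn 1 reused {first} tokens, turn 2 reused {second} of {len(turn2)} ({hit_rate:.0%})")
```

Anything that edits the start of the prompt (a timestamp in the system prompt, reordered tool definitions) would break the chain at that point and drop the hit rate for everything after it.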

What Quality Looks Like

Where I see the difference is that at 130k context I can still give the model instructions and it follows them. It can still provide insights about the problem it’s solving. It can change direction. It remains on task. I can leave it to finish up on its own.

Qwen3.5 Tool Call Fix

This makes tool calling relatively flawless.

Download qwen3.5-enhanced.jinja

Create a mod directory in spark-vllm-docker/mods/fix-qwen3.5-enhanced-chat-template with the following files

  • qwen3.5-enhanced.jinja
  • run.sh
#!/bin/bash
set -e
cp qwen3.5-enhanced.jinja "$WORKSPACE_DIR/qwen3.5-enhanced.jinja"
echo "=======> to apply chat template, use --chat-template qwen3.5-enhanced.jinja"

Use either:

  • --tool-call-parser qwen3_coder for OpenCode
  • --tool-call-parser qwen3_xml for other coding harnesses

Launch it

run-qwen.sh

#!/bin/bash
cd ~/spark-vllm-docker
./run-recipe.sh qwen3.5-122b

stop-qwen.sh

#!/bin/bash
# Stop any running vLLM containers; do nothing if none are running
docker ps -q --filter "name=vllm" | xargs -r docker stop

OpenCode setup

.config/opencode/opencode.json

{
  "$schema": "https://opencode.ai/config.json",
  "model": "qwen/qwen3.5",
  "provider": {
    "local-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "vLLM (local)",
      "options": {
        "baseURL": "http://dgx-spark.local:8000/v1",
        "apiKey": "dummy-key"
      },
      "models": {
        "qwen/qwen3.5": {
          "name": "Qwen 3.5 (local)",
          "tool_call": true,
          "reasoning": true,
          "limit": {
            "context": 196608,
            "output": 16384
          },
          "modalities": {
            "input": [
              "text",
              "image"
            ],
            "output": [
              "text"
            ]
          }
        }
      }
    }
  }
}
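A small sanity check can save some head-scratching when a harness silently ignores its config. The sketch below parses the file as strict JSON and checks that the model entry exists and that its context limit does not exceed the recipe's max_model_len. The function name is made up, and strict JSON parsing may be pickier than the tool itself.

```python
import json

def check_opencode_config(text, served_model_name, max_model_len):
    """Return a list of problems found in an OpenCode config (empty if none)."""
    try:
        config = json.loads(text)  # strict JSON: trailing commas are rejected
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    problems = []
    matches = [
        provider["models"][served_model_name]
        for provider in config.get("provider", {}).values()
        if served_model_name in provider.get("models", {})
    ]
    if not matches:
        problems.append(f"no provider defines model {served_model_name!r}")
    for model in matches:
        context = model.get("limit", {}).get("context", 0)
        if context > max_model_len:
            problems.append(f"context limit {context} exceeds max_model_len {max_model_len}")
    return problems

sample = """
{
  "provider": {
    "local-vllm": {
      "models": {
        "qwen/qwen3.5": {"limit": {"context": 196608, "output": 16384}}
      }
    }
  }
}
"""
problems = check_opencode_config(sample, "qwen/qwen3.5", 196608)
print(problems or "config looks consistent with the recipe")
```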

This is why I started collecting some benchmarks. The focus on eking out a few more tokens per second by switching to a different quant, without any kind of accuracy measurement, seems odd to me.

These evals are far from perfect, but they should be better than nothing. If we run enough of them, it might become easier to see what accuracy impact a switch from NVFP4 to AutoRound, or quantizing the KV cache, might be having.

I saw your DGX Spark evals on the other page. I agree with your concept, and personally, if it were simpler to set up and run, I would do it in a heartbeat, but this is a bit too far outside my wheelhouse right now.

My problem is, when you have a 6 phase plan, 100K tokens spent, you want a model that can finish the job, not drop the ball. Which benchmark can measure that? I can do it live on my desk, load one model after another, see what happens, roll back, do it again. You get a pretty insightful view into their characteristics.

The smaller 35B/A3B FP8 + bfloat16 KV works well up to a point, but can’t handle the longer context. 122B/A10B INT4 + bfloat16 KV can.

Yeah, it’s definitely a bit fiddly. I wanted to keep it completely isolated (so it isn’t messing with anything on the host) and originally used a Docker container (which was a bit simpler), but too many evals need to spawn their own docker containers, so I had to change it to a VM.

It probably could be simplified (with a script, or if there’s a way to build images for multipass), but I want to try and collect some more numbers to see how useful it is in practice before spending too much effort on that. Compiling numbers for existing models is quite slow, but hopefully things will be easier when there isn’t such a “backlog”!

Thanks for the write-up! I don’t understand the obsession with getting the fastest t/s at all costs either :) and we have similar objectives it seems, apart from the fact that I’m trying to use 1 multimodal model for everything so that I can switch tasks without having to make any modification to the setup.

Here is some feedback, if I may.

You can drop this mod: mods/fix-qwen3.5-autoround. It’s now included in the docker.

You can enhance your OpenCode setup by putting the model parameters directly in the (sub-)agent (plan, build etc.) definition, and any additional agent that you define can include its own set of model params.
That gives you quick access to all these different types of agents straight from OpenCode, tuned like you need to have them.

Regarding benchmarking, the hardest part is being able to verify the end result, but I’m wondering if we could use an LLM to compare the final answer with one which is considered valid.
I’ve started to build a benchmarking dataset to measure speed differences between configurations. The baseline is public data which uses real coding conversations with 3 turns each, and I’ve added 2 turns + system context. At ~30k context at the last turn, that would still not be enough for you, but I’m sure you could easily reach 6 turns and 100k context.
And at the end of the process you could introduce the validation part.

Thanks, I removed that from my original post.

I use it with a lot of different coding harnesses, so having a ‘known’ default temp for Qwen for coding takes the guesswork and configuration overhead out of it for me.


I was discussing this with a client this morning. I think the real benefit of using bfloat16 is that it preserves quality where it actually matters. The coding harness and tool calls make up for losses in knowledge and know-how. Nothing can make up for forgetting instructions or important details. The attention heads and context window carry so much influence on model behaviour that these losses compound. Even FP8 KV is noticeable on the remaining task after 130k context.
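For a feel of how much coarser FP8 is than bfloat16, the sketch below rounds values to a given number of mantissa bits and reports the worst relative error. bfloat16 has 7 explicit mantissa bits and FP8 E4M3 has 3. Exponent range, subnormals and saturation are ignored, and this says nothing about how a particular model tolerates the error. It only shows the size of the rounding step.

```python
import math
import random

def round_to_mantissa(x, explicit_bits):
    """Round x to the nearest value with the given number of explicit mantissa bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (explicit_bits + 1)      # +1 for the implicit leading bit
    return math.ldexp(round(m * steps) / steps, e)

def worst_relative_error(values, explicit_bits):
    return max(abs(round_to_mantissa(v, explicit_bits) - v) / abs(v) for v in values)

rng = random.Random(0)
values = [rng.uniform(0.1, 10.0) for _ in range(10_000)]  # toy stand-in for KV entries

bf16_err = worst_relative_error(values, 7)   # bfloat16
fp8_err = worst_relative_error(values, 3)    # FP8 E4M3

print(f"bfloat16 worst relative error: {bf16_err:.4%}")
print(f"FP8 E4M3 worst relative error: {fp8_err:.4%}")
```

Every cached key and value picks up an error of this order when it is stored, and attention reads all of them back on every token, which fits the observation that the effect shows up late in long sessions.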

   Context Window etc.
   bfloat 16 - highest quality
┌─────────────────────────────────────────────────────────────────┐
│  Prefix Cache                    > 90%                          │
│  - System prompt                                                │
│  - Tool instructions                                            │
│  - User prompt                                                  │
│  - Tool call results                                            │
│  - Context is 30x more influential on model behavior            │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│  New Context (Actual Work)       < 10%                          │
└─────────────────────────────────────────────────────────────────┘

   Qwen3.5 122B A10B
   INT4 AutoRound - small quantization loss
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  What the model knows                                           │
│  What pre-training taught it to do                              │
│                                                                 │
│                                                                 │
│                                                                 │
│                                                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Not slow, not fast


╔══════════════════════════════════════════════════════╗
║  Benchmark: Qwen3.5  122B /w bfloat 16 -  2026-04-17
╚══════════════════════════════════════════════════════╝

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   9.43s = 27.1 tok/s
  [Code      ]   512 tokens in  18.84s = 27.1 tok/s
  [JSON      ]  1024 tokens in  37.54s = 27.2 tok/s
  [Math      ]    32 tokens in   1.26s = 25.2 tok/s
  [LongCode  ]  2048 tokens in  75.16s = 27.2 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   9.49s = 26.9 tok/s
  [Code      ]   512 tokens in  18.84s = 27.1 tok/s
  [JSON      ]  1024 tokens in  37.71s = 27.1 tok/s
  [Math      ]    32 tokens in   1.28s = 24.9 tok/s
  [LongCode  ]  2048 tokens in  75.48s = 27.1 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 18.5 tok/s (end-to-end)
  [req2 ]  1024 tokens = 18.5 tok/s (end-to-end)
  [req3 ]  1024 tokens = 18.5 tok/s (end-to-end)
  [req4 ]  1024 tokens = 18.5 tok/s (end-to-end)

  Total: 4096 tokens in 55.22s
  Total throughput: 74.1 tok/s (4 requests completed)
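The numbers in the log can be recomputed from the raw token counts and timings, which also shows how much extra throughput batching buys on this setup. Everything below uses figures from the log above. The slight difference from the printed 74.1 may be down to how the benchmark script rounds or truncates.

```python
def tok_per_s(tokens, seconds):
    """Throughput in tokens per second."""
    return tokens / seconds

sequential = tok_per_s(2048, 75.16)       # LongCode, run 1
aggregate = tok_per_s(4 * 1024, 55.22)    # four parallel requests
per_request = aggregate / 4
scaling = aggregate / sequential

print(f"sequential:  {sequential:.1f} tok/s")
print(f"aggregate:   {aggregate:.1f} tok/s across 4 requests")
print(f"per request: {per_request:.1f} tok/s")
print(f"batching gain: {scaling:.2f}x")
```

So each stream slows down under load, but four streams together get roughly 2.7 times the work done in the same wall-clock time.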

Qwen 3.6

Here is a same-ish recipe for Qwen 3.6 35B A3B FP8. While speculative decoding is mathematically lossless, I don’t recommend this model for large tasks with lots of instructions as it is prone to token and instruction entanglement.

# Recipe: Qwen3.6-35B-A3B-FP8
# Qwen3.6-35B-A3B model in FP8 quantization + MTP speculative decoding

recipe_version: "1"
name: Qwen3.6-35B-A3B-FP8
description: vLLM serving Qwen3.6-35B-A3B-FP8

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8

solo_only: true

# Container image to use
container: vllm-node-tf5

build_args:
  - --tf5

mods:
  - mods/fix-qwen3.5-enhanced-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.78
  max_num_batched_tokens: 32768
  max-num-seqs: 8
  served_model_name: qwen/qwen3.6
  speculative_config: '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --dtype bfloat16 \
  --port {port} \
  --host {host} \
  --load-format instanttensor \
  --attention-backend flash_attn \
  --speculative-config '{speculative_config}' \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template qwen3.5-enhanced.jinja \
  --reasoning-parser qwen3
  
#  --language-model-only

╔══════════════════════════════════════════════════════╗
║  Benchmark: qwen3.6  35B /w bfloat16 -  2026-04-17
╚══════════════════════════════════════════════════════╝

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   3.91s = 65.3 tok/s
  [Code      ]   512 tokens in   8.03s = 63.6 tok/s
  [JSON      ]  1024 tokens in  14.44s = 70.9 tok/s
  [Math      ]    32 tokens in    .53s = 60.2 tok/s
  [LongCode  ]  2048 tokens in  29.37s = 69.7 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   3.91s = 65.4 tok/s
  [Code      ]   512 tokens in   8.06s = 63.5 tok/s
  [JSON      ]  1024 tokens in  14.50s = 70.6 tok/s
  [Math      ]    32 tokens in    .53s = 59.9 tok/s
  [LongCode  ]  2048 tokens in  29.33s = 69.8 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 36.5 tok/s (end-to-end)
  [req2 ]  1024 tokens = 36.5 tok/s (end-to-end)
  [req3 ]  1024 tokens = 36.5 tok/s (end-to-end)
  [req4 ]  1024 tokens = 36.5 tok/s (end-to-end)

  Total: 4096 tokens in 28.06s
  Total throughput: 145.9 tok/s (4 requests completed)

Agreed on fp8_e4m3. I also noticed the problem at large contexts.
For OpenCode, my suggestion is just to make this specific harness more flexible, because the Build agent should not behave like the Plan one and model tuning can help there.

Same 122b bfloat16 Quality w/ MTP – 38% faster

Here is an updated recipe for those interested.

Step 1: Update spark-vllm-docker

#!/bin/bash

cd ~/spark-vllm-docker
git pull
./build-and-copy.sh --tf5

Step 2: Download Extended Calibration model

~ 4GiB smaller

cd ~/spark-vllm-docker
./hf-download.sh "shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC"

Step 3: Add New Recipe

~/spark-vllm-docker/recipes/qwen3.5-122b.yaml

# Recipe: shieldstar Qwen3.5-122B-A10B-int4-AutoRound-EC
# Extended Calibration (EC) INT4 AutoRound quantization of Qwen/Qwen3.5-122B-A10B,
# a 122B MoE (10B active) multimodal model. Drop-in replacement for 
# Intel/Qwen3.5-122B-A10B-int4-AutoRound with wider calibration settings for improved
# quality on long-context and reasoning-heavy workloads.

recipe_version: "1"
name: Qwen3.5-122B-A10B-int4-AutoRound
description: vLLM serving Qwen3.5-122B-A10B-int4-AutoRound

# HuggingFace model to download (optional, for --download-model)
model: shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC

solo_only: true

# Container image to use
container: vllm-node-tf5

build_args:
  - --tf5

mods:
  - mods/fix-qwen3.5-enhanced-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.80
  max_num_batched_tokens: 32768
  max-num-seqs: 8
  served_model_name: qwen/qwen3.5-122b
  speculative_config: '{"method": "mtp", "num_speculative_tokens": 2}'
  coding_config: '{"temperature": 0.7,  "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'
  writing_config: '{"temperature": 0.9,  "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5, "repetition_penalty": 1.1}'

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --dtype bfloat16 \
  --port {port} \
  --host {host} \
  --load-format instanttensor \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{speculative_config}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template qwen3.5-enhanced.jinja \
  --reasoning-parser qwen3 \
  --generation-config auto \
  --override-generation-config '{coding_config}'

#  --language-model-only

Benchmark

Average acceptance rates: 80% - 95%

╔══════════════════════════════════════════════════════╗
║  Benchmark: qwen3.5 122b /w bfloat16 & MTP  -  2026-04-20
╚══════════════════════════════════════════════════════╝

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   6.62s = 38.6 tok/s
  [Code      ]   512 tokens in  12.43s = 41.1 tok/s
  [JSON      ]  1024 tokens in  25.10s = 40.7 tok/s
  [Math      ]    32 tokens in    .90s = 35.3 tok/s
  [LongCode  ]  2048 tokens in  48.62s = 42.1 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   6.60s = 38.7 tok/s
  [Code      ]   512 tokens in  12.64s = 40.4 tok/s
  [JSON      ]  1024 tokens in  25.35s = 40.3 tok/s
  [Math      ]    32 tokens in    .90s = 35.3 tok/s
  [LongCode  ]  2048 tokens in  48.91s = 41.8 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 25.7 tok/s (end-to-end)
  [req2 ]  1024 tokens = 25.6 tok/s (end-to-end)
  [req3 ]  1024 tokens = 25.4 tok/s (end-to-end)
  [req4 ]  1024 tokens = 25.4 tok/s (end-to-end)
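To relate the acceptance rate to the measured speedup, here is the usual back-of-envelope estimate of how many tokens each verification step yields. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. Real acceptance is not independent, and drafting itself costs time, so the measured gain ends up lower.

```python
def expected_tokens_per_step(acceptance, num_speculative_tokens):
    """Expected tokens emitted per target-model step.

    ASSUMPTION: independent, identical acceptance probability per drafted token.
    The leading 1 is the token the target model always contributes.
    """
    return sum(acceptance ** i for i in range(num_speculative_tokens + 1))

low = expected_tokens_per_step(0.80, 2)    # 1 + 0.8 + 0.64
high = expected_tokens_per_step(0.95, 2)   # 1 + 0.95 + 0.9025
observed = 41.1 / 27.1                     # Code test with MTP vs without, from the logs

print(f"ideal tokens per step: {low:.2f} to {high:.2f}")
print(f"observed speedup on the Code test: {observed:.2f}x")
```

The gap between the ideal figure and the observed one is a rough measure of the overhead of drafting and verifying, which is worth keeping in mind before raising num_speculative_tokens.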

Thanks! I’m downloading this and I’ll be testing tomorrow. I was doing A-B testing with the new Qwen3.6-35B and the 122b with MTP from Albond’s. I’ll throw this in the mix to see which one works best for me :)

@whpthomas, in your experience would you say the Qwen 3.5 122B is worse than the Qwen 3.6 35B? Artificial Analysis has the 3.6 beating the 3.5 122B in almost every single benchmark. I plan to use it in OpenCode. What has your experience been so far?

My tests (European languages) show that Qwen3.6-35B-A3B-FP8 is much, much better than Qwen3.5-122B-A10B-int4-AutoRound. I don’t know if the quantization (int4-AutoRound) is what causes the difference, but it’s very evident for my use cases.

So in my project I am working with client data that needs to be air-gapped, so I code exclusively with my DGX Spark providing inference. I am writing workflow orchestration tools for document analysis where we are trying to achieve high reliability scores using generative adversarial network (GAN) style loops (think assessment-criteria → implement ↔ validate) for automation. We have workflows with 40+ steps where each step runs in a fresh context. For this work, Qwen3.6-35B-A3B-FP8 with --dtype bfloat16 is a strong candidate.

For my actual programming work I am generating code containing prompts, researching a codebase that also contains prompts. I went into it in more detail here:

Qwen3.6-35B-A3B-FP8 is not able to perform this task. I think the A3B is not sufficient for compartmentalising the difference between the system/user prompt and the prompts within the codebase itself. With this task, model quality is really obvious and very noticeable. This is not picked up by benchmarks. Many frontier models are bad at this too. I think a lot of inference providers are serving FP8 KV quants, and in my particular use case it shows. Running bfloat16 locally is an advantage in this respect. In the past I used to run the same task brief on multiple models on OpenRouter, just to find the ones that would stay on task and finish the job. It’s really frustrating to think you can trust a model, pay by the token, only to be 40 minutes into a session and have the model start looping (making a change, but wait! undoing the change, but wait! making the same change), or choosing its own adventure – “you’re absolutely right”.

When I talk about my experiences with quality, this is what I am talking about: working on multiple sessions in parallel, working closely on research and design, and expecting the model to complete a multi-phase plan with minimal supervision and be able to explain and justify the changes at the end. Yes, with orchestration, file-based state management and fresh context you can do that with Qwen3.5 122B at the moment.

Your requirements might be completely different.

For what it’s worth, this is the best recipe I have tried for Qwen3.6-35B. It’s performing well for me (on par with or better than 3.5-122b Hybrid, though I need to keep testing to have a final word on it).

Also, this Qwen3.6 MTP setup parallelises REALLY well. I can do two tasks at the same time and I don’t even notice I’m running two. I think that’s the biggest win. I was playing around with the DFlash speculative options for Qwen3.6 and did NOT like the results at all. Yeah, tok/s looked high, but the acceptance rate was on the floor. 15 tokens for a 10% acceptance rate is a lot of overhead to gain 15% of speed.

Under instruction pressure (200+ clauses) I found 3.5-122b Hybrid got opinionated (couldn’t course-correct mid-stream) and panicky (guesses a lot, makes random tactical decisions rather than thinking systematically). I suspect it’s losing detail in the KV. For me it’s really noticeable. Sessions start well, everything looks good, but the performance is really jagged: sometimes fast and competent, other times a huge waste of time.

Have you tried the dense Qwen3.5-27B or Gemma4-31B-it in similar scenarios?

Some of this may be inherent to MoE architecture. I do a lot with MoEs but then go back to a ~30B dense model and it feels like the output is just so much clearer on complex topics.

I tend to load a model up and program all day, and only change if there is a problem. Most days I default to Qwen 3.5 122b – call me a creature of habit.

I have said it elsewhere – my observations are probably loaded with confirmation bias. I am constantly improving my prompting / harness engineering techniques at the same time, learning to work within the limitations of self-hosted models rather than fighting with them. I am using an 8-step orchestration chain: questions → research → design → structure → multi-plan → acceptance-criteria → implement ↔ validate GAN loop, leading to early alignment, small prompts, improved reliability and task completion – so there are significant parallel improvements complementing the model and recipe gains.
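For anyone unfamiliar with the implement ↔ validate idea, here is a bare-bones sketch of such a loop. It is not the author's tooling: `implement` and `validate` are stand-in callables that would normally each call the model with a fresh context, and the stubs below exist only so the example runs.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    feedback: str = ""

def implement_validate(criteria, implement, validate, max_rounds=5):
    """Alternate a generator and a validator until the criteria pass.

    Each call receives only the criteria, the latest artefact and the latest
    feedback, standing in for a fresh context per step.
    """
    artefact, feedback = None, ""
    for round_no in range(1, max_rounds + 1):
        artefact = implement(criteria, artefact, feedback)
        verdict = validate(criteria, artefact)
        if verdict.passed:
            return artefact, round_no
        feedback = verdict.feedback
    raise RuntimeError(f"criteria not met after {max_rounds} rounds")

# Hypothetical stubs: the "implementer" adds one item per round,
# the "validator" reports the first acceptance criterion still missing.
def implement(criteria, artefact, feedback):
    done = list(artefact or [])
    done.append(feedback if feedback else criteria[0])
    return done

def validate(criteria, artefact):
    missing = [c for c in criteria if c not in artefact]
    return Verdict(not missing, missing[0] if missing else "")

artefact, rounds = implement_validate(["parse", "score", "report"], implement, validate)
print(f"finished in {rounds} rounds: {artefact}")
```

Keeping the validator separate from the implementer, and passing only short feedback between them, is what keeps each prompt small.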

So I guess the answer is – I have scripts for all of them but seldom switch because it’s not the only knife in the drawer. Once something works I tend to stick with it as a daily driver to get work done.

I try everything else that comes along that looks promising because I suspect there is still room for optimisation and improvement – and to support the important contributions others are making, even if in my case they don’t work out.

@whpthomas Do you use OpenCode with a harness like “Get ■■■■ Done” or “BMAD”, or something in that direction that you created yourself?

My main project, now 6 months old, is a harness built on an MVCC object-relational DB with a TypeScript-like expression syntax embedded in Markdown, resulting in an object-oriented first-class construct encapsulating the prompt and its code in an AI-native, business-oriented workspace. But its open source release is still some time away.

In the meantime, I took some of the core ideas and wrote a simple orchestration layer as an OpenCode plugin named Orca2 which “orchestrates agentic workflows via YAML definitions. It uses file-based state management to track progress, making workflows transparent, auditable, and resilient to process restarts. It also manages task list completion on a per-task basis using generated artefacts to manage state across multi-step loops.”

Just need to do a bit more testing to flesh out the examples – basically generalising my personal development toolset. I have meetings tomorrow, hope to release it by this coming Friday. I will post if you are interested?

I have started releasing business process OpenCode tools on npm and have a bunch more on the way.