I have been playing around with lots of interesting new models and techniques these past few weeks, and somehow lost sight of an important lesson I learned last time I saw a bright shiny new capability and started venturing down rabbit holes after it. Token per second benchmarks are meaningless. Fast tokens are useless for coding when they are low quality.
Quality = Speed
Spending hours fighting with a fast model setup isnβt coding, its just busy work IMO.
Basically: even thought bfloat 16 is expensive, we end up doing so little of it with each interaction that it ends up being relatively cheap. Where the losses occur also matters. They happen in what the model knows, and its understanding of how to do things. Not in the instructions you give it or the code it loads into the context. The benefits of higher quality compound. Better reasoning, less wastes effort, less oversight, leading to a quicker overall result and a more pleasant day.
This is the recipe I was using as a daily driver since Intel/Qwen3.5-122B-A10B-int4-AutoRound was release. This is my preferred setup. It is intended for long running coding tasks. It runs at about 24 t/s all day.
spark-vllm-docker/recipes/qwen3.5-122b.yaml
# Recipe: Qwen3.5-122B-A10B-iNT4-Autoround
# Qwen3.5-122B model in Intel INT4-Autoround quantization
recipe_version: "1"
name: Qwen3.5-122B-A10B-int4-AutoRound
description: vLLM serving Qwen3.5-122B-A10B-int4-AutoRound
# HuggingFace model to download (optional, for --download-model)
model: Intel/Qwen3.5-122B-A10B-int4-AutoRound
solo_only: true
# Container image to use
container: vllm-node-tf5
build_args:
- --tf5
mods:
- mods/fix-qwen3.5-enhanced-chat-template
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
max_model_len: 196608
gpu_memory_utilization: 0.82
max_num_batched_tokens: 32768
max-num-seqs: 8
served_model_name: qwen/qwen3.5
coding_coding: '{"temperature": 0.7, "min_p": 0.05, "top_p": 1.0, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'
writing_config: '{"temperature": 0.9, "min_p": 0.05, "top_p": 1.0, "top_k": 20, "presence_penalty": 1.5, "repetition_penalty": 1.1}'
# Environment variables
env:
VLLM_MARLIN_USE_ATOMIC_ADD: 1
# The vLLM serve command template
command: |
vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
--served-model-name {served_model_name} \
--max-model-len {max_model_len} \
--gpu-memory-utilization {gpu_memory_utilization} \
--max-num-batched-tokens {max_num_batched_tokens} \
--max-num-seqs {max-num-seqs} \
--dtype bfloat16 \
--port {port} \
--host {host} \
--load-format instanttensor \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--chat-template qwen3.5-enhanced.jinja \
--reasoning-parser qwen3 \
--generation-config auto \
--override-generation-config '{coding_coding}'
Keeping quality loss at a minimum is why I think it works so well. Let me explainβ¦
Community Supported
This recipe runs on @eugrβs spark-vllm-docker a vLLM docker optimized for the DGX spark
AutoRound
This is the only source of quality loss in this recipe. This loss is in what the model knows, and its understanding of how to do things β not the context.
Intelβs AutoRound works exceptionally well on the DGX Spark. Mixture-of-Experts models are notoriously sensitive to quantisation. AutoRound uses signed gradient descent to jointly optimise weight rounding and clipping ranges. This preserves the βdistributionβ of the weights rather than just the values, keeping the MoE logic intact even at 4-bit. The weights effectively halve the model size. The Blackwell GPU then needs less bandwidth to pull these weights from the unified pool. Once they reach the GPU, the Tensor Cores dequantises INT4 weights into bfloat16 on-the-fly for the actual math, giving the speed of 4-bit with the precision of 16-bit.
Loading the model requires 63GiB of memory, leaving ample room for the KV cache.
cd spark-vllm-docker
./hf-download.sh "Intel/Qwen3.5-122B-A10B-int4-AutoRound"
dtype bfloat16
We make room with the quantisation so we can maintain quality in the KV cache.
The GB10 chip inside the DGX Spark sm_121 architecture has dedicated 5th Gen Tensor Cores designed for bfloat16 and FP4. While the weights are stored as INT4, the Tensor Cores perform the actual matrix multiplications in bfloat16 which maintains the high numerical stability required by the Mixture of Experts (MoE) architecture. Even though the model has 122B parameters, it only activates a 10B of them for any single token, significantly reducing the size of the activations (the temporary math) moving through the chips during the forward pass, leaving βbreathing roomβ for everything else.
Prefix caching
OpenCode sends mostly the same prompt repeatedly with new context appended to it, vLLMβs Prefix Caching keeps those tokens in memory and reuses them. It doesnβt create a new KV cache for the context window every time, which saves a massive amount of space. On normal coding tasks the prefix cache hit rate raises to over 90% very quickly. With 32K token batches, you get maximum available throughput. Because vLLM is caching nearly the entire context, it isnβt recalculating or duplicating the KV cache for the bulk of your data (your code) every time you interact with it. Reducing memory overhead and time to first token significantly. As the context grows, vLLM dynamically leverages the remaining memory pool for the KV Cache.
What Quality Looks Like
Where I see the difference is at 130k I can still give the model instructions and it follows them. It can still provide insights about the problem its solving. It can change direction. It remains on task. I can leave it to finish up on its own.
Qwen3.5 Tool Call Fix
This makes tool calling relatively flawless.
Download qwen3.5-enhanced.jinja
Create a mod directory in spark-vllm-docker/mods/fix-qwen3.5-enhanced-chat-template with the following files
qwen3.5-enhanced.jinjarun.sh
#!/bin/bash
set -e
cp qwen3.5-enhanced.jinja $WORKSPACE_DIR/qwen3.5-enhanced.jinja
echo "=======> to apply chat template, use --chat-template qwen3.5-enhanced.jinja"
Use either:
--tool-call-parser qwen3_coderfor OpenCode--tool-call-parser qwen3_xmlfor other coding harnesses
Launch it
run-qwen.sh
#!/bin/bash
cd ~/spark-vllm-docker
./run-recipe.sh qwen3.5-122b
stop-qwen.sh
#!/bin/bash
docker stop $(docker ps -q --filter "name=vllm")
OpenCode setup
.config/opencode/opencode.json
{
"$schema": "https://opencode.ai/config.json",
"model": "qwen/qwen3.5",
"provider": {
"local-vllm": {
"npm": "@ai-sdk/openai-compatible",
"name": "vLLM (local)",
"options": {
"baseURL": "http://dgx-spark.local:8000/v1",
"apiKey": "dummy-key"
},
"models": {
"qwen/qwen3.5": {
"name": "Qwen 3.5 (local)",
"tool_call": true,
"reasoning": true,
"limit": {
"context": 196608,
"output": 16384
},
"modalities": {
"input": [
"text",
"image"
],
"output": [
"text"
]
},
}
}
}
}
}