Bfloat16 Quality = Speed?

So when I first set up my machine, --kv-cache-dtype fp8 was the default in all my recipes. I had no idea that bf16 was an option worth exploring. The consensus here, no doubt influenced by the Nvidia DGX Spark marketing, was that we should be aiming for --kv-cache-dtype NVFP4. Then I had a discussion with @eugr and @flash3 about all the tool call failures I was experiencing, and the eventual outcome was a suggestion to use bfloat16 with int4 AutoRound (which I had never heard of) with Qwen3 Coder Next. This was long before AutoRound had gained popularity. At the time, vllm support for this configuration seemed nascent.

Back in Feb the dominant conversations on this forum were almost exclusively focused on t/s. I found this a real time sink because almost nothing I tried could sustain long contexts, instruction following and tool calls. Then Qwen3.5 dropped, and while I was attempting to quantise it with AutoRound myself, Intel released their versions in quick succession.

It takes a lot of time to download models, wait 10 minutes for vllm to start up, run a 30 min process, tweak a setting, wait 10 minutes for vllm to start up again, re-run a 30 min process, compare results, then rinse and repeat.

It may seem perfunctory now to be discussing these settings, but back in Feb this really wasn’t a conversation that was being had – at least not one visible to me. I had clients who needed deliverables and I was lost down endless rabbit holes getting nowhere – mad how fast moving this has been.

I started this thread because I figured other members might also be tired of wasting time and getting nowhere. I have learned a lot from everyone in the meantime. I have a daily driver setup now that I am satisfied with, and I am still keenly experimenting with PrismaQuant v2, so I guess a foot in each camp ;)

Here is my current recipe, which has given me stable operation with pretty good t/s:

name: Qwen3.5-122B-A10B-int4-AutoRound

container: vllm-node-tf5

mods:
  - mods/fix-qwen3.5-enhanced-chat-template

model: Intel/Qwen3.5-122B-A10B-int4-AutoRound

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.82
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max-num-seqs: 1
  served_model_name: qwen35

env:
  HF_HUB_OFFLINE: 1
  TRANSFORMERS_OFFLINE: 1
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

command: |
  vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
  --served-model-name {served_model_name} \
  --host {host} \
  --port {port} \
  --max-model-len {max_model_len} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --load-format instanttensor \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --attention-backend FLASHINFER \
  --dtype bfloat16 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --generation-config auto \
  --speculative-config '{{"method": "mtp", "num_speculative_tokens": 1}}' \
  --override-generation-config '{{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}}' \
  --chat-template qwen3.5-enhanced.jinja \
  --max-num-seqs {max-num-seqs}

recipe_version: '1'
cluster_only: false
solo_only: true

@whpthomas for tool calling, have you ever switched --chat-template-content-format between string and openai and experimented with that? I just noticed that with the latest qwen-3.6-enhanced template vLLM auto-detects string mode, whereas before it was using openai mode, which had more problems during tool calling for me.

This type of post is why, in the IT industry, I don’t like it when people say “Best practice”. I encourage everyone to say “Lead Practice” instead.

Your example is the best one: “At the time, the lead practice was to run Bfloat16”. Everything evolves quickly! Now the lead practice might be different :)

Also, I’m learning that everyone’s workflow is different around here. The recipes that worked best for me were yours, but in my case (single user) tok/s matters more than in yours, where I saw you processing 15-20 parallel PDFs and benefiting from parallelism.

This has been a fun learning experience ;-)

No? please explain

the foundation is what’s wobbly.

LLMs are non-deterministic in content, response time, in basically everything. the structure of this jelly is being tinkered with on a weekly basis — there’s new DeltaNets and whatever else, which everyone then has to master overnight because it’s the hip thing right now. space problems at every corner, quantization breaks even the tooling, it’s like sticking a structurally stable cookie into the jelly and expecting it to wobble less afterwards.

DeepJelly recently dropped 1.6T of jelly, nobody knows where to put it. eating it normally isn’t an option. and over in the diet section, which is exactly where we are, people are thrilled they can even manage to nibble a bit of jelly off the tiniest fork. of course recipes get swapped. survival tips. nutshell at Cape Horn. might work out. but the attrition is high, since plenty of assumptions don’t pan out.

the jelly doesn’t make you smart. it only works when it’s corseted by tooling and system prompts. the best jelly is the one you don’t have to talk to, because otherwise you might get annoyed all over again. the life-time budget being burned through here is approaching the Guinness record for best phantom productivity ever. and that’s before we even get to the perception distortion, because some folks already use the jelly to button up their pants.

the question is: do I think it through myself for a moment, or do I ask the jelly? and how do I know the jelly is right if I haven’t thought it through myself beforehand. better to ask first — then new ideas come up.


and maybe I’ll tune it a bit more so it answers faster. but is it answering correctly then? for that I’d really need to be sure and ask the untuned version again beforehand, but was that one even right to begin with?

Personally I mostly use DGX Spark for coding, and what I am coding also uses DGX Spark for inference. Ultimately that’s applied research into agentic patterns that work reliably for air-gapped business automations. So long context, long running, instruction following, tool calling and multi-modal.

Just want to say that your example and recipes helped a lot.
Thank you for that.

With RLM and additional MCP tools we could get this model doing complex code tasks - 99.9% sure

The scope is wider than just serving it. Let’s also test which tools to pair it with and how this model can bring value.

Henry, I am trying to run your latest recipe with the latest Eugr setup, and it crashes hard. Any chance you have made some updates/changes not captured in this thread? I am on the latest OS release. Thanks

This is my daily driver at the moment.

# Recipe: Intel Qwen3.5-122B-A10B-int4-AutoRound

recipe_version: "1"
name: Qwen3.5-122B-A10B-int4-AutoRound
description: vLLM serving Qwen3.5-122B-A10B-int4-AutoRound

# HuggingFace model to download (optional, for --download-model)
model: Intel/Qwen3.5-122B-A10B-int4-AutoRound

solo_only: true

# Container image to use
container: vllm-node-tf5

build_args:
  - --tf5

mods:
  - mods/fix-qwen3.5-enhanced-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.76
  max_num_batched_tokens: 16384
  max-num-seqs: 16
  served_model_name: qwen/qwen3.5-122b
  speculative_config: '{"method": "mtp", "num_speculative_tokens": 3}'
  coding_config: '{"temperature": 0.7,  "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --dtype bfloat16 \
  --kv-cache-dtype fp8_e4m3 \
  --port {port} \
  --host {host} \
  --load-format instanttensor \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{speculative_config}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template qwen3.5-enhanced.jinja \
  --reasoning-parser qwen3 \
  --generation-config auto \
  --override-generation-config '{coding_config}'

#  --language-model-only

I ran this script today so I have the latest image.

~/update.sh

#!/bin/bash

# Use the same checkout as vllm.sh and bail out if it is missing
cd ~/spark-vllm-docker || exit 1
git pull
./build-and-copy.sh --tf5

I use this script to restart the vllm server if it crashes, which it sometimes does if I cancel concurrent multi-modal requests mid-stream.

~/vllm.sh

#!/bin/bash
#
# vllm.sh - Persistent vllm runner with auto-restart and graceful shutdown
#
# Usage:
#   ./vllm.sh start    - Start vllm in a loop
#   ./vllm.sh stop     - Gracefully stop vllm
#   ./vllm.sh status   - Check if vllm is running
#

RECIPE="qwen3.5-122b"
#RECIPE="qwen3.6-35b"
#RECIPE="qwen3.6-27b"
#RECIPE="qwen3.6-prisma"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
STOP_FILE="$SCRIPT_DIR/.vllm-stop-request"
RESTART_DELAY=5
RESTART_COUNT=0

# Echo function for console output only
echo_msg() {
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] $1"
}

# Start vllm and restart it whenever it exits, until a stop is requested
start_vllm() {
    # Loop instead of recursing so repeated restarts don't grow the call stack
    while true; do
        echo_msg "Starting vllm with $RECIPE"

        # Clear any previous stop request
        rm -f "$STOP_FILE"

        # Run the vllm script (blocks until the server exits)
        cd ~/spark-vllm-docker || exit 1
        ./run-recipe.sh "$RECIPE"
        EXIT_CODE=$?

        # Check if we should stop
        if [[ -f "$STOP_FILE" ]]; then
            echo_msg "Shutdown requested. Exiting loop."
            rm -f "$STOP_FILE"
            exit 0
        fi

        # Process crashed or exited unexpectedly
        echo_msg "vllm exited with code: $EXIT_CODE"

        RESTART_COUNT=$((RESTART_COUNT + 1))
        echo_msg "Restarting in $RESTART_DELAY seconds... (restart #$RESTART_COUNT)"
        sleep "$RESTART_DELAY"
    done
}

# Stop vllm gracefully using the correct Docker procedure
stop_vllm() {
    echo_msg "Stopping vllm..."
    
    # Create stop signal file to notify running loop
    touch "$STOP_FILE"
    
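    # Stop any running containers whose name matches "vllm"; skip if none are running
    local containers
    containers=$(docker ps -q --filter "name=vllm")
    if [[ -n "$containers" ]]; then
        docker stop $containers
    fi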

    echo_msg "vllm stop signal sent"
}

# Check status
status_vllm() {
    if docker ps --format '{{.Names}}' | grep -q '^vllm'; then
        echo "vllm is running"
        echo "Restart count: $RESTART_COUNT"
        return 0
    else
        echo "vllm is not running"
        return 1
    fi
}

# Main command handler
case "${1:-start}" in
    start)
        echo_msg "=== vllm loop runner starting ==="
        start_vllm
        ;;
    stop)
        stop_vllm
        ;;
    status)
        status_vllm
        ;;
    *)
        echo "Usage: $0 {start|stop|status}"
        exit 1
        ;;
esac
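
Since each restart means waiting on the order of 10 minutes before the server answers again, a readiness probe can save some guessing. This is only a sketch: wait_for_vllm is a made-up name, the timeout and polling interval are arbitrary, and it relies on the /health endpoint of the vLLM OpenAI-compatible server on port 8000 as set in the recipe.

```shell
#!/bin/bash
# Hypothetical helper, not part of spark-vllm-docker.
# wait_for_vllm [URL] [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls URL until it returns a successful HTTP status or the timeout expires.
wait_for_vllm() {
    local url=${1:-http://localhost:8000/health}
    local timeout=${2:-900}     # 15 minutes, an arbitrary default
    local interval=${3:-5}
    local waited=0

    while (( waited < timeout )); do
        # -s: quiet, -f: fail on HTTP errors, --max-time: don't hang on one probe
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            echo "vllm is ready after ~${waited}s"
            return 0
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done

    echo "vllm not ready after ${timeout}s" >&2
    return 1
}
```

Something like `wait_for_vllm && echo ready` in a second terminal would then block until the server is reachable. The reported time only counts the sleep intervals, so treat it as approximate.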

Thank you, I think the sanity bit was the rebuild of TF5, even though I did that 1-2 days ago.

Yeah something has been off this week. Things that worked before broke. Needed small changes.

BTW these settings:

  max_model_len: 196608
  gpu_memory_utilization: 0.76
  max_num_batched_tokens: 16384
  max-num-seqs: 16


have been rigorously tested on concurrent workloads. Context rot sets in at about 130k, so you don’t need more than 196K. A max_num_batched_tokens of 32768 is marginally faster but prone to OOM errors when saturated. 16 seqs in combination with 16384 runs the cache at about 86%, leaving headroom for surges.

You can push hard all day with this setup reliably on a single DGX Spark.
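
For a feel of what the batched token setting means for long prompts, here is some rough arithmetic. It assumes chunked prefill handles at most max_num_batched_tokens prompt tokens per scheduler step and that a single request has the whole budget to itself, which is a simplification under concurrent load. The function name is made up for illustration.

```shell
#!/bin/bash
# Rough arithmetic only, under the simplifying assumptions stated above.

# prefill_chunks PROMPT_TOKENS MAX_NUM_BATCHED_TOKENS
# Prints the number of steps needed to prefill the prompt (ceiling division).
prefill_chunks() {
    local prompt_tokens=$1
    local budget=$2
    echo $(( (prompt_tokens + budget - 1) / budget ))
}

prefill_chunks 130000 16384   # prompt near the ~130k mark: prints 8
prefill_chunks 196608 16384   # prompt filling max_model_len: prints 12
prefill_chunks 196608 32768   # same prompt with the larger budget: prints 6
```

Halving the step count is consistent with 32768 being marginally faster, at the cost of larger per-step memory spikes.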

From the documentation, vLLM basically converts the input to the openai format if that format is used; otherwise it just passes the content on as a plain string.

The format to render message content within a chat template.
- "string" will render the content as a string. Example: "Hello World"
- "openai" will render the content as a list of dictionaries, similar to OpenAI schema. Example: [{"type": "text", "text": "Hello world!"}]
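
To make the two shapes concrete, here is a small sketch that prints the same message content both ways. The function names are made up for illustration, and no JSON escaping is done, so it assumes the text contains no double quotes, backslashes or newlines. As I read the excerpt above, the setting controls what the chat template receives on the server side rather than what the client has to send, but treat that as my interpretation.

```shell
#!/bin/bash
# Illustration only: the two shapes a message's "content" can take.
# Assumption: the text needs no JSON escaping.

# "string" format: content is a plain JSON string
content_as_string() {
    printf '"%s"' "$1"
}

# "openai" format: content is a list of typed parts
content_as_openai_parts() {
    printf '[{"type": "text", "text": "%s"}]' "$1"
}

content_as_string "Hello World"; echo
content_as_openai_parts "Hello world!"; echo
```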

Ok, I understand now: it’s the difference between JSON format and markdown. I didn’t realise the term for this was chat-template-content-format, but that makes perfect sense now. So do coding harnesses like OpenCode have to be configured for this, or do they detect it? I know there is also --tool-call-parser qwen3_xml, but I could never get that to work, so I just assumed it was a protocol where the harness expects one or the other, and I never looked any deeper into it.

I haven’t done much testing with this difference yet. I just noticed it in the startup logs when I switched to the qwen-3.6 enhanced template; for the other templates it usually selected the openai mode automatically. In any case, this template did fix nearly all tool calling issues for me. Not sure yet if the template option has anything to do with it as well :)

Weirdly, I was using the 3.6 enhanced template up until last week, and after updating spark-vllm-docker everything broke. I went through disabling settings one by one, and removing the Qwen-3.6 enhanced template fixed my tool calls. So I just ran with that for the time being. A lot of this for me is trial and error. If it works, I try to leave it alone for as long as I can. Then Rob released PrismaQuant v2 and I updated and everything works slightly differently. I know I should probably save different docker images and pair them with specific models, but there have been so many great improvements lately I keep rolling the dice.

I usually go this route too, though I just number the new templates and cycle over the old images once I am no longer using them. It’s a habit that formed in the early days, when you had to compile everything, which took forever, and stuff broke a lot more often. So I just have a couple, i.e. vllm-node-tf5-X images, non-tf5 ones, and some special custom PR templates to run stuff like google MTP and so on :)
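
For anyone who wants to try the numbering approach, a minimal sketch could look like this. pin_image is a made-up helper, and the vllm-node-tf5-X naming simply follows the pattern mentioned above; it uses nothing beyond the standard docker tag command.

```shell
#!/bin/bash
# Hypothetical helper: keep a numbered copy of the current image before rebuilding.
# pin_image BASE_IMAGE NUMBER   e.g. pin_image vllm-node-tf5 3
pin_image() {
    local base=$1
    local number=$2
    if [[ -z $base || ! $number =~ ^[0-9]+$ ]]; then
        echo "usage: pin_image BASE_IMAGE NUMBER" >&2
        return 1
    fi
    # Adds a second name for the same image; the original name is kept
    docker tag "$base" "$base-$number"
}
```

Running `pin_image vllm-node-tf5 3` before a rebuild would keep the old build around as vllm-node-tf5-3, which a recipe could then presumably reference in its container field.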

Hey, folks! After extensive testing of MTP, I saw a drawback in setting num_speculative_tokens to more than 1: it breaks tool usage. I got a lot of opencode errors like “Expected ‘function.name’ to be a string.” which break the loop, and the model stops. Maybe you know a workaround for this behavior?
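
Not a fix, but one way to narrow this down is to run the same tool-calling prompts against different draft token counts and compare. The sketch below only builds the JSON value for --speculative-config; mtp_speculative_config is a made-up name, and the two values shown are the ones used in the recipes earlier in this thread.

```shell
#!/bin/bash
# Hypothetical helper: build the JSON passed to --speculative-config
# for a given number of MTP draft tokens (defaults to 1).
mtp_speculative_config() {
    local n=${1:-1}
    # Reject anything that is not a positive integer
    if ! [[ $n =~ ^[1-9][0-9]*$ ]]; then
        echo "num_speculative_tokens must be a positive integer, got: $n" >&2
        return 1
    fi
    printf '{"method": "mtp", "num_speculative_tokens": %d}\n' "$n"
}

mtp_speculative_config 1   # value from the first recipe in this thread
mtp_speculative_config 3   # value from the daily driver recipe
```

The output can be pasted into the speculative_config default of a recipe, or passed directly as `--speculative-config "$(mtp_speculative_config 1)"` when calling vllm serve by hand.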