Qwen/Qwen3.6-35B-A3B (and FP8) has landed

I use OpenCode; that’s what I use this patch with. Virtually no errors after making this change. Maybe it’s got to do with what the harness expects; I don’t know why. Use your judgment with your own setup. If tool calls start failing, you’ll know pretty quickly.

2 Likes

Start here: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

There are plenty of recipes already in the install, but you can add my Qwen 3.6 35B recipe+mod if you like. (I bet eugr will have an official recipe soon.)

1 Like

I have tested Qwen3.6-35B with llama.cpp and vLLM, and the best result I got was with llama.cpp using the Vulkan backend (score: 96).

| Run ID | Model | Score | Rating | Date |
|---|---|---|---|---|
| 2026-04-20T15-08-01Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 92 | ★★★★★ Excellent | 2026-04-20T15:22:58 |
| 2026-04-20T14-25-50Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 91 | ★★★★★ Excellent | 2026-04-20T14:40:55 |
| 2026-04-20T13-52-48Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 91 | ★★★★★ Excellent | 2026-04-20T14:06:20 |
| 2026-04-20T12-53-18Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | ★★★★★ Excellent | 2026-04-20T13:08:28 |
| 2026-04-20T12-30-06Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 90 | ★★★★★ Excellent | 2026-04-20T12:43:28 |
| 2026-04-20T12-15-29Z_179efc | mudler/Qwen3.6-35B-A3B-APEX-GGUF | 90 | ★★★★★ Excellent | 2026-04-20T12:29:11 |
| 2026-04-20T11-57-49Z_179efc | mudler/Qwen3.6-35B-A3B-APEX-GGUF | 90 | ★★★★★ Excellent | 2026-04-20T12:13:21 |
| 2026-04-20T11-29-55Z_54ddbe | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL | 92 | ★★★★★ Excellent | 2026-04-20T11:47:22 |
| 2026-04-20T09-29-07Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | ★★★★★ Excellent | 2026-04-20T09:44:14 |
| 2026-04-20T08-12-45Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | ★★★★★ Excellent | 2026-04-20T08:27:58 |
| 2026-04-20T07-56-44Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 95 | ★★★★★ Excellent | 2026-04-20T08:12:08 |
| 2026-04-20T07-41-48Z_849f0b | unsloth/Qwen3.6-35B-A3B-GGUF:MXFP4_MOE | 92 | ★★★★★ Excellent | 2026-04-20T07:55:12 |
| 2026-04-20T07-26-54Z_849f0b | unsloth/Qwen3.6-35B-A3B-GGUF:MXFP4_MOE | 93 | ★★★★★ Excellent | 2026-04-20T07:40:33 |
| 2026-04-20T07-00-56Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 96 | ★★★★★ Excellent | 2026-04-20T07:14:04 |
| 2026-04-20T06-47-05Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 93 | ★★★★★ Excellent | 2026-04-20T07:00:04 |

2 Likes

On a DGX Spark?

1 Like

Yeah, same here. The reason is quite simple: it affects the context. Essentially you are giving the model instructions to fix its own output; think of it like micromanagement. In any case, if it works, that’s what matters. For example, I had “Do not make things up if you do not know the answer” in my system prompt, and that does wonders for hallucination. :-D
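If anyone wants to try the same trick, here is roughly what it looks like on the wire for any OpenAI-compatible endpoint. This is a minimal sketch; the model name and the user question are placeholders, not something from a real run:

```python
# Sketch: steering model output via a system prompt on an
# OpenAI-compatible chat endpoint. "qwen35b" and the user question
# are placeholders -- adapt to your own served-model-name and task.
payload = {
    "model": "qwen35b",
    "messages": [
        # The system turn sits in the context ahead of every user turn,
        # so the instruction nudges each generation.
        {"role": "system",
         "content": "Do not make things up if you do not know the answer."},
        {"role": "user",
         "content": "What is the memory bandwidth of the GX10?"},
    ],
}
```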

Yes, all these tests were done on an ASUS GX10. Vulkan can be faster than CUDA for some models (on llama.cpp).

Sorry, not sure which results you are referring to. TPS? And what is “SCORE”?

tool-eval-bench results (Tool-Call Benchmark); half the posts in this topic include these bench results.

1 Like

Just trying to figure out where the “faster” came from.

Faster in PP (prompt processing) and TG (token generation), comparing the CUDA and Vulkan llama.cpp backends.

Same here. I’m testing a few other combos today, but this one does it for me: FP8, fast, and MTP enabled.

Briefly tried Qwen3.6-27B this morning and it’s painfully slow. Not sure if there’s an FP8 available; if there is, I’ll try it tonight. But I think I’ll stick with 3.6-35B-A3B-FP8 until a quantized 122B is available. :)

FP8 has been available since the 27B launch: Qwen/Qwen3.6-27B-FP8 · Hugging Face

2 Likes

For those who like to go down the rabbit hole:

Extended Calibration (EC) INT4 AutoRound quantization of Qwen/Qwen3.6-35B-A3B, a 35B MoE (3B active, 128 experts) multimodal model. Drop-in replacement for Intel/Qwen3.6-35B-A3B-int4-AutoRound with wider calibration settings for improved quality on long-context and reasoning-heavy workloads.

Passing the smoke test

● CLEAN: 87/100 ★★★★ Good. 0 connection errors.

EC vs FP8 head-to-head (both on spark1):

| Metric | EC INT4 | FP8 | Δ |
|---|---|---|---|
| Score | 87 | 90 | -3 (within 2σ noise) |
| Pass/partial/fail | 54/12/3 | 57/10/2 | ~same |
| Quality | 87 | 90 | -3 |
| Responsiveness | 65 | 45 | +20 (EC faster) |
| Deployability | 80 | 76 | +4 (EC better) |
| TTFT single | 314 ms | 1344 ms | -76% (4.3× faster) |
| Single tg t/s | 71.3 | 52.3 | +36% |
| c2 tg t/s | 123.3 | 82.3 | +50% |
| c4 tg t/s | 173.3 | 104.4 | +66% |
| Median turn | 2.0 s | 3.8 s | -47% |
| Total eval | 608 s | 987 s | -38% |
| Weakest cat | L Toolset Scale 62% | K Safety 77% | — |
3 Likes

GPT got the Intel quant working with TP=2. A few notes:

1. Have to patch in conch-triton-kernels. Did this in a separate Dockerfile based on a current TF5 image:

   ```
   FROM vllm-tf5:20260423
   RUN uv pip install -U conch-triton-kernels
   ```

2. Then build it under a new tag:

   ```
   docker build -t vllm-tf5:20260423-conch .
   ```

3. Must set --max-num-batched-tokens 2048.
4. Cannot use --enable-prefix-caching.
1 Like

Hi, I was having reboots and errors during heavy loads with big context and token counts. I came to the recipe below; it seems stable now. Built for the latest firmware and patch level on the GX10.

# Qwen/Qwen3.6-35B-A3B model in native FP8 format

recipe_version: "1"
name: Qwen35-35B-A3B

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8

solo_only: true

# Container image to use
container: vllm-node-tf5

# Mod
mods:
  - mods/fix-qwen3.5-chat-template

defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.75   # Safe buffer for R590 driver overhead; also running other stuff
  max_model_len: 131072          # 128k context is the stability sweet spot; had problems at 256k
  max_num_batched_tokens: 32768

env:
  # PyTorch/Triton stability (unquoted for cleanliness)
  TORCHINDUCTOR_MAX_AUTOTUNE: 0
  TRITON_MAX_AUTOTUNE: 0
  # Grace CPU pinning (essential for the '94C CPU' reboot bug)
  OMP_NUM_THREADS: 16
  VLLM_CPU_OMP_THREADS_BIND: 1

command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --served-model-name qwen35b \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --attention-backend flashinfer \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --kv-cache-dtype fp8 \
    --enable-chunked-prefill \
    --disable-custom-all-reduce \
    --trust-remote-code \
    --load-format fastsafetensors \
    --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --enable-prefix-caching
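For anyone wondering how the {placeholders} in command: get filled in from defaults:, here is a sketch assuming the recipe runner does plain Python str.format substitution (which would also explain the doubled {{ }} around the JSON argument). The runner's actual mechanism may differ; this just illustrates the expansion:

```python
# Sketch: expanding the recipe's {placeholders} from its defaults block,
# assuming simple str.format substitution. The doubled {{ }} around the
# JSON literal escape to single braces after formatting.
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "gpu_memory_utilization": 0.75,
    "max_model_len": 131072,
    "max_num_batched_tokens": 32768,
}

template = (
    "vllm serve Qwen/Qwen3.6-35B-A3B-FP8 "
    "--host {host} --port {port} "
    "--max-model-len {max_model_len} "
    "--max-num-batched-tokens {max_num_batched_tokens} "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    '--default-chat-template-kwargs \'{{"preserve_thinking": true}}\''
)

cmd = template.format(**defaults)
print(cmd)
```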

Any tips? Did I miss something, or is there a better way? The INT4 AutoRound gave too many errors for me; maybe there is something I should have done better with it. It’s quicker, but it feels much lower in quality. Might just be me. This recipe is stable: no reboots yet, max 71°C under heavy load, and still 51 t/s.

I was not able to run this successfully with TP=2 (dual-node setup). Did you get a chance to try dual node?

I am a bit fried; let’s see what comes out of it :)

● Root cause: the Marlin kernel needs output_size_per_partition % 64 == 0. Our A3B expert weights split across TP=2 give 32, so Marlin rejects them. Suggested fixes: --quantization gptq, or reduce TP.

Two fallback paths worth trying:

1. Pipeline parallel (PP=2): splits layers across nodes rather than splitting per-layer weights, so the Marlin divisibility constraint doesn’t apply.
2. --quantization gptq on TP=2: a different kernel path.

I tried this build and got some good results: +56% throughput and +63% prefill over my best build.
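The divisibility constraint is easy to sanity-check on paper. A tiny sketch; the concrete sizes below are made-up illustrations, not the real A3B expert dimensions:

```python
# Sketch of the Marlin constraint described above: each TP shard's
# output dimension must be a multiple of 64. Sizes are illustrative,
# NOT the actual Qwen3.6-35B-A3B expert dims.
def marlin_shard_ok(output_size: int, tp: int, align: int = 64) -> bool:
    per_partition = output_size // tp
    return per_partition % align == 0

# A dim that is 64-aligned as a whole can still break when sharded:
assert marlin_shard_ok(1600, tp=1)      # 1600 % 64 == 0, fine
assert not marlin_shard_ok(1600, tp=2)  # 800 % 64 == 32, Marlin rejects
```

This is also why "reduce TP" is one of the suggested fixes: at TP=1 the same weight stays 64-aligned.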

2 Likes

Hey,

It really feels like things are moving quite fast these days: just when you begin to get stable results with one model, a newer version comes along.

From what you’re experiencing with Qwen3.5 and Gemma 4, it doesn’t seem like anything is going wrong on your side. These models can still be a little inconsistent when it comes to calling tools properly. At times they may choose the wrong tool, return an incorrect format, or not behave as expected.
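One cheap guard against those wrong-tool and wrong-format failures is to validate each tool call against the declared schema before executing it. A minimal sketch; the tool names and required fields here are hypothetical, not from any real harness:

```python
import json

# Hypothetical tool declarations (name -> required argument keys),
# in the spirit of the OpenAI-style tools format.
tools = {
    "read_file": {"required": ["path"]},
    "run_shell": {"required": ["command"]},
}

def validate_tool_call(name: str, arguments: str) -> bool:
    """Reject calls to unknown tools, malformed JSON, or missing args."""
    if name not in tools:
        return False
    try:
        args = json.loads(arguments)
    except json.JSONDecodeError:
        return False
    return all(key in args for key in tools[name]["required"])

assert validate_tool_call("read_file", '{"path": "a.txt"}')
assert not validate_tool_call("read_file", '{"file": "a.txt"}')  # wrong key
assert not validate_tool_call("fetch_url", '{}')                 # unknown tool
```

Rejected calls can be bounced back to the model as an error message, which most harnesses already do; it turns a silent bad action into a retry.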

The newer Qwen3.6-35B-A3B-FP8 does look quite promising, especially with improvements in handling coding tasks and maintaining context across steps. This could make things smoother, though it might be a good idea to test it gradually rather than switching everything at once. If you’d like a simple way to try and compare different tools in one place, you could also have a look at this link.

Since it follows a similar structure, it should work reasonably well with vLLM, although a few small adjustments might still be needed to get the best results.

1 Like

I’ll create the recipe, what is your GitHub username so I can credit you?

1 Like