I use OpenCode; that's what I use this patch with. Virtually no errors after making this change. Maybe it's got to do with what the harness expects, I don't know why. Use your judgment with your own setup. If tool calls start failing, you will know pretty quickly.
Start here: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub
There are plenty of recipes already in the install, but you can add my Qwen 3.6 35B recipe+mod if you like. (I bet eugr will have an official recipe soon.)
I have tested Qwen3.6-35B with llama.cpp and vLLM, and the best result I got was with llama.cpp using the Vulkan backend (score of 96); a launch sketch follows the table below.
| Run ID | Model | Score | Rating | Date |
|---|---|---|---|---|
| 2026-04-20T15-08-01Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 92 | Excellent | 2026-04-20T15:22:58 |
| 2026-04-20T14-25-50Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 91 | Excellent | 2026-04-20T14:40:55 |
| 2026-04-20T13-52-48Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 91 | Excellent | 2026-04-20T14:06:20 |
| 2026-04-20T12-53-18Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | Excellent | 2026-04-20T13:08:28 |
| 2026-04-20T12-30-06Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 90 | Excellent | 2026-04-20T12:43:28 |
| 2026-04-20T12-15-29Z_179efc | mudler/Qwen3.6-35B-A3B-APEX-GGUF | 90 | Excellent | 2026-04-20T12:29:11 |
| 2026-04-20T11-57-49Z_179efc | mudler/Qwen3.6-35B-A3B-APEX-GGUF | 90 | Excellent | 2026-04-20T12:13:21 |
| 2026-04-20T11-29-55Z_54ddbe | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL | 92 | Excellent | 2026-04-20T11:47:22 |
| 2026-04-20T09-29-07Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | Excellent | 2026-04-20T09:44:14 |
| 2026-04-20T08-12-45Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | Excellent | 2026-04-20T08:27:58 |
| 2026-04-20T07-56-44Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 95 | Excellent | 2026-04-20T08:12:08 |
| 2026-04-20T07-41-48Z_849f0b | unsloth/Qwen3.6-35B-A3B-GGUF:MXFP4_MOE | 92 | Excellent | 2026-04-20T07:55:12 |
| 2026-04-20T07-26-54Z_849f0b | unsloth/Qwen3.6-35B-A3B-GGUF:MXFP4_MOE | 93 | Excellent | 2026-04-20T07:40:33 |
| 2026-04-20T07-00-56Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 96 | Excellent | 2026-04-20T07:14:04 |
| 2026-04-20T06-47-05Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 93 | Excellent | 2026-04-20T07:00:04 |
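For reference, here is roughly how I would launch the GGUF runs with the Vulkan backend. A minimal sketch, assuming a Vulkan-enabled llama.cpp build; the model path, context size, and port are placeholders, so adjust for your setup.

```bash
# Build llama.cpp with the Vulkan backend (one-time).
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Serve a GGUF quant on the OpenAI-compatible endpoint.
# Model path, context size, and port are illustrative.
./build/bin/llama-server \
  -m models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 99 \
  -c 32768 \
  --host 0.0.0.0 --port 8080
```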
On a DGX Spark?
Yeah, same here. The reason is quite simple: it affects the context. Essentially you are giving the model instructions to fix its output. Think of it like micromanagement. In any case, if it works, then that's what's necessary. For example, I had this in my system prompt: "Do not make things up if you do not know the answer", and that does wonders for hallucination. :-D
Yes, all these tests were done on an ASUS GX10. Vulkan can be faster than CUDA for some models (on llama.cpp).
Sorry, not sure what results you are referring to? TPS, or? What is "SCORE"?
It's the tool-eval-bench result (Tool-Call Benchmark); half of the posts in this topic have these bench results.
Just trying to figure out where the "faster" came from.
Faster in PP and TG, if we compare the CUDA and Vulkan llama.cpp backends.
Same, I'm testing a few other combos today, but this one does it for me. FP8, fast, and MTP enabled.
Briefly tried Qwen3.6-27B this morning and it's painfully slow. Not sure if there's an FP8 available; if there is, I'll try it tonight. But I think I will stick with 3.6-35B-A3B-FP8 until a quantized 122B is available. :)
FP8 has been available since 27B launched: Qwen/Qwen3.6-27B-FP8 · Hugging Face
For those who like to go down the rabbit hole:
Extended Calibration (EC) INT4 AutoRound quantization of Qwen/Qwen3.6-35B-A3B, a 35B MoE (3B active, 128 experts) multimodal model. Drop-in replacement for Intel/Qwen3.6-35B-A3B-int4-AutoRound with wider calibration settings for improved quality on long-context and reasoning-heavy workloads.
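A minimal serve sketch, assuming the EC checkpoint loads the same way as the Intel AutoRound original; the repo id is a placeholder here, and vLLM should pick the quantization up from the checkpoint config:

```bash
# <EC-int4-autoround-repo> is a placeholder for the EC quant's Hugging Face id;
# quantization is auto-detected from the checkpoint, so no extra flag is passed.
# Context length and tool flags mirror the recipes in this thread and are assumptions.
vllm serve <EC-int4-autoround-repo> \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```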
Passing the smoke test
CLEAN: 87/100
Good. 0 connection errors.
EC vs FP8 head-to-head (both on spark1)
| Metric | EC INT4 | FP8 | Δ |
|---|---|---|---|
| Score | 87 | 90 | -3 (within 2σ noise) |
| Pass/partial/fail | 54/12/3 | 57/10/2 | ~same |
| Quality | 87 | 90 | -3 |
| Responsiveness | 65 | 45 | +20 (EC faster) |
| Deployability | 80 | 76 | +4 (EC better) |
| TTFT single | 314 ms | 1344 ms | -76% (4.3× faster) |
| Single tg t/s | 71.3 | 52.3 | +36% |
| c2 tg t/s | 123.3 | 82.3 | +50% |
| c4 tg t/s | 173.3 | 104.4 | +66% |
| Median turn | 2.0 s | 3.8 s | -47% |
| Total eval | 608 s | 987 s | -38% |
| Weakest cat | L Toolset Scale 62% | K Safety 77% | |
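The c2/c4 rows are aggregate generation throughput at 2 and 4 concurrent requests. A rough sketch of how that kind of number can be reproduced against any of these OpenAI-compatible endpoints; the model name, prompt, and port are placeholders, and this only measures wall-clock time for the batch, not exact token counts:

```bash
# Fire N concurrent chat completions and time the batch.
# Aggregate t/s is then roughly (completion tokens per request * N) / elapsed seconds.
N=4
start=$(date +%s)
for i in $(seq 1 "$N"); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen35b","max_tokens":256,
         "messages":[{"role":"user","content":"Summarize what a KV cache is."}]}' \
    > /dev/null &
done
wait
echo "c$N batch finished in $(( $(date +%s) - start )) s"
```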
GPT got the Intel quant working with TP=2. A few notes:
- Have to patch in conch-triton-kernels. Did this with a separate Dockerfile using a current TF5 image:

```dockerfile
FROM vllm-tf5:20260423
RUN uv pip install -U conch-triton-kernels
```

- Obviously rename it:

```bash
docker build -t vllm-tf5:20260423-conch .
```

- Must set `--max-num-batched-tokens 2048`
- Cannot use `--enable-prefix-caching`
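Putting those notes together, a launch sketch under stated assumptions: it reuses the patched image name and the flags above, but the context length is an assumption and the dual-node/Ray wiring from the spark-vllm-docker setup is omitted.

```bash
# Run inside the patched container (vllm-tf5:20260423-conch) on the head node.
# Multi-node wiring (Ray, NCCL env) from the spark-vllm-docker scripts is not shown.
vllm serve Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 2048 \
  --max-model-len 65536
# Note: do not add --enable-prefix-caching; it did not work with this quant.
```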
Hi, I was having reboots and errors during heavy loads with big contexts and token counts. I came to the recipe below, which seems stable now, built for the latest firmware and patch level on the GX10:
```yaml
# Qwen/Qwen3.6-35B-A3B model in native FP8 format
recipe_version: "1"
name: Qwen35-35B-A3B

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true

# Container image to use
container: vllm-node-tf5

# Mod
mods:
  - mods/fix-qwen3.5-chat-template

defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.75   # Safe buffer for R590 driver overhead, also running other stuff
  max_model_len: 131072          # 128k context is the stability sweet spot; was having problems with 265k
  max_num_batched_tokens: 32768

env:
  # PyTorch/Triton stability (unquoted for cleanliness)
  TORCHINDUCTOR_MAX_AUTOTUNE: 0
  TRITON_MAX_AUTOTUNE: 0
  # Grace CPU pinning (essential for the '94C CPU' reboot bug)
  OMP_NUM_THREADS: 16
  VLLM_CPU_OMP_THREADS_BIND: 1

command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --served-model-name qwen35b \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --attention-backend flashinfer \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --kv-cache-dtype fp8 \
    --enable-chunked-prefill \
    --disable-custom-all-reduce \
    --trust-remote-code \
    --load-format fastsafetensors \
    --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --enable-prefix-caching
```
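To sanity-check tool calling against this endpoint, a quick curl sketch; the served model name `qwen35b` comes from the recipe above, while the `get_weather` tool is just a made-up example:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen35b",
        "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'
# If the qwen3_xml tool parser is working, the response should contain
# a tool_calls entry for get_weather instead of a plain text answer.
```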
Any tips? Did I miss something, is there a better way? The INT4 AutoRound gave too many errors for me; maybe there is something I should have done better with that. It's quicker but feels much lower in quality. Might just be me. This recipe is stable, no reboots yet, max 71 °C under heavy load, and still 51 t/s.
I was not able to run this successfully with TP=2 (dual-node setup). Did you get a chance to try dual node?
I am a bit fried; let's see what comes out of it :)
Root cause: the Marlin kernel needs `output_size_per_partition % 64 == 0`. Our A3B expert weights split across TP=2 give 32, so Marlin rejects them. Suggested fixes: `--quantization gptq`, or reduce TP.
Two fallback paths worth trying (sketches below):
- Pipeline parallel (PP=2): splits layers across nodes and does not split per-layer weights, so the Marlin divisibility constraint does not apply
- `--quantization gptq` on TP=2: different kernel path
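Sketches of those two fallback paths; the model repo is the Intel quant mentioned earlier, everything else (other serve flags, node wiring) is left out, so treat these as illustrations rather than tested commands.

```bash
# 1) Pipeline parallel across the two nodes instead of tensor parallel:
vllm serve Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --pipeline-parallel-size 2

# 2) Keep TP=2 but force the GPTQ kernel path instead of Marlin:
vllm serve Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --tensor-parallel-size 2 \
  --quantization gptq
```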
I tried this build and got some good results: +56% throughput and +63% prefill over my best build.
Hey,
It really feels like things are moving quite fast these days; just when you begin to get stable results with one model, a newer version comes along.
From what you're experiencing with Qwen3.5 and Gemma 4, it doesn't seem like anything is going wrong on your side. These models can still be a little inconsistent when it comes to calling tools properly. At times they may choose the wrong tool, return an incorrect format, or not behave as expected.
The newer Qwen3.6-35B-A3B-FP8 does look quite promising, especially with improvements in handling coding tasks and maintaining context across steps. This could make things smoother, though it might be a good idea to test it gradually rather than switching everything at once. If you'd like a simple way to try and compare different tools in one place, you could also have a look at this link.
Since it follows a similar structure, it should work reasonably well with vLLM, although a few small adjustments might still be needed to get the best results.
I'll create the recipe. What is your GitHub username, so I can credit you?