I use OpenCode; that's what I use this patch with. Virtually no errors after making this change. Maybe it's got to do with what the harness expects, I don't know why. Use your judgment with your own setup. If tool calls start failing, you will know pretty quickly.
Start here: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub
There are plenty of recipes already in the install, but you can add my Qwen 3.6 35B recipe+mod if you like. (I bet eugr will have an official recipe soon.)
I have tested Qwen3.6-35B with llama.cpp and vLLM, and the best result I got was with llama.cpp using the Vulkan backend (score of 96); a launch sketch follows the table below.
| Run ID | Model | Score | Rating | Date |
|---|---|---|---|---|
| 2026-04-20T15-08-01Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 92 | Excellent | 2026-04-20T15:22:58 |
| 2026-04-20T14-25-50Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 91 | Excellent | 2026-04-20T14:40:55 |
| 2026-04-20T13-52-48Z_6a3f22 | Qwen/Qwen3.6-35B-A3B-FP8 | 91 | Excellent | 2026-04-20T14:06:20 |
| 2026-04-20T12-53-18Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | Excellent | 2026-04-20T13:08:28 |
| 2026-04-20T12-30-06Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 90 | Excellent | 2026-04-20T12:43:28 |
| 2026-04-20T12-15-29Z_179efc | mudler/Qwen3.6-35B-A3B-APEX-GGUF | 90 | Excellent | 2026-04-20T12:29:11 |
| 2026-04-20T11-57-49Z_179efc | mudler/Qwen3.6-35B-A3B-APEX-GGUF | 90 | Excellent | 2026-04-20T12:13:21 |
| 2026-04-20T11-29-55Z_54ddbe | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL | 92 | Excellent | 2026-04-20T11:47:22 |
| 2026-04-20T09-29-07Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | Excellent | 2026-04-20T09:44:14 |
| 2026-04-20T08-12-45Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 93 | Excellent | 2026-04-20T08:27:58 |
| 2026-04-20T07-56-44Z_e8d504 | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL | 95 | Excellent | 2026-04-20T08:12:08 |
| 2026-04-20T07-41-48Z_849f0b | unsloth/Qwen3.6-35B-A3B-GGUF:MXFP4_MOE | 92 | Excellent | 2026-04-20T07:55:12 |
| 2026-04-20T07-26-54Z_849f0b | unsloth/Qwen3.6-35B-A3B-GGUF:MXFP4_MOE | 93 | Excellent | 2026-04-20T07:40:33 |
| 2026-04-20T07-00-56Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 96 | Excellent | 2026-04-20T07:14:04 |
| 2026-04-20T06-47-05Z_fd8ebc | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M | 93 | Excellent | 2026-04-20T07:00:04 |
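For reference, here is roughly how I would launch the GGUF runs with the Vulkan backend. A minimal sketch, assuming a Vulkan-enabled llama.cpp build; the model path, context size, and port are placeholders, so adjust for your setup.

```bash
# Build llama.cpp with the Vulkan backend (one-time).
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Serve a GGUF quant on the OpenAI-compatible endpoint.
# Model path, context size, and port are illustrative.
./build/bin/llama-server \
  -m models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 99 \
  -c 32768 \
  --host 0.0.0.0 --port 8080
```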
On a DGX Spark?
Yeah, same here. The reason is quite simple: it affects the context. Essentially you are giving the model instructions to fix its output. Think of it like micromanagement. In any case, if it works, then that's what's necessary. For example, I had this in my system prompt: "Do not make things up if you do not know the answer", and that does wonders for hallucination. :-D
Yes, all these tests were done on an ASUS GX10. Vulkan can be faster than CUDA for some models (on llama.cpp).
Sorry, not sure what results you are referring to? TPS, or? What is "SCORE"?
It's the tool-eval-bench result (Tool-Call Benchmark); half of the posts in this topic have these bench results.
Just trying to figure out where the "faster" came from.
Faster in PP and TG, if we compare the CUDA and Vulkan llama.cpp backends.
Same, I'm testing a few other combos today, but this one does it for me. FP8, fast, and MTP enabled.
Briefly tried Qwen3.6-27B this morning and it's painfully slow. Not sure if there's an FP8 available; if there is, I'll try it tonight. But I think I will stick with 3.6-35B-A3B-FP8 until a quantized 122B is available. :)
FP8 has been available since 27B launched: Qwen/Qwen3.6-27B-FP8 · Hugging Face
For those who like to go down the rabbit hole:
Extended Calibration (EC) INT4 AutoRound quantization of Qwen/Qwen3.6-35B-A3B, a 35B MoE (3B active, 128 experts) multimodal model. Drop-in replacement for Intel/Qwen3.6-35B-A3B-int4-AutoRound with wider calibration settings for improved quality on long-context and reasoning-heavy workloads.
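A minimal serve sketch, assuming the EC checkpoint loads the same way as the Intel AutoRound original; the repo id is a placeholder here, and vLLM should pick the quantization up from the checkpoint config:

```bash
# <EC-int4-autoround-repo> is a placeholder for the EC quant's Hugging Face id;
# quantization is auto-detected from the checkpoint, so no extra flag is passed.
# Context length and tool flags mirror the recipes in this thread and are assumptions.
vllm serve <EC-int4-autoround-repo> \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```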
Passing the smoke test
CLEAN: 87/100
Good. 0 connection errors.
EC vs FP8 head-to-head (both on spark1)
| Metric | EC INT4 | FP8 | Δ |
|---|---|---|---|
| Score | 87 | 90 | -3 (within 2σ noise) |
| Pass/partial/fail | 54/12/3 | 57/10/2 | ~same |
| Quality | 87 | 90 | -3 |
| Responsiveness | 65 | 45 | +20 (EC faster) |
| Deployability | 80 | 76 | +4 (EC better) |
| TTFT single | 314 ms | 1344 ms | -76% (4.3× faster) |
| Single tg t/s | 71.3 | 52.3 | +36% |
| c2 tg t/s | 123.3 | 82.3 | +50% |
| c4 tg t/s | 173.3 | 104.4 | +66% |
| Median turn | 2.0 s | 3.8 s | -47% |
| Total eval | 608 s | 987 s | -38% |
| Weakest cat | L Toolset Scale 62% | K Safety 77% | |
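The c2/c4 rows are aggregate generation throughput at 2 and 4 concurrent requests. A rough sketch of how that kind of number can be reproduced against any of these OpenAI-compatible endpoints; the model name, prompt, and port are placeholders, and this only measures wall-clock time for the batch, not exact token counts:

```bash
# Fire N concurrent chat completions and time the batch.
# Aggregate t/s is then roughly (completion tokens per request * N) / elapsed seconds.
N=4
start=$(date +%s)
for i in $(seq 1 "$N"); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen35b","max_tokens":256,
         "messages":[{"role":"user","content":"Summarize what a KV cache is."}]}' \
    > /dev/null &
done
wait
echo "c$N batch finished in $(( $(date +%s) - start )) s"
```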
GPT got the Intel quant working with TP=2. A few notes:
- Have to patch in conch-triton-kernels. Did this with a separate Dockerfile using a current TF5 image:

```dockerfile
FROM vllm-tf5:20260423
RUN uv pip install -U conch-triton-kernels
```

- Obviously rename it:

```bash
docker build -t vllm-tf5:20260423-conch .
```

- Must set `--max-num-batched-tokens 2048`
- Cannot use `--enable-prefix-caching`
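Putting those notes together, a launch sketch under stated assumptions: it reuses the patched image name and the flags above, but the context length is an assumption and the dual-node/Ray wiring from the spark-vllm-docker setup is omitted.

```bash
# Run inside the patched container (vllm-tf5:20260423-conch) on the head node.
# Multi-node wiring (Ray, NCCL env) from the spark-vllm-docker scripts is not shown.
vllm serve Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 2048 \
  --max-model-len 65536
# Note: do not add --enable-prefix-caching; it did not work with this quant.
```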
Hi, I was having reboots and errors during heavy loads with big contexts and token counts. I came to the recipe below, which seems stable now, built for the latest firmware and patch level on the GX10:
```yaml
# Qwen/Qwen3.6-35B-A3B model in native FP8 format
recipe_version: "1"
name: Qwen35-35B-A3B

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true

# Container image to use
container: vllm-node-tf5

# Mod
mods:
  - mods/fix-qwen3.5-chat-template

defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.75   # Safe buffer for R590 driver overhead, also running other stuff
  max_model_len: 131072          # 128k context is the stability sweet spot; was having problems with 265k
  max_num_batched_tokens: 32768

env:
  # PyTorch/Triton stability (unquoted for cleanliness)
  TORCHINDUCTOR_MAX_AUTOTUNE: 0
  TRITON_MAX_AUTOTUNE: 0
  # Grace CPU pinning (essential for the '94C CPU' reboot bug)
  OMP_NUM_THREADS: 16
  VLLM_CPU_OMP_THREADS_BIND: 1

command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --served-model-name qwen35b \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --attention-backend flashinfer \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --kv-cache-dtype fp8 \
    --enable-chunked-prefill \
    --disable-custom-all-reduce \
    --trust-remote-code \
    --load-format fastsafetensors \
    --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --enable-prefix-caching
```
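To sanity-check tool calling against this endpoint, a quick curl sketch; the served model name `qwen35b` comes from the recipe above, while the `get_weather` tool is just a made-up example:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen35b",
        "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'
# If the qwen3_xml tool parser is working, the response should contain
# a tool_calls entry for get_weather instead of a plain text answer.
```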
Any tips? Did I miss something, is there a better way? The INT4 AutoRound gave too many errors for me; maybe there is something I should have done better with that. It's quicker but feels much lower in quality. Might just be me. This recipe is stable, no reboots yet, max 71 °C under heavy load, and still 51 t/s.
I was not able to run this successfully with TP=2 (dual-node setup). Did you get a chance to try dual node?
I am a bit fried; let's see what comes out of it :)
Root cause: the Marlin kernel needs `output_size_per_partition % 64 == 0`. Our A3B expert weights split across TP=2 give 32, so Marlin rejects them. Suggested fixes: `--quantization gptq`, or reduce TP.
Two fallback paths worth trying (sketches below):
- Pipeline parallel (PP=2): splits layers across nodes and does not split per-layer weights, so the Marlin divisibility constraint does not apply
- `--quantization gptq` on TP=2: different kernel path
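Sketches of those two fallback paths; the model repo is the Intel quant mentioned earlier, everything else (other serve flags, node wiring) is left out, so treat these as illustrations rather than tested commands.

```bash
# 1) Pipeline parallel across the two nodes instead of tensor parallel:
vllm serve Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --pipeline-parallel-size 2

# 2) Keep TP=2 but force the GPTQ kernel path instead of Marlin:
vllm serve Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --tensor-parallel-size 2 \
  --quantization gptq
```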
I tried this build and got some good results: +56% throughput and +63% prefill over my best build.
Hey,
It really feels like things are moving quite fast these days; just when you begin to get stable results with one model, a newer version comes along.
From what you're experiencing with Qwen3.5 and Gemma 4, it doesn't seem like anything is going wrong on your side. These models can still be a little inconsistent when it comes to calling tools properly. At times they may choose the wrong tool, return an incorrect format, or not behave as expected.
The newer Qwen3.6-35B-A3B-FP8 does look quite promising, especially with improvements in handling coding tasks and maintaining context across steps. This could make things smoother, though it might be a good idea to test it gradually rather than switching everything at once. If you'd like a simple way to try and compare different tools in one place, you could also have a look at this link.
Since it follows a similar structure, it should work reasonably well with vLLM, although a few small adjustments might still be needed to get the best results.
I'll create the recipe. What is your GitHub username, so I can credit you?