Vulkan as an alternative backend for llama.cpp

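I compared the CUDA and Vulkan backends in llama.cpp, and overall Vulkan doesn’t look that bad.
Yes, prompt processing (pp) is lower in every one of these first tests, but token generation (tg) is at roughly the same level in most of them; Nemotron is the clear exception, at about a third of the CUDA speed.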

So the question is whether the CUDA implementation isn’t that good, or Vulkan actually isn’t that bad.

Qwen3-Coder-Next-UD-Q6_K_XL

model size params backend ngl fa dev mmap test t/s
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 0 CUDA0 0 pp512 1067.77 ± 6.33
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 0 CUDA0 0 tg128 36.48 ± 0.05
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 1 CUDA0 0 pp512 1062.46 ± 8.63
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 1 CUDA0 0 tg128 36.80 ± 0.05
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 0 Vulkan0 0 pp512 931.11 ± 18.88
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 0 Vulkan0 0 tg128 37.43 ± 0.25
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 938.06 ± 22.53
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 37.71 ± 0.25

Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-MXFP4_MOE

model size params backend ngl fa dev mmap test t/s
qwen3next 80B.A3B MXFP4 MoE 40.73 GiB 79.67 B CUDA,Vulkan 99 0 CUDA0 0 pp512 1441.97 ± 14.83
qwen3next 80B.A3B MXFP4 MoE 40.73 GiB 79.67 B CUDA,Vulkan 99 0 CUDA0 0 tg128 48.58 ± 0.13
qwen3next 80B.A3B MXFP4 MoE 40.73 GiB 79.67 B CUDA,Vulkan 99 1 CUDA0 0 pp512 1461.08 ± 17.49
qwen3next 80B.A3B MXFP4 MoE 40.73 GiB 79.67 B CUDA,Vulkan 99 1 CUDA0 0 tg128 48.91 ± 0.10
qwen3next 80B.A3B MXFP4 MoE 40.73 GiB 79.67 B CUDA,Vulkan 99 0 Vulkan0 0 pp512 1147.19 ± 4.43
qwen3next 80B.A3B MXFP4 MoE 40.73 GiB 79.67 B CUDA,Vulkan 99 0 Vulkan0 0 tg128 44.45 ± 0.37
qwen3next 80B.A3B MXFP4 MoE 40.73 GiB 79.67 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 1159.12 ± 24.79
qwen3next 80B.A3B MXFP4 MoE 40.73 GiB 79.67 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 43.07 ± 0.50

Qwen3.5-122B-A10B-GGUF_UD-Q5_K_XL

model size params backend ngl fa dev mmap test t/s
qwen35moe 122B.A10B Q5_K - Medium 85.60 GiB 122.11 B CUDA,Vulkan 99 0 CUDA0 0 pp512 645.58 ± 3.95
qwen35moe 122B.A10B Q5_K - Medium 85.60 GiB 122.11 B CUDA,Vulkan 99 0 CUDA0 0 tg128 20.68 ± 0.02
qwen35moe 122B.A10B Q5_K - Medium 85.60 GiB 122.11 B CUDA,Vulkan 99 1 CUDA0 0 pp512 663.31 ± 3.85
qwen35moe 122B.A10B Q5_K - Medium 85.60 GiB 122.11 B CUDA,Vulkan 99 1 CUDA0 0 tg128 20.79 ± 0.02
qwen35moe 122B.A10B Q5_K - Medium 85.60 GiB 122.11 B CUDA,Vulkan 99 0 Vulkan0 0 pp512 471.86 ± 2.16
qwen35moe 122B.A10B Q5_K - Medium 85.60 GiB 122.11 B CUDA,Vulkan 99 0 Vulkan0 0 tg128 18.20 ± 0.02
qwen35moe 122B.A10B Q5_K - Medium 85.60 GiB 122.11 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 473.46 ± 4.02
qwen35moe 122B.A10B Q5_K - Medium 85.60 GiB 122.11 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 18.54 ± 0.07

NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_UD-Q4_K_XL

model size params backend ngl fa dev mmap test t/s
nemotron_h_moe 120B.A12B Q4_K - Medium 78.02 GiB 120.67 B CUDA,Vulkan 99 0 CUDA0 0 pp512 524.26 ± 3.20
nemotron_h_moe 120B.A12B Q4_K - Medium 78.02 GiB 120.67 B CUDA,Vulkan 99 0 CUDA0 0 tg128 16.13 ± 0.01
nemotron_h_moe 120B.A12B Q4_K - Medium 78.02 GiB 120.67 B CUDA,Vulkan 99 1 CUDA0 0 pp512 525.09 ± 1.11
nemotron_h_moe 120B.A12B Q4_K - Medium 78.02 GiB 120.67 B CUDA,Vulkan 99 1 CUDA0 0 tg128 16.13 ± 0.01
nemotron_h_moe 120B.A12B Q4_K - Medium 78.02 GiB 120.67 B CUDA,Vulkan 99 0 Vulkan0 0 pp512 359.69 ± 3.03
nemotron_h_moe 120B.A12B Q4_K - Medium 78.02 GiB 120.67 B CUDA,Vulkan 99 0 Vulkan0 0 tg128 5.33 ± 0.02
nemotron_h_moe 120B.A12B Q4_K - Medium 78.02 GiB 120.67 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 359.10 ± 5.42
nemotron_h_moe 120B.A12B Q4_K - Medium 78.02 GiB 120.67 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 5.36 ± 0.03

GLM-4.7-Flash-GGUF_GLM-4.7-Flash-MXFP4_MOE

model size params backend ngl fa dev mmap test t/s
deepseek2 30B.A3B MXFP4 MoE 15.79 GiB 29.94 B CUDA,Vulkan 99 0 CUDA0 0 pp512 2314.64 ± 68.49
deepseek2 30B.A3B MXFP4 MoE 15.79 GiB 29.94 B CUDA,Vulkan 99 0 CUDA0 0 tg128 57.06 ± 0.15
deepseek2 30B.A3B MXFP4 MoE 15.79 GiB 29.94 B CUDA,Vulkan 99 1 CUDA0 0 pp512 2565.49 ± 28.21
deepseek2 30B.A3B MXFP4 MoE 15.79 GiB 29.94 B CUDA,Vulkan 99 1 CUDA0 0 tg128 59.53 ± 0.14
deepseek2 30B.A3B MXFP4 MoE 15.79 GiB 29.94 B CUDA,Vulkan 99 0 Vulkan0 0 pp512 1656.79 ± 13.19
deepseek2 30B.A3B MXFP4 MoE 15.79 GiB 29.94 B CUDA,Vulkan 99 0 Vulkan0 0 tg128 56.87 ± 2.47
deepseek2 30B.A3B MXFP4 MoE 15.79 GiB 29.94 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 1658.20 ± 25.39
deepseek2 30B.A3B MXFP4 MoE 15.79 GiB 29.94 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 55.87 ± 0.24

gpt-oss-20b

model size params backend ngl fa dev mmap test t/s
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 0 CUDA0 0 pp512 3088.68 ± 47.77
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 0 CUDA0 0 tg128 66.65 ± 0.19
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 4005.98 ± 94.33
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 69.11 ± 0.14
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 0 Vulkan0 0 pp512 1778.36 ± 54.59
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 0 Vulkan0 0 tg128 57.64 ± 1.51
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 2070.19 ± 19.01
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 56.78 ± 0.50
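
To put numbers on "pp lower, tg about the same", here is a small sketch that computes Vulkan/CUDA throughput ratios from the fa=1 rows of the tables above. The dictionary layout and the short model labels are only for illustration; the t/s values are copied from the tables.

```python
# t/s values copied from the fa=1 rows above, as (CUDA0, Vulkan0) pairs.
results = {
    "Qwen3-Coder-Next Q6_K":  {"pp512": (1062.46, 938.06),  "tg128": (36.80, 37.71)},
    "Qwen3-Coder-Next MXFP4": {"pp512": (1461.08, 1159.12), "tg128": (48.91, 43.07)},
    "Qwen3.5-122B-A10B Q5_K": {"pp512": (663.31, 473.46),   "tg128": (20.79, 18.54)},
    "Nemotron-3-Super Q4_K":  {"pp512": (525.09, 359.10),   "tg128": (16.13, 5.36)},
    "GLM-4.7-Flash MXFP4":    {"pp512": (2565.49, 1658.20), "tg128": (59.53, 55.87)},
    "gpt-oss-20b MXFP4":      {"pp512": (4005.98, 2070.19), "tg128": (69.11, 56.78)},
}

def vulkan_over_cuda(results):
    """Return {model: {test: Vulkan t/s divided by CUDA t/s}}."""
    return {
        model: {test: vk / cuda for test, (cuda, vk) in tests.items()}
        for model, tests in results.items()
    }

ratios = vulkan_over_cuda(results)
for model, r in ratios.items():
    print(f"{model:24s} pp512 {r['pp512']:.2f}x  tg128 {r['tg128']:.2f}x")
```

On these numbers pp comes out at roughly 0.5x to 0.9x of CUDA, and tg at 0.8x to 1.0x, except Nemotron at about 0.33x.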

Also, setting the environment variable GGML_VK_PREFER_HOST_MEMORY=1 can add 10-15% in tg on the Vulkan backend.
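
For anyone who wants to try it, a rough sketch of how a run with and without the variable could be scripted. This is not necessarily the command used for the tables here: the flags are my guess from the table columns (ngl, fa, dev, mmap, test depth), the model file name is hypothetical, and it is worth checking `llama-bench --help` on your build before relying on any of them.

```python
import os
import shutil
import subprocess

def bench_command(model_path, device="Vulkan0", prefer_host_memory=True):
    """Build (argv, env) for one llama-bench run; nothing is executed here."""
    argv = [
        "llama-bench",
        "-m", model_path,
        "-ngl", "99",                   # offload all layers
        "-fa", "1",                     # flash attention on
        "-mmp", "0",                    # mmap off, as in the tables
        "-dev", device,                 # CUDA0 or Vulkan0
        "-p", "512",                    # pp512
        "-n", "128",                    # tg128
        "-d", "0,10000,65000,100000",   # context depths
    ]
    env = dict(os.environ)
    if prefer_host_memory:
        env["GGML_VK_PREFER_HOST_MEMORY"] = "1"
    else:
        env.pop("GGML_VK_PREFER_HOST_MEMORY", None)
    return argv, env

argv, env = bench_command("gpt-oss-20b.gguf")  # hypothetical file name
print(" ".join(argv))

# Only run if the binary and the model are actually present.
if shutil.which("llama-bench") and os.path.exists("gpt-oss-20b.gguf"):
    subprocess.run(argv, env=env, check=True)
```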

with GGML_VK_PREFER_HOST_MEMORY=1

model size params backend ngl fa dev mmap test t/s
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 4110.87 ± 84.29
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 71.30 ± 0.09
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 @ d10000 3558.36 ± 68.06
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 @ d10000 64.76 ± 0.26
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 @ d65000 1828.87 ± 26.65
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 @ d65000 44.40 ± 0.01
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 @ d100000 1381.22 ± 13.35
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 @ d100000 37.17 ± 0.04
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 2097.93 ± 51.80
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 66.88 ± 0.20
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 @ d10000 1732.30 ± 24.81
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 @ d10000 62.21 ± 0.42
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 @ d65000 827.85 ± 7.20
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 @ d65000 45.51 ± 0.11
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 @ d100000 615.54 ± 2.85
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 @ d100000 38.72 ± 0.04

without GGML_VK_PREFER_HOST_MEMORY=1

model size params backend ngl fa dev mmap test t/s
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 4074.91 ± 93.79
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 71.73 ± 0.13
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 @ d10000 3585.54 ± 87.08
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 @ d10000 65.13 ± 0.11
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 @ d65000 1811.16 ± 34.05
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 @ d65000 44.49 ± 0.01
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 @ d100000 1382.26 ± 16.47
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 @ d100000 37.10 ± 0.03
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 2058.68 ± 30.76
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 60.06 ± 4.24
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 @ d10000 1712.20 ± 19.11
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 @ d10000 55.21 ± 0.87
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 @ d65000 825.18 ± 4.88
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 @ d65000 40.92 ± 1.42
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 @ d100000 614.77 ± 1.39
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 @ d100000 35.07 ± 0.68
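
A quick check of the 10-15% figure, using the Vulkan0 tg128 values from the two tables above (with and without the variable). The dictionaries are just a convenient way to hold the copied numbers.

```python
# Vulkan0 tg128 t/s copied from the two tables above, keyed by context depth.
with_env    = {0: 66.88, 10000: 62.21, 65000: 45.51, 100000: 38.72}
without_env = {0: 60.06, 10000: 55.21, 65000: 40.92, 100000: 35.07}

def percent_gain(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

gains = {depth: percent_gain(with_env[depth], without_env[depth]) for depth in with_env}
for depth, gain in gains.items():
    print(f"depth {depth:>6}: {gain:+.1f}% tg")
```

All four depths land between +10% and +13%. The CUDA0 rows in the two tables are within noise of each other, which is what I would expect if the variable only affects the Vulkan backend.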

It’s the latter.

Vulkan is not only a solid compute platform; it also supports “cooperative matrix” operations (i.e. tensor cores / WMMA and the like on NVIDIA).

By default, with the headers from the repo, I have:
ggml_vulkan: 0 = NVIDIA GB10 (NVIDIA) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat

With the Vulkan SDK built from source, I get:
ggml_vulkan: 0 = NVIDIA GB10 (NVIDIA) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
And it adds 10-15% in tg:

model size params backend ngl fa dev mmap test t/s
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 4155.66 ± 54.19
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 73.20 ± 0.10
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 pp512 @ d10000 3646.81 ± 55.16
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 CUDA0 0 tg128 @ d10000 66.57 ± 0.13
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 2764.43 ± 28.25
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 68.00 ± 0.26
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 pp512 @ d10000 2533.45 ± 13.34
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 1 Vulkan0 0 tg128 @ d10000 63.71 ± 0.28
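
If you want to check from a script which cooperative-matrix path your build picked up, the ggml_vulkan device line shown above is easy to split into fields. A minimal sketch, assuming the line keeps the `key: value | key: value` layout seen in this thread (the function name is mine, and the log format may change between llama.cpp versions):

```python
def parse_vulkan_device_line(line):
    """Split a 'ggml_vulkan: N = NAME | key: value | ...' log line into a dict."""
    head, *fields = [part.strip() for part in line.split("|")]
    prefix, _, device = head.partition("=")
    info = {
        "index": int(prefix.replace("ggml_vulkan:", "").strip()),
        "device": device.strip(),
    }
    for field in fields:
        key, _, value = field.partition(":")
        info[key.strip()] = value.strip()
    return info

line = ("ggml_vulkan: 0 = NVIDIA GB10 (NVIDIA) | uma: 1 | fp16: 1 | bf16: 1 | "
        "warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2")
info = parse_vulkan_device_line(line)
print(info["matrix cores"], info["bf16"], info["int dot"])
```

With the default headers the same fields read KHR_coopmat, bf16 0 and int dot 0, so a check on `info["matrix cores"]` is enough to tell the two builds apart.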

With the latest llama.cpp build, Vulkan can outperform CUDA for Qwen3 models.

unsloth/Qwen3.5-35B-A3B-GGUF:Q6_K

model size params backend ngl n_batch fa dev mmap test t/s
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 1834.98 ± 5.66
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 59.77 ± 0.07
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d10000 1706.84 ± 2.01
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d10000 55.63 ± 0.07
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d64000 1350.14 ± 4.41
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d64000 42.87 ± 0.04
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 1915.37 ± 19.92
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 59.89 ± 0.01
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d10000 1744.88 ± 6.58
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d10000 56.47 ± 0.06
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d64000 1269.11 ± 1.66
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d64000 44.73 ± 0.01

unsloth/Qwen3-Coder-Next-GGUF:UD-Q6_K_XL

model size params backend ngl n_batch fa dev mmap test t/s
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 1087.56 ± 2.79
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 37.58 ± 0.02
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d10000 1052.41 ± 3.55
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d10000 35.46 ± 0.03
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d64000 899.92 ± 1.39
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d64000 28.56 ± 0.30
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 1221.80 ± 11.76
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 44.29 ± 0.06
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d10000 1130.06 ± 2.44
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d10000 42.06 ± 0.04
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d64000 887.88 ± 3.36
qwen3next 80B.A3B Q6_K 68.09 GiB 79.67 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d64000 33.82 ± 0.01

unsloth/Qwen3.5-35B-A3B-GGUF:Q6_K

model size params backend ngl n_batch fa dev mmap test t/s
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 1819.41 ± 4.44
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 59.67 ± 0.07
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d10000 1683.18 ± 7.23
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d10000 55.35 ± 0.04
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d64000 1363.08 ± 1.90
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d64000 42.83 ± 0.04
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 1931.55 ± 18.43
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 59.35 ± 0.02
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d10000 1736.51 ± 5.05
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d10000 55.95 ± 0.04
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d64000 1277.84 ± 4.16
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d64000 44.40 ± 0.03

But for gpt-oss 20b it is the same as before: CUDA is still well ahead in pp, and tg is close.

model size params backend ngl n_batch fa dev mmap test t/s
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 4180.38 ± 14.68
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 72.29 ± 0.11
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d10000 3592.74 ± 9.69
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d10000 65.74 ± 0.22
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d64000 2009.12 ± 3.31
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d64000 46.61 ± 0.03
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 2779.13 ± 15.28
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 67.07 ± 0.19
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d10000 2538.70 ± 11.19
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d10000 62.31 ± 0.10
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d64000 1705.03 ± 1.16
gpt-oss 20B MXFP4 MoE 11.77 GiB 20.91 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d64000 46.79 ± 0.06
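
Comparing rows by eye gets tedious, so here is a small sketch that parses pasted llama-bench rows and prints the Vulkan0/CUDA0 ratio per test. It assumes the row layout seen in this thread (device name, then mmap, then the test name, then `value ± error`), and it also tolerates pipe-delimited rows. The function names are mine, and the sample rows are copied from the first Qwen3.5-35B-A3B table above.

```python
import re

DEV = re.compile(r"^(CUDA|Vulkan)\d+$")

def parse_row(line):
    """Parse one llama-bench result row (flat or pipe-delimited) into a dict."""
    tokens = line.replace("|", " ").split()
    dev_idx = next(i for i, tok in enumerate(tokens) if DEV.match(tok))
    return {
        "dev": tokens[dev_idx],
        # tokens[dev_idx + 1] is the mmap column; the test name follows it.
        "test": " ".join(tokens[dev_idx + 2:-3]),
        "tps": float(tokens[-3]),
        "err": float(tokens[-1]),
    }

def vulkan_speedup(lines):
    """Map test name -> Vulkan0 t/s divided by CUDA0 t/s."""
    by_test = {}
    for line in lines:
        if line.strip():
            row = parse_row(line)
            by_test.setdefault(row["test"], {})[row["dev"]] = row["tps"]
    return {test: devs["Vulkan0"] / devs["CUDA0"]
            for test, devs in by_test.items()
            if "Vulkan0" in devs and "CUDA0" in devs}

rows = """
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 1834.98 ± 5.66
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 pp2048 @ d64000 1350.14 ± 4.41
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 CUDA0 0 tg256 @ d64000 42.87 ± 0.04
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 1915.37 ± 19.92
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 pp2048 @ d64000 1269.11 ± 1.66
qwen35moe 35B.A3B Q6_K 26.86 GiB 34.66 B CUDA,Vulkan 99 512 1 Vulkan0 0 tg256 @ d64000 44.73 ± 0.01
"""

speedup = vulkan_speedup(rows.splitlines())
for test, s in speedup.items():
    print(f"{test:18s} {s:.3f}x")
```

For these sample rows Vulkan is a few percent ahead in pp2048 at zero depth and in tg256 at d64000, and a few percent behind in pp2048 at d64000.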

There is also a small difference in tool-eval-bench with the same model and the same settings:
CUDA score: 95
Vulkan score: 96

uvx tool-eval-bench --compare 2026-04-23T11-25-05Z_e58296 2026-04-23T11-09-34Z_e58296

╭─────────────────────────────────────────────────────────────────────────────────── 📊 Run Comparison ────────────────────────────────────────────────────────────────────────────────────╮
│ A (baseline): 2026-04-23T11-25-05Z_e58296 model=unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL │
│ tool-eval-bench: v1.4.0 │
│ Engine: llama.cpp b8901-8635e221c │
│ Quantization: Q6_K_X │
│ Temperature: 0.2 │
│ Seed: 42 │
│ Thinking: enabled │
│ Host: gx10-8990 │
│ B (current): 2026-04-23T11-09-34Z_e58296 model=unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL │
│ tool-eval-bench: v1.4.0 │
│ Engine: llama.cpp b8901-8635e221c │
│ Quantization: Q6_K_X │
│ Temperature: 0.2 │
│ Seed: 42 │
│ Thinking: enabled │
│ Host: gx10-8990 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

┏━━━━━━━━┳━━━━━━━━━━┳━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ ID     ┃ A        ┃     ┃ B        ┃ Δ      ┃ Time Δ   ┃ Note     ┃
┡━━━━━━━━╇━━━━━━━━━━╇━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ TC-01 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.8s │ │
│ TC-02 │ ✅ 2 │ → │ ✅ 2 │ = │ +4.6s │ │
│ TC-03 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.8s │ │
│ TC-04 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.3s │ │
│ TC-05 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.2s │ │
│ TC-06 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.6s │ │
│ TC-07 │ ✅ 2 │ → │ ✅ 2 │ = │ +2.0s │ │
│ TC-08 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.2s │ │
│ TC-09 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.2s │ │
│ TC-10 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.2s │ │
│ TC-11 │ ✅ 2 │ → │ ✅ 2 │ = │ +3.0s │ │
│ TC-12 │ ✅ 2 │ → │ ✅ 2 │ = │ -0.4s │ │
│ TC-13 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.7s │ │
│ TC-14 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.4s │ │
│ TC-15 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.4s │ │
│ TC-16 │ ✅ 2 │ → │ ✅ 2 │ = │ -3.7s │ │
│ TC-17 │ ✅ 2 │ → │ ✅ 2 │ = │ +15.5s │ │
│ TC-18 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.5s │ │
│ TC-19 │ ✅ 2 │ → │ ✅ 2 │ = │ +2.3s │ │
│ TC-20 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.1s │ │
│ TC-21 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.4s │ │
│ TC-22 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.4s │ │
│ TC-23 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.9s │ │
│ TC-24 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.7s │ │
│ TC-25 │ ✅ 2 │ → │ ✅ 2 │ = │ -0.2s │ │
│ TC-26 │ ✅ 2 │ → │ ✅ 2 │ = │ +3.0s │ │
│ TC-27 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.5s │ │
│ TC-28 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.9s │ │
│ TC-29 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.7s │ │
│ TC-30 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.0s │ │
│ TC-31 │ ✅ 2 │ → │ ✅ 2 │ = │ +2.4s │ │
│ TC-32 │ ✅ 2 │ → │ ✅ 2 │ = │ +2.0s │ │
│ TC-33 │ ✅ 2 │ → │ ✅ 2 │ = │ -3.2s │ │
│ TC-34 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.7s │ │
│ TC-35 │ ⚠ 1 │ → │ ⚠ 1 │ = │ +13.5s │ │
│ TC-36 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.5s │ │
│ TC-37 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.9s │ │
│ TC-38 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.7s │ │
│ TC-39 │ ⚠ 1 │ → │ ⚠ 1 │ = │ +0.4s │ │
│ TC-40 │ ✅ 2 │ → │ ✅ 2 │ = │ +3.5s │ │
│ TC-41 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.9s │ │
│ TC-42 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.3s │ │
│ TC-43 │ ✅ 2 │ → │ ✅ 2 │ = │ -1.3s │ │
│ TC-44 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.2s │ │
│ TC-45 │ ✅ 2 │ → │ ✅ 2 │ = │ +16.2s │ │
│ TC-46 │ ⚠ 1 │ → │ ⚠ 1 │ = │ +4.7s │ │
│ TC-47 │ ⚠ 1 │ → │ ⚠ 1 │ = │ +14.1s │ │
│ TC-48 │ ✅ 2 │ → │ ✅ 2 │ = │ +13.9s │ │
│ TC-49 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.3s │ │
│ TC-50 │ ✅ 2 │ → │ ✅ 2 │ = │ +12.5s │ │
│ TC-51 │ ⚠ 1 │ → │ ✅ 2 │ +1 │ +2.9s │ improved │
│ TC-52 │ ✅ 2 │ → │ ✅ 2 │ = │ -1.6s │ │
│ TC-53 │ ✅ 2 │ → │ ✅ 2 │ = │ -0.4s │ │
│ TC-54 │ ✅ 2 │ → │ ✅ 2 │ = │ +2.9s │ │
│ TC-55 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.4s │ │
│ TC-56 │ ⚠ 1 │ → │ ⚠ 1 │ = │ +1.8s │ │
│ TC-57 │ ✅ 2 │ → │ ✅ 2 │ = │ +3.1s │ │
│ TC-58 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.2s │ │
│ TC-59 │ ✅ 2 │ → │ ✅ 2 │ = │ +3.4s │ │
│ TC-60 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.4s │ │
│ TC-61 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.8s │ │
│ TC-62 │ ⚠ 1 │ → │ ✅ 2 │ +1 │ +10.2s │ improved │
│ TC-63 │ ✅ 2 │ → │ ✅ 2 │ = │ -0.3s │ │
│ TC-64 │ ✅ 2 │ → │ ✅ 2 │ = │ +2.7s │ │
│ TC-65 │ ✅ 2 │ → │ ✅ 2 │ = │ +0.8s │ │
│ TC-66 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.1s │ │
│ TC-67 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.3s │ │
│ TC-68 │ ✅ 2 │ → │ ✅ 2 │ = │ +3.2s │ │
│ TC-69 │ ✅ 2 │ → │ ✅ 2 │ = │ +1.8s │ │
└────────┴──────────┴─────┴──────────┴────────┴──────────┴──────────┘
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ↑ 2 improved ↓ 0 regressed = 67 unchanged │
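│ Points: 131 → 133 (+2) │
│ Score: 95 → 96 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯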
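
The points and scores can be reproduced from the table. A small sketch, assuming the score is simply points over the maximum (69 cases at 2 points each), rounded to the nearest integer; that formula is my guess, it just happens to match both runs.

```python
N_CASES = 69       # TC-01 .. TC-69 in the table above
MAX_PER_CASE = 2   # a full pass shows 2, a partial result shows 1

def score(points, n_cases=N_CASES, max_per_case=MAX_PER_CASE):
    """Percent of the maximum points, rounded to the nearest integer (assumed formula)."""
    return round(100 * points / (n_cases * max_per_case))

# Cases that got 1 point in each run, read off the table.
partial_a = {35, 39, 46, 47, 51, 56, 62}
partial_b = {35, 39, 46, 47, 56}

points_a = N_CASES * MAX_PER_CASE - len(partial_a)
points_b = N_CASES * MAX_PER_CASE - len(partial_b)
print(points_a, score(points_a))
print(points_b, score(points_b))
```

The whole gap comes from TC-51 and TC-62 going from partial to full, so from a single pair of runs it is hard to say whether this is a backend effect or run-to-run variation.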