Qwen 3.5 SLM on DGX GB10

First tests of the 9B model (Qwen/Qwen3.5-9B) using @eugr's vLLM image, benchmarked with llama-benchy with MTP enabled.
Does anyone have better results?
Is there a better recipe?

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3.5-9B | pp2048 (c1) | 2620.55 ± 108.88 | 2620.55 ± 108.88 | | | 784.91 ± 33.23 | 783.02 ± 33.23 | 785.00 ± 33.22 |
| Qwen/Qwen3.5-9B | tg128 (c1) | 9.68 ± 0.24 | 9.68 ± 0.24 | 10.67 ± 0.47 | 10.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 (c2) | 3149.17 ± 65.54 | 1577.58 ± 32.55 | | | 1301.05 ± 26.31 | 1299.16 ± 26.31 | 1301.10 ± 26.31 |
| Qwen/Qwen3.5-9B | tg128 (c2) | 18.51 ± 0.40 | 9.79 ± 0.17 | 22.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d4096 (c1) | 3172.55 ± 235.93 | 3172.55 ± 235.93 | | | 1300.38 ± 97.06 | 1298.49 ± 97.06 | 1300.49 ± 97.07 |
| Qwen/Qwen3.5-9B | ctx_tg @ d4096 (c1) | 9.92 ± 0.22 | 9.92 ± 0.22 | 11.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d4096 (c1) | 1243.76 ± 6.36 | 1243.76 ± 6.36 | | | 1648.56 ± 8.45 | 1646.67 ± 8.45 | 1648.66 ± 8.45 |
| Qwen/Qwen3.5-9B | tg128 @ d4096 (c1) | 10.09 ± 0.02 | 10.09 ± 0.02 | 11.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d4096 (c2) | 3883.72 ± 15.81 | 1943.85 ± 7.99 | | | 2109.68 ± 8.86 | 2107.79 ± 8.86 | 2109.73 ± 8.85 |
| Qwen/Qwen3.5-9B | ctx_tg @ d4096 (c2) | 19.40 ± 0.59 | 10.05 ± 0.03 | 22.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d4096 (c2) | 1304.04 ± 37.92 | 652.49 ± 18.98 | | | 3143.33 ± 93.21 | 3141.44 ± 93.21 | 3143.39 ± 93.21 |
| Qwen/Qwen3.5-9B | tg128 @ d4096 (c2) | 18.82 ± 0.91 | 9.91 ± 0.16 | 22.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d8192 (c1) | 3820.75 ± 15.57 | 3820.75 ± 15.57 | | | 2146.18 ± 8.64 | 2144.29 ± 8.64 | 2146.25 ± 8.62 |
| Qwen/Qwen3.5-9B | ctx_tg @ d8192 (c1) | 9.97 ± 0.00 | 9.97 ± 0.00 | 11.00 ± 0.00 | 11.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d8192 (c1) | 775.86 ± 3.93 | 775.86 ± 3.93 | | | 2641.61 ± 13.35 | 2639.72 ± 13.35 | 2641.66 ± 13.34 |
| Qwen/Qwen3.5-9B | tg128 @ d8192 (c1) | 9.89 ± 0.01 | 9.89 ± 0.01 | 10.33 ± 0.47 | 10.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d8192 (c2) | 4077.40 ± 3.08 | 2039.80 ± 1.56 | | | 4018.39 ± 2.99 | 4016.50 ± 2.99 | 4018.44 ± 2.99 |
| Qwen/Qwen3.5-9B | ctx_tg @ d8192 (c2) | 18.92 ± 0.11 | 9.78 ± 0.11 | 21.33 ± 0.94 | 10.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d8192 (c2) | 816.36 ± 1.05 | 408.37 ± 0.53 | | | 5016.98 ± 6.46 | 5015.09 ± 6.46 | 5017.04 ± 6.46 |
| Qwen/Qwen3.5-9B | tg128 @ d8192 (c2) | 18.48 ± 0.65 | 9.72 ± 0.02 | 20.00 ± 0.00 | 10.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d16384 (c1) | 3927.25 ± 5.82 | 3927.25 ± 5.82 | | | 4173.86 ± 6.41 | 4171.97 ± 6.41 | 4173.96 ± 6.40 |
| Qwen/Qwen3.5-9B | ctx_tg @ d16384 (c1) | 9.57 ± 0.02 | 9.57 ± 0.02 | 10.00 ± 0.00 | 10.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d16384 (c1) | 436.06 ± 0.63 | 436.06 ± 0.63 | | | 4698.53 ± 6.79 | 4696.64 ± 6.79 | 4698.64 ± 6.77 |
| Qwen/Qwen3.5-9B | tg128 @ d16384 (c1) | 9.51 ± 0.02 | 9.51 ± 0.02 | 10.00 ± 0.00 | 10.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d16384 (c2) | 4050.36 ± 66.34 | 2036.83 ± 52.49 | | | 8051.44 ± 202.01 | 8049.55 ± 202.01 | 8051.50 ± 202.02 |
| Qwen/Qwen3.5-9B | ctx_tg @ d16384 (c2) | 18.58 ± 0.09 | 9.45 ± 0.07 | 20.33 ± 0.47 | 10.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d16384 (c2) | 447.29 ± 5.77 | 224.80 ± 4.71 | | | 9116.12 ± 187.59 | 9114.23 ± 187.59 | 9116.16 ± 187.59 |
| Qwen/Qwen3.5-9B | tg128 @ d16384 (c2) | 18.20 ± 0.03 | 9.35 ± 0.13 | 20.33 ± 0.47 | 10.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d32768 (c1) | 3906.92 ± 2.15 | 3906.92 ± 2.15 | | | 8389.14 ± 4.70 | 8387.25 ± 4.70 | 8389.22 ± 4.70 |
| Qwen/Qwen3.5-9B | ctx_tg @ d32768 (c1) | 8.65 ± 0.02 | 8.65 ± 0.02 | 9.00 ± 0.00 | 9.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d32768 (c1) | 224.00 ± 1.87 | 224.00 ± 1.87 | | | 9145.26 ± 75.94 | 9143.37 ± 75.94 | 9145.36 ± 75.96 |
| Qwen/Qwen3.5-9B | tg128 @ d32768 (c1) | 8.61 ± 0.02 | 8.61 ± 0.02 | 9.00 ± 0.00 | 9.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d32768 (c2) | 3712.80 ± 281.87 | 1874.29 ± 154.97 | | | 17612.52 ± 1538.50 | 17610.63 ± 1538.50 | 17612.58 ± 1538.51 |
| Qwen/Qwen3.5-9B | ctx_tg @ d32768 (c2) | 16.27 ± 0.66 | 8.64 ± 0.20 | 18.00 ± 0.00 | 9.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d32768 (c2) | 216.48 ± 15.33 | 109.29 ± 8.49 | | | 18861.48 ± 1543.33 | 18859.59 ± 1543.33 | 18861.53 ± 1543.32 |
| Qwen/Qwen3.5-9B | tg128 @ d32768 (c2) | 16.14 ± 0.76 | 8.55 ± 0.25 | 18.00 ± 0.00 | 9.17 ± 0.37 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d65535 (c1) | 3157.82 ± 119.30 | 3157.82 ± 119.30 | | | 20785.76 ± 803.08 | 20783.87 ± 803.08 | 20785.85 ± 803.07 |
| Qwen/Qwen3.5-9B | ctx_tg @ d65535 (c1) | 7.72 ± 0.09 | 7.72 ± 0.09 | 8.33 ± 0.47 | 8.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d65535 (c1) | 91.90 ± 0.46 | 91.90 ± 0.46 | | | 22286.91 ± 111.65 | 22285.02 ± 111.65 | 22286.99 ± 111.63 |
| Qwen/Qwen3.5-9B | tg128 @ d65535 (c1) | 7.64 ± 0.09 | 7.64 ± 0.09 | 8.33 ± 0.47 | 8.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d65535 (c2) | 2286.44 ± 491.33 | 1155.57 ± 254.76 | | | 60172.33 ± 15685.93 | 60170.44 ± 15685.93 | 60175.67 ± 15685.28 |
| Qwen/Qwen3.5-9B | ctx_tg @ d65535 (c2) | 7.94 ± 1.11 | 4.42 ± 0.36 | 14.67 ± 0.94 | 7.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d65535 (c2) | 80.29 ± 2.96 | 40.45 ± 1.46 | | | 50694.61 ± 1874.98 | 50692.72 ± 1874.98 | 50698.34 ± 1874.29 |
| Qwen/Qwen3.5-9B | tg128 @ d65535 (c2) | 10.07 ± 1.40 | 5.35 ± 0.89 | 15.33 ± 0.94 | 7.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d100000 (c1) | 2407.21 ± 33.79 | 2407.21 ± 33.79 | | | 41552.12 ± 578.07 | 41550.23 ± 578.07 | 41582.82 ± 599.50 |
| Qwen/Qwen3.5-9B | ctx_tg @ d100000 (c1) | 4.31 ± 0.81 | 4.31 ± 0.81 | 7.67 ± 1.25 | 7.67 ± 1.25 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d100000 (c1) | 44.33 ± 4.38 | 44.33 ± 4.38 | | | 46680.51 ± 4856.03 | 46678.62 ± 4856.03 | 46702.83 ± 4849.40 |
| Qwen/Qwen3.5-9B | tg128 @ d100000 (c1) | 5.64 ± 0.49 | 5.64 ± 0.49 | 7.67 ± 0.94 | 7.67 ± 0.94 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d100000 (c2) | 2328.80 ± 182.23 | 1200.93 ± 76.42 | | | 83646.13 ± 5896.07 | 83644.24 ± 5896.07 | 83653.27 ± 5892.10 |
| Qwen/Qwen3.5-9B | ctx_tg @ d100000 (c2) | 5.98 ± 4.08 | 4.48 ± 2.12 | 11.67 ± 4.78 | 7.06 ± 2.53 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d100000 (c2) | 48.26 ± 0.91 | 24.32 ± 0.50 | | | 84264.99 ± 1754.30 | 84263.10 ± 1754.30 | 84269.39 ± 1752.62 |
| Qwen/Qwen3.5-9B | tg128 @ d100000 (c2) | 8.28 ± 1.01 | 4.56 ± 0.57 | 14.67 ± 0.94 | 7.33 ± 0.47 | | | |
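
A quick way to sanity-check the prefill rows above: the reported t/s is roughly the number of prompt tokens divided by est_ppt. This is a minimal sketch using three c1 rows copied from the MTP results. The explanation for the small mismatch (per-run throughput averaged across runs) is my assumption, not something taken from the llama-benchy docs.

```python
# (prompt tokens, reported t/s, est_ppt in ms) for three c1 rows from the MTP run
rows = {
    "pp2048 (c1)": (2048, 2620.55, 783.02),
    "ctx_pp @ d4096 (c1)": (4096, 3172.55, 1298.49),
    "ctx_pp @ d32768 (c1)": (32768, 3906.92, 8387.25),
}


def implied_tps(tokens, est_ppt_ms):
    """Throughput implied by processing `tokens` prompt tokens in est_ppt_ms milliseconds."""
    return tokens / (est_ppt_ms / 1000.0)


for name, (tokens, reported, ppt_ms) in rows.items():
    est = implied_tps(tokens, ppt_ms)
    print(f"{name}: reported {reported:.0f} t/s, implied {est:.0f} t/s ({est / reported - 1:+.1%})")
```

The implied and reported figures agree to within about 1% on these rows, so either column can be used to compare runs.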

Wow! I was hoping to see much stronger numbers from this smaller Qwen3.5 model. I hope we’ll see improvements here as people figure out how to optimize this one.

Without MTP:

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3.5-9B | pp2048 (c1) | 3027.19 ± 277.93 | 3027.19 ± 277.93 | | | 684.28 ± 66.69 | 682.88 ± 66.69 | 684.34 ± 66.68 |
| Qwen/Qwen3.5-9B | tg128 (c1) | 12.19 ± 0.14 | 12.19 ± 0.14 | 13.00 ± 0.00 | 13.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 (c2) | 2478.75 ± 1386.40 | 1274.87 ± 718.72 | | | 3388.35 ± 3162.12 | 3386.95 ± 3162.12 | 3388.39 ± 3162.11 |
| Qwen/Qwen3.5-9B | tg128 (c2) | 25.94 ± 0.20 | 13.07 ± 0.10 | 28.00 ± 0.00 | 14.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d4096 (c1) | 3420.58 ± 132.87 | 3420.58 ± 132.87 | | | 1200.92 ± 45.56 | 1199.52 ± 45.56 | 1201.00 ± 45.56 |
| Qwen/Qwen3.5-9B | ctx_tg @ d4096 (c1) | 12.33 ± 0.46 | 12.33 ± 0.46 | 13.67 ± 0.47 | 13.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d4096 (c1) | 1171.72 ± 33.99 | 1171.72 ± 33.99 | | | 1750.70 ± 50.06 | 1749.30 ± 50.06 | 1750.82 ± 50.07 |
| Qwen/Qwen3.5-9B | tg128 @ d4096 (c1) | 12.59 ± 0.41 | 12.59 ± 0.41 | 14.00 ± 0.00 | 14.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d4096 (c2) | 3655.03 ± 77.51 | 2156.11 ± 330.92 | | | 1947.50 ± 298.42 | 1946.10 ± 298.42 | 1947.57 ± 298.41 |
| Qwen/Qwen3.5-9B | ctx_tg @ d4096 (c2) | 25.28 ± 0.56 | 13.14 ± 0.41 | 28.67 ± 0.94 | 14.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d4096 (c2) | 1205.34 ± 6.35 | 806.51 ± 224.95 | | | 2730.47 ± 691.18 | 2729.07 ± 691.18 | 2730.55 ± 691.19 |
| Qwen/Qwen3.5-9B | tg128 @ d4096 (c2) | 23.09 ± 0.47 | 12.49 ± 0.70 | 28.00 ± 0.00 | 14.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d8192 (c1) | 3586.40 ± 99.22 | 3586.40 ± 99.22 | | | 2287.51 ± 62.75 | 2286.11 ± 62.75 | 2287.58 ± 62.73 |
| Qwen/Qwen3.5-9B | ctx_tg @ d8192 (c1) | 12.28 ± 0.30 | 12.28 ± 0.30 | 13.67 ± 0.47 | 13.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d8192 (c1) | 722.49 ± 16.18 | 722.49 ± 16.18 | | | 2837.44 ± 63.15 | 2836.04 ± 63.15 | 2837.52 ± 63.14 |
| Qwen/Qwen3.5-9B | tg128 @ d8192 (c1) | 12.48 ± 0.20 | 12.48 ± 0.20 | 13.33 ± 0.47 | 13.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d8192 (c2) | 3635.95 ± 105.84 | 2482.97 ± 689.01 | | | 3565.48 ± 960.03 | 3564.08 ± 960.03 | 3565.56 ± 960.02 |
| Qwen/Qwen3.5-9B | ctx_tg @ d8192 (c2) | 22.63 ± 1.17 | 12.62 ± 1.07 | 29.33 ± 0.94 | 14.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d8192 (c2) | 732.32 ± 10.68 | 489.78 ± 123.73 | | | 4467.80 ± 1128.34 | 4466.41 ± 1128.34 | 4467.88 ± 1128.34 |
| Qwen/Qwen3.5-9B | tg128 @ d8192 (c2) | 21.73 ± 0.53 | 12.35 ± 1.15 | 28.67 ± 0.94 | 14.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d16384 (c1) | 3373.37 ± 86.53 | 3373.37 ± 86.53 | | | 4861.87 ± 125.59 | 4860.48 ± 125.59 | 4861.95 ± 125.59 |
| Qwen/Qwen3.5-9B | ctx_tg @ d16384 (c1) | 11.78 ± 0.38 | 11.78 ± 0.38 | 12.67 ± 0.47 | 12.67 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d16384 (c1) | 365.65 ± 8.59 | 365.65 ± 8.59 | | | 5605.47 ± 130.43 | 5604.07 ± 130.43 | 5605.54 ± 130.43 |
| Qwen/Qwen3.5-9B | tg128 @ d16384 (c1) | 11.72 ± 0.19 | 11.72 ± 0.19 | 13.00 ± 0.00 | 13.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d16384 (c2) | 3532.16 ± 75.94 | 2447.23 ± 685.91 | | | 7263.97 ± 2027.19 | 7262.57 ± 2027.19 | 7264.04 ± 2027.18 |
| Qwen/Qwen3.5-9B | ctx_tg @ d16384 (c2) | 18.77 ± 0.13 | 11.62 ± 1.77 | 28.00 ± 0.00 | 14.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d16384 (c2) | 386.24 ± 4.57 | 277.98 ± 85.18 | | | 8128.54 ± 2481.68 | 8127.14 ± 2481.68 | 8128.62 ± 2481.68 |
| Qwen/Qwen3.5-9B | tg128 @ d16384 (c2) | 17.52 ± 0.31 | 11.26 ± 2.04 | 28.00 ± 0.00 | 14.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d32768 (c1) | 3303.76 ± 62.77 | 3303.76 ± 62.77 | | | 9923.64 ± 187.04 | 9922.24 ± 187.04 | 9923.73 ± 187.07 |
| Qwen/Qwen3.5-9B | ctx_tg @ d32768 (c1) | 12.01 ± 0.33 | 12.01 ± 0.33 | 13.00 ± 0.00 | 13.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d32768 (c1) | 192.10 ± 3.98 | 192.10 ± 3.98 | | | 10666.83 ± 218.26 | 10665.43 ± 218.26 | 10666.92 ± 218.25 |
| Qwen/Qwen3.5-9B | tg128 @ d32768 (c1) | 12.34 ± 0.32 | 12.34 ± 0.32 | 13.00 ± 0.00 | 13.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d32768 (c2) | 3276.51 ± 87.01 | 2405.53 ± 778.11 | | | 15189.52 ± 4858.44 | 15188.12 ± 4858.44 | 15189.60 ± 4858.46 |
| Qwen/Qwen3.5-9B | ctx_tg @ d32768 (c2) | 12.99 ± 0.19 | 9.90 ± 2.96 | 28.00 ± 0.00 | 14.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d32768 (c2) | 192.45 ± 2.53 | 138.89 ± 42.70 | | | 16285.27 ± 5006.68 | 16283.87 ± 5006.68 | 16285.36 ± 5006.68 |
| Qwen/Qwen3.5-9B | tg128 @ d32768 (c2) | 12.90 ± 0.21 | 10.01 ± 3.10 | 28.00 ± 0.00 | 14.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d65535 (c1) | 2955.81 ± 26.20 | 2955.81 ± 26.20 | | | 22175.06 ± 197.79 | 22173.66 ± 197.79 | 22175.12 ± 197.79 |
| Qwen/Qwen3.5-9B | ctx_tg @ d65535 (c1) | 12.04 ± 0.04 | 12.04 ± 0.04 | 13.00 ± 0.00 | 13.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d65535 (c1) | 87.65 ± 0.46 | 87.65 ± 0.46 | | | 23368.00 ± 122.53 | 23366.61 ± 122.53 | 23368.08 ± 122.52 |
| Qwen/Qwen3.5-9B | tg128 @ d65535 (c1) | 12.08 ± 0.02 | 12.08 ± 0.02 | 13.00 ± 0.00 | 13.00 ± 0.00 | | | |
| Qwen/Qwen3.5-9B | ctx_pp @ d65535 (c2) | 2867.11 ± 27.46 | 2149.51 ± 716.07 | | | 34296.70 ± 11427.47 | 34295.30 ± 11427.47 | 34296.80 ± 11427.46 |
| Qwen/Qwen3.5-9B | ctx_tg @ d65535 (c2) | 7.68 ± 0.08 | 8.31 ± 4.12 | 26.67 ± 0.94 | 13.33 ± 0.47 | | | |
| Qwen/Qwen3.5-9B | pp2048 @ d65535 (c2) | 86.04 ± 1.18 | 64.44 ± 21.43 | | | 35737.43 ± 11886.51 | 35736.03 ± 11886.51 | 35737.51 ± 11886.51 |
| Qwen/Qwen3.5-9B | tg128 @ d65535 (c2) | 7.43 ± 0.20 | 8.09 ± 4.05 | 26.67 ± 0.94 | 13.33 ± 0.47 | | | |
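
Pulling the single-request tg128 rows out of the two 9B runs above makes the MTP comparison easier to read. This is only a sketch over the numbers already posted (c1 decode, mean t/s); it says nothing about why MTP is slower here, for example draft acceptance rate or image configuration.

```python
# tg128 (c1) decode throughput in t/s, keyed by context depth, copied from the two 9B runs
with_mtp = {0: 9.68, 4096: 10.09, 8192: 9.89, 16384: 9.51, 32768: 8.61, 65535: 7.64}
without_mtp = {0: 12.19, 4096: 12.59, 8192: 12.48, 16384: 11.72, 32768: 12.34, 65535: 12.08}

# Ratio below 1.0 means the MTP run decoded slower than the non-MTP run at that depth
ratios = {depth: with_mtp[depth] / without_mtp[depth] for depth in with_mtp}

for depth, ratio in ratios.items():
    print(f"depth {depth:>6}: MTP / no-MTP = {ratio:.2f}")
```

On these numbers the MTP run is slower at every depth, and the gap widens past 32k context.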

I’m playing around with Intel AutoRound right now to try to quantize the full weights on the Spark.

I’ll let you know what I get.

This is the unquantized bf16 model: about 18 GB of weights, so the decode speed is about right (without MTP).
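
A rough back-of-the-envelope check on "about right": if single-stream decode is memory-bandwidth-bound, each generated token has to read all the weights once, so the ceiling is bandwidth divided by weight size. The 273 GB/s figure is an assumption taken from NVIDIA's published GB10 spec, not something measured in this thread, and the estimate ignores KV cache reads and other overhead.

```python
def decode_ceiling_tps(weight_gb, bandwidth_gb_s):
    """Upper bound on single-stream decode speed if every token reads all weights once."""
    return bandwidth_gb_s / weight_gb


WEIGHTS_GB = 18.0        # bf16 weights, as stated in the post above
BANDWIDTH_GB_S = 273.0   # ASSUMPTION: published GB10 memory bandwidth
MEASURED_TPS = 12.19     # tg128 (c1) without MTP, from the benchmark run

ceiling = decode_ceiling_tps(WEIGHTS_GB, BANDWIDTH_GB_S)
print(f"ceiling ~{ceiling:.1f} t/s, measured {MEASURED_TPS} t/s "
      f"({MEASURED_TPS / ceiling:.0%} of ceiling)")
```

Under that assumption the measured 12 t/s is already most of what bf16 weights allow, which is why quantization is the obvious next step.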

Hoping to see FP8 in a few days :)

4B model, FP16:

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3.5-4B | pp2048 (c1) | 3103.66 ± 2103.70 | 3103.66 ± 2103.70 | | | 2670.56 ± 3111.01 | 2669.73 ± 3111.01 | 2670.62 ± 3111.02 |
| Qwen/Qwen3.5-4B | tg128 (c1) | 20.17 ± 0.11 | 20.17 ± 0.11 | 21.00 ± 0.00 | 21.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | pp2048 (c2) | 5828.39 ± 112.76 | 2989.74 ± 89.97 | | | 686.86 ± 20.67 | 686.02 ± 20.67 | 686.90 ± 20.64 |
| Qwen/Qwen3.5-4B | tg128 (c2) | 47.54 ± 1.08 | 23.95 ± 0.54 | 50.00 ± 0.00 | 25.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d4096 (c1) | 6069.89 ± 293.33 | 6069.89 ± 293.33 | | | 677.30 ± 32.12 | 676.47 ± 32.12 | 677.38 ± 32.11 |
| Qwen/Qwen3.5-4B | ctx_tg @ d4096 (c1) | 20.09 ± 0.09 | 20.09 ± 0.09 | 21.00 ± 0.00 | 21.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d4096 (c1) | 2045.73 ± 48.66 | 2045.73 ± 48.66 | | | 1002.50 ± 23.43 | 1001.66 ± 23.43 | 1002.56 ± 23.44 |
| Qwen/Qwen3.5-4B | tg128 @ d4096 (c1) | 19.99 ± 0.06 | 19.99 ± 0.06 | 21.00 ± 0.00 | 21.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d4096 (c2) | 5967.57 ± 152.35 | 3494.49 ± 522.00 | | | 1199.85 ± 177.79 | 1199.02 ± 177.79 | 1199.88 ± 177.77 |
| Qwen/Qwen3.5-4B | ctx_tg @ d4096 (c2) | 44.09 ± 1.41 | 22.96 ± 0.91 | 50.00 ± 0.00 | 25.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d4096 (c2) | 1997.85 ± 8.07 | 1271.21 ± 272.14 | | | 1689.20 ± 361.16 | 1688.37 ± 361.16 | 1689.25 ± 361.12 |
| Qwen/Qwen3.5-4B | tg128 @ d4096 (c2) | 40.84 ± 0.29 | 22.01 ± 1.10 | 48.67 ± 0.94 | 24.33 ± 0.47 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d8192 (c1) | 5666.08 ± 279.78 | 5666.08 ± 279.78 | | | 1450.23 ± 69.19 | 1449.39 ± 69.19 | 1450.29 ± 69.18 |
| Qwen/Qwen3.5-4B | ctx_tg @ d8192 (c1) | 19.13 ± 0.47 | 19.13 ± 0.47 | 19.67 ± 0.47 | 19.67 ± 0.47 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d8192 (c1) | 1181.81 ± 41.90 | 1181.81 ± 41.90 | | | 1735.98 ± 62.42 | 1735.14 ± 62.42 | 1736.05 ± 62.42 |
| Qwen/Qwen3.5-4B | tg128 @ d8192 (c1) | 19.01 ± 0.51 | 19.01 ± 0.51 | 20.33 ± 0.47 | 20.33 ± 0.47 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d8192 (c2) | 5993.04 ± 197.75 | 3933.35 ± 943.44 | | | 2210.90 ± 531.16 | 2210.07 ± 531.16 | 2210.94 ± 531.13 |
| Qwen/Qwen3.5-4B | ctx_tg @ d8192 (c2) | 39.15 ± 0.38 | 21.79 ± 1.59 | 49.00 ± 0.82 | 24.50 ± 0.50 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d8192 (c2) | 1196.35 ± 25.65 | 860.46 ± 273.55 | | | 2634.70 ± 804.00 | 2633.86 ± 804.00 | 2634.73 ± 803.98 |
| Qwen/Qwen3.5-4B | tg128 @ d8192 (c2) | 35.92 ± 0.38 | 20.92 ± 2.22 | 48.00 ± 0.00 | 24.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d16384 (c1) | 5692.09 ± 29.75 | 5692.09 ± 29.75 | | | 2879.53 ± 15.06 | 2878.69 ± 15.06 | 2879.60 ± 15.07 |
| Qwen/Qwen3.5-4B | ctx_tg @ d16384 (c1) | 19.72 ± 0.03 | 19.72 ± 0.03 | 20.00 ± 0.00 | 20.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d16384 (c1) | 628.51 ± 4.01 | 628.51 ± 4.01 | | | 3259.47 ± 20.84 | 3258.63 ± 20.84 | 3259.53 ± 20.84 |
| Qwen/Qwen3.5-4B | tg128 @ d16384 (c1) | 19.72 ± 0.06 | 19.72 ± 0.06 | 20.00 ± 0.00 | 20.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d16384 (c2) | 5794.08 ± 25.21 | 4034.53 ± 1137.36 | | | 4412.54 ± 1243.47 | 4411.71 ± 1243.47 | 4412.57 ± 1243.46 |
| Qwen/Qwen3.5-4B | ctx_tg @ d16384 (c2) | 31.72 ± 0.09 | 19.86 ± 3.14 | 48.00 ± 0.00 | 24.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d16384 (c2) | 633.49 ± 3.76 | 446.22 ± 129.45 | | | 5012.26 ± 1453.92 | 5011.42 ± 1453.92 | 5012.29 ± 1453.90 |
| Qwen/Qwen3.5-4B | tg128 @ d16384 (c2) | 30.01 ± 0.15 | 19.36 ± 3.49 | 48.00 ± 0.00 | 24.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d32768 (c1) | 4931.18 ± 88.90 | 4931.18 ± 88.90 | | | 6648.28 ± 120.66 | 6647.44 ± 120.66 | 6648.34 ± 120.65 |
| Qwen/Qwen3.5-4B | ctx_tg @ d32768 (c1) | 18.14 ± 0.42 | 18.14 ± 0.42 | 19.00 ± 0.00 | 19.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d32768 (c1) | 283.98 ± 7.49 | 283.98 ± 7.49 | | | 7217.72 ± 193.98 | 7216.88 ± 193.98 | 7217.79 ± 193.97 |
| Qwen/Qwen3.5-4B | tg128 @ d32768 (c1) | 18.55 ± 0.15 | 18.55 ± 0.15 | 19.33 ± 0.47 | 19.33 ± 0.47 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d32768 (c2) | 5002.33 ± 8.92 | 3667.01 ± 1168.00 | | | 9942.96 ± 3159.95 | 9942.13 ± 3159.95 | 9943.03 ± 3159.91 |
| Qwen/Qwen3.5-4B | ctx_tg @ d32768 (c2) | 20.67 ± 0.25 | 16.18 ± 5.07 | 46.00 ± 0.00 | 23.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d32768 (c2) | 289.92 ± 0.08 | 212.61 ± 67.76 | | | 10719.57 ± 3409.61 | 10718.73 ± 3409.61 | 10719.60 ± 3409.59 |
| Qwen/Qwen3.5-4B | tg128 @ d32768 (c2) | 19.77 ± 0.23 | 15.85 ± 5.20 | 44.00 ± 0.00 | 22.00 ± 0.00 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d65535 (c1) | 4030.86 ± 41.09 | 4030.86 ± 41.09 | | | 16261.09 ± 166.78 | 16260.25 ± 166.78 | 16261.17 ± 166.77 |
| Qwen/Qwen3.5-4B | ctx_tg @ d65535 (c1) | 17.79 ± 0.52 | 17.79 ± 0.52 | 19.00 ± 0.82 | 19.00 ± 0.82 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d65535 (c1) | 121.40 ± 7.50 | 121.40 ± 7.50 | | | 16936.73 ± 1060.26 | 16935.89 ± 1060.26 | 16936.80 ± 1060.26 |
| Qwen/Qwen3.5-4B | tg128 @ d65535 (c1) | 17.30 ± 1.38 | 17.30 ± 1.38 | 18.67 ± 1.25 | 18.67 ± 1.25 | | | |
| Qwen/Qwen3.5-4B | ctx_pp @ d65535 (c2) | 4082.90 ± 215.49 | 3020.06 ± 1008.23 | | | 24349.50 ± 8000.46 | 24348.67 ± 8000.46 | 24349.53 ± 8000.44 |
| Qwen/Qwen3.5-4B | ctx_tg @ d65535 (c2) | 11.46 ± 0.19 | 12.91 ± 6.68 | 42.67 ± 0.94 | 21.33 ± 0.47 | | | |
| Qwen/Qwen3.5-4B | pp2048 @ d65535 (c2) | 125.92 ± 2.01 | 93.31 ± 30.39 | | | 24552.30 ± 7993.68 | 24551.46 ± 7993.68 | 24552.35 ± 7993.65 |
| Qwen/Qwen3.5-4B | tg128 @ d65535 (c2) | 11.32 ± 0.05 | 12.90 ± 6.74 | 43.33 ± 0.94 | 21.67 ± 0.47 | | | |

I ran Intel AutoRound against Qwen3.5-9B (on the Spark itself) and it completed in a little over an hour. (Tuned, not RTN.)

I’m running a single rep of GPQA against it now to make sure it’s not braindead. Seemingly, the most recent AutoRound commits provide Qwen3.5 support without any additional tweaking.

I am running MMLU on the Qwen 4B FP16 model as well, to check.

Qwen 3.5 4B MMLU on DGX Spark

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | 0.7451 | ± 0.0035 |
| - humanities | 2 | none | 0 | acc | 0.6618 | ± 0.0066 |
| - other | 2 | none | 0 | acc | 0.7811 | ± 0.0071 |
| - social sciences | 2 | none | 0 | acc | 0.8372 | ± 0.0066 |
| - stem | 2 | none | 0 | acc | 0.7441 | ± 0.0075 |

Steps to install AutoRound on the Spark and use it to quantize a model to int4. (I can at least do it in a reasonable amount of time on a 9B dense model; not sure about any other size, YMMV.)

# uv install not covered; I don't like that they encourage piping curl to bash

uv venv --python 3.12 --seed .venv/autoround
source .venv/autoround/bin/activate

# install these from wheels
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

# install from source (these are going to take a bit)
uv pip install git+https://github.com/intel/auto-round.git
uv pip install -U git+https://github.com/fla-org/flash-linear-attention --no-build-isolation
uv pip install causal-conv1d --no-build-isolation

# --ignore_layers shared_expert does nothing on the dense models, but I'm leaving it in since it was in the Intel docs for the MoE version
auto-round "Qwen/Qwen3.5-9B" --output_dir ./qwen35-9b/ --ignore_layers shared_expert

GPQA scores align with what Qwen has posted, but token efficiency leaves something to be desired (the fastest token is the one you don’t need to generate). For comparison, Qwen35b-A3 only needs ~1.7 million tokens for a single run.

=== GPQA Diamond ===
base_url:                 http://spark:8000/v1
model:                    /models/Qwen3.5-9B-w4g128
questions:                198
repeats:                  1
total eval calls:         198
score (all repeats):      0.8283 (82.83%)
correct / total:          164 / 198
failed requests:          0
prompt tokens total:      51,878
completion tokens total:  4,473,241
reasoning tokens total:   0
total tokens:             4,525,119
avg tokens / call:        22854.1
wall time (s):            14575.5
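
To put the token-efficiency complaint in numbers, here is a small sketch that re-derives the summary figures from the raw counts in the report above and compares the total against the ~1.7 million token figure quoted for the other model. The generation rate is an aggregate over however many requests ran concurrently, which the post does not state.

```python
# Raw counts copied from the GPQA Diamond report above
correct, total = 164, 198
prompt_tokens, completion_tokens = 51_878, 4_473_241
wall_s = 14575.5
reference_tokens = 1.7e6  # the "~1.7 million tokens" quoted for the comparison model

score = correct / total
total_tokens = prompt_tokens + completion_tokens
avg_tokens_per_call = total_tokens / total
aggregate_tps = completion_tokens / wall_s       # summed over all in-flight requests
token_ratio = total_tokens / reference_tokens    # how many times more tokens this run used

print(f"score: {score:.2%}")
print(f"avg tokens / call: {avg_tokens_per_call:.1f}")
print(f"aggregate generation rate: {aggregate_tps:.0f} t/s")
print(f"tokens vs reference run: {token_ratio:.1f}x")
```

The re-derived score and average match the report, and the run used well over twice the tokens of the comparison figure.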

AutoRound is honestly magic.

Quick thoughts on AutoRound:

Default tuning is 200 iterations (iters). Given it’s pretty efficient on Spark, we could increase that in many cases; they say 1000 is higher quality.

Did you track memory usage? Apparently the default batch size is 8.

I’m curious if anyone has experimented with the calibration dataset, which by default is “NeelNanda/pile-10k” and also has knobs for number of samples and sequence length.

It can also output NVFP4 in llm_compressor format, which may be relevant to compare quality of quants once backend support improves.
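
For anyone who wants to experiment with those knobs from Python rather than the CLI, here is a minimal sketch. The iters, batch size and dataset defaults are the ones mentioned above; the nsamples and seqlen defaults, the `AutoRound(...)` keyword names and `quantize_and_save` are my assumptions about the auto-round Python API and should be checked against the installed version. The output directory is hypothetical.

```python
def autoround_settings(iters=200, batch_size=8, dataset="NeelNanda/pile-10k",
                       nsamples=128, seqlen=2048):
    """Collect the tuning knobs discussed above in one place.

    iters, batch_size and dataset defaults come from the thread; the nsamples
    and seqlen defaults are ASSUMPTIONS to verify against your auto-round version.
    """
    return {"iters": iters, "batch_size": batch_size, "dataset": dataset,
            "nsamples": nsamples, "seqlen": seqlen}


def quantize(model_id, output_dir, **overrides):
    """Run int4 group-size-128 quantization with the given knob overrides."""
    # Imported lazily so this file still loads where auto-round is not installed.
    from auto_round import AutoRound  # ASSUMPTION: API as in the project README
    ar = AutoRound(model_id, bits=4, group_size=128, **autoround_settings(**overrides))
    ar.quantize_and_save(output_dir, format="auto_round")


default = autoround_settings()
slower = autoround_settings(iters=1000)
print(f"iterations vs default: {slower['iters'] / default['iters']:.0f}x")

# Hypothetical invocation, left commented out because it takes about an hour on a 9B model:
# quantize("Qwen/Qwen3.5-9B", "./qwen35-9b-iters1000", iters=1000)
```

Keeping the knobs in one dict makes it easy to log exactly which settings produced which quant when comparing eval scores.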

From Intel’s AutoRound docs, it looks like going beyond the defaults gains nothing except 4x the time to quantize the model.

Not sure about the calibration dataset. I do know the Nemotron 3 Nano/Super dataset is supposedly very solid, but I have no clue what that means for additional accuracy beyond what’s already here.

Total peak memory on the Spark was 38 GB, split 22 GB CPU / 16 GB GPU.

My worry is that the NVFP4 optimization is only going to be realized once we get NVFP4 KV cache quantization, as FP8 is currently the standard.

I’m not sure if there’s any performance boost to be gained from the NVFP4 format otherwise, but I would HAPPILY be told otherwise by an NVIDIA rep who has a functioning build of flashinfer/dsl/cutlass/vllm/pytorch NVFP4 humming along.