OllieJW
February 28, 2026, 1:05am
42
It is not my PR, but I appreciate you highlighting it for me. I am using the community build with --tf5 --rebuild-flashinfer --rebuild-vllm --vllm-ref "${VLLM_SHA}" (nightly), but it has been very difficult to get nvidia/Qwen3.5-397B-A17B-NVFP4 or lukealonso/GLM-5-NVFP4 working. I can get vLLM serving nvidia/Qwen3.5-397B-A17B-NVFP4, but I run into CUDA kernel faults midway through generation (after a couple of tool calls).
I think I am going to give Intel/GLM-5-int4-mixed-AutoRound a shot next.
eugr
February 28, 2026, 1:17am
43
Yeah, NVFP4 is hit-and-miss on Spark currently. Looks like AutoRound quants have taken the crown from AWQ though :)
That’s the first I’ve heard of this. More context for those interested: Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs
Getting a specific crash: a causal_conv1d_update assertion (num_cache_lines >= batch) during CUDA graph capture.
Using these flags:
--apply-mod mods/fix-qwen3.5-autoround
-e VLLM_MARLIN_USE_ATOMIC_ADD=1
exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound
--max-model-len auto
--gpu-memory-utilization 0.85
--port 8000
--host 0.0.0.0
-tp 2
--distributed-executor-backend ray
--load-format fastsafetensors
--enable-prefix-caching
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--max-num-batched-tokens 8192
--trust-remote-code
Not sure how the other poster got it running. Using the latest builds and the tf5 image.
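Since the assertion fires during CUDA graph capture, one workaround worth trying (untested here, just a guess) is to disable graph capture entirely with --enforce-eager, at the cost of some decode throughput:
--apply-mod mods/fix-qwen3.5-autoround
-e VLLM_MARLIN_USE_ATOMIC_ADD=1
exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound
--enforce-eager
(rest of the flags as above)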
I was able to get Qwen/Qwen3.5-397B-A17B-FP8 running with the following, using a copy of @eugr's nightly vLLM + tf5 build:
# Start container
./launch-cluster.sh -d start \
--nodes "$SPARK_NODES" \
--name vllm_node \
-t "$VLLM_IMAGE_TAG" \
--eth-if "$FABRIC_IF" \
--ib-if "$IB_IF"
# Patch TF5 RoPE bug
for ip in ${SPARK_NODES//,/ }; do
echo "== Patching on $ip =="
ssh -o BatchMode=yes "$SPARK_USER@$ip" "docker exec vllm_node bash -lc '
set -euo pipefail
FILE=\"/usr/local/lib/python3.12/dist-packages/transformers/modeling_rope_utils.py\"
test -f \"\$FILE\"
sed -i \"s/ignore_keys_at_rope_validation = ignore_keys_at_rope_validation | {/ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {/g\" \"\$FILE\"
grep -n \"set(ignore_keys_at_rope_validation) |\" \"\$FILE\" | head -n 2 || true
echo \"OK: patched \$FILE\"
'"
done
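# Optional sanity check (assumes python3 and transformers are importable inside vllm_node):
# confirm the patched modeling_rope_utils module still imports on the first node
FIRST_NODE="${SPARK_NODES%%,*}"
ssh -o BatchMode=yes "$SPARK_USER@$FIRST_NODE" \
  "docker exec vllm_node python3 -c 'import transformers.modeling_rope_utils'"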
# Start vLLM on head node
./launch-cluster.sh \
--nodes "$SPARK_NODES" \
--name vllm_node \
-t "$VLLM_IMAGE_TAG" \
exec "bash -lc '
set -euo pipefail
# NCCL/RDMA bindings
export NCCL_SOCKET_IFNAME="${FABRIC_IF}"
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA="${IB_IF}"
export NCCL_IB_GID_INDEX=${IB_GID_INDEX}
# MPI/UCX bindings
export OMPI_MCA_btl_tcp_if_include="${FABRIC_IF}"
export OMPI_MCA_oob_tcp_if_include="${FABRIC_IF}"
export UCX_NET_DEVICES="${FABRIC_IF}"
# vLLM optimizations
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
vllm serve \"\$MODEL\" \
--served-model-name "Qwen/Qwen3.5-397B-A17B-FP8" \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.85 \
--load-format fastsafetensors \
--attention-backend flashinfer \
--enable-expert-parallel \
--tensor-parallel-size 4 \
--max-num-seqs 32 \
--compilation-config.cudagraph_mode none \
--trust-remote-code \
--enable-prefix-caching \
--kv-cache-dtype fp8 \
--max-num-batched-tokens 8192 \
--max-model-len auto \
--enable-auto-tool-choice \
--mm-encoder-tp-mode data \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--host 0.0.0.0 \
--port 8000
'"
I did a quick test query that averaged ~18 t/s generation after four successful, sequential tool calls. Will run llama-benchy tomorrow.
Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 30.6%.
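For reference, a minimal smoke test against the OpenAI-compatible endpoint looks like this (generic example with a placeholder prompt, not the exact query above):
curl -s http://<head-node-ip>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3.5-397B-A17B-FP8", "messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 64}'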
Thanks for trying that out
eugr
March 2, 2026, 6:24am
48
This patch is already part of my repo; you can greatly simplify your launch by just using:
./launch-cluster.sh -t vllm-node-tf5 --apply-mod mods/fix-qwen3.5-autoround exec vllm ....
s0ne
March 5, 2026, 4:26am
49
I recently ran the released Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 on 4 DGX Sparks; the command line and benchmark results are below.
nohup ./launch-cluster.sh \
-t vllm-node-tf5 \
--apply-mod mods/fix-qwen3.5-autoround \
--nodes "169.254.71.59,169.254.93.49,169.254.100.145,169.254.46.240" \
exec vllm serve \
Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.8 \
--tensor-parallel-size 4 \
--distributed-executor-backend ray \
--attention-backend flashinfer \
--enable-auto-tool-choice \
--trust-remote-code \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--max-model-len auto \
--chat-template /root/chat-templates/qwen3.5-openclaw-fixed-chat-template.jinja \
--load-format fastsafetensors \
--mm-encoder-tp-mode data \
--enable-prefix-caching \
--max-num-batched-tokens 8192 \
--max-num-seqs 32
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---------------------------------|-------:|----------------:|-------------:|---------------:|---------------:|----------------:|
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 | 1634.75 ± 8.12 | | 1255.11 ± 6.22 | 1253.43 ± 6.22 | 1255.16 ± 6.23 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | tg32 | 24.52 ± 0.02 | 25.00 ± 0.00 | | | |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 @ d4096 | 2540.64 ± 4.46 | | 2420.50 ± 4.06 | 2418.82 ± 4.06 | 2420.55 ± 4.06 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | tg32 @ d4096 | 24.30 ± 0.08 | 25.00 ± 0.00 | | | |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 @ d8192 | 2659.70 ± 2.16 | | 3851.86 ± 2.97 | 3850.18 ± 2.97 | 3851.91 ± 2.95 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | tg32 @ d8192 | 24.05 ± 0.11 | 25.00 ± 0.00 | | | |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 @ d16384 | 2737.07 ± 4.33 | | 6736.14 ± 10.51 | 6734.46 ± 10.51 | 6736.20 ± 10.52 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | tg32 @ d16384 | 23.75 ± 0.04 | 24.00 ± 0.00 | | | |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 @ d65536 | 2513.67 ± 6.63 | | 26888.88 ± 71.05 | 26887.20 ± 71.05 | 26888.95 ± 71.04 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | tg32 @ d65536 | 22.79 ± 0.05 | 24.00 ± 0.00 | | | |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 @ d100000 | 2316.80 ± 27.04 | | 44055.16 ± 515.58 | 44053.48 ± 515.58 | 44055.22 ± 515.56 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | tg32 @ d100000 | 21.78 ± 0.09 | 22.67 ± 0.47 | | | |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 @ d200000 | 1929.04 ± 1.53 | | 104742.64 ± 82.81 | 104740.96 ± 82.81 | 104742.72 ± 82.83 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | tg32 @ d200000 | 19.33 ± 0.11 | 20.00 ± 0.00 | | | |
llama-benchy (0.3.4)
date: 2026-03-05 13:10:53 | latency mode: api
I tried running the Intel version of Qwen3.5 397B on dual DGX Sparks: Intel/Qwen3.5-397B-A17B-int4-AutoRound · Hugging Face
Here’s what I’m seeing:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---------------------------------|-------:|----------------:|-------------:|---------------:|---------------:|----------------:|
| Qwen3.5-397B-A17B-int4-AutoRound | pp2048 | 1646.45 ± 11.40 | | 1245.75 ± 8.64 | 1244.55 ± 8.64 | 1245.79 ± 8.65 |
| Qwen3.5-397B-A17B-int4-AutoRound | tg32 | 24.94 ± 0.29 | 25.67 ± 0.47 | | | |
So, almost 26 t/s. Not bad.