Qwen3.5-397B-A17B + DGX Spark (duo)

It is not my PR, but I appreciate you highlighting it for me. I am using the community build `--tf5 --rebuild-flashinfer --rebuild-vllm --vllm-ref "${VLLM_SHA}"` (nightly), but it has been very difficult to get nvidia/Qwen3.5-397B-A17B-NVFP4 or lukealonso/GLM-5-NVFP4 working. I can get vLLM to serve nvidia/Qwen3.5-397B-A17B-NVFP4, but I run into CUDA kernel faults midway through generation (after a couple of tool calls).

I think I am going to give Intel/GLM-5-int4-mixed-AutoRound a shot next.

Yeah, NVFP4 is hit and miss on Spark currently. Looks like autoround quants took the crown from AWQ though :)


That’s the first I’ve heard of this. More context for those interested: Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs


I'm hitting a specific crash: a `causal_conv1d_update` assertion (`num_cache_lines >= batch`) during CUDA graph capture.

Using these flags:

--apply-mod mods/fix-qwen3.5-autoround \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
  --max-model-len auto \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  -tp 2 \
  --distributed-executor-backend ray \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code

Not sure how the other dude got it running; I'm on the latest builds and the tf5 image.


I was able to get Qwen/Qwen3.5-397B-A17B-FP8 running with the following, using a copy of @eugr's nightly vLLM + tf5 build:

# Start container
./launch-cluster.sh -d start \
  --nodes "$SPARK_NODES" \
  --name vllm_node \
  -t "$VLLM_IMAGE_TAG" \
  --eth-if "$FABRIC_IF" \
  --ib-if "$IB_IF"

# Patch TF5 RoPE bug
for ip in ${SPARK_NODES//,/ }; do
  echo "== Patching on $ip =="
  ssh -o BatchMode=yes "$SPARK_USER@$ip" "docker exec vllm_node bash -lc '
set -euo pipefail
FILE=\"/usr/local/lib/python3.12/dist-packages/transformers/modeling_rope_utils.py\"
test -f \"\$FILE\"
sed -i \"s/ignore_keys_at_rope_validation = ignore_keys_at_rope_validation | {/ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {/g\" \"\$FILE\"
grep -n \"set(ignore_keys_at_rope_validation) |\" \"\$FILE\" | head -n 2 || true
echo \"OK: patched \$FILE\"
'"
done
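For context on what that `sed` is working around: the replacement it makes strongly suggests that in the affected transformers build, `ignore_keys_at_rope_validation` reaches the union as a non-set iterable, and Python does not allow `|` between, say, a tuple and a set. A minimal standalone illustration (my own stand-in values, not the actual transformers code):

```python
# Illustration of the failure mode the sed patch above works around.
# Assumption: ignore_keys_at_rope_validation arrives as a non-set
# iterable (a tuple here); `tuple | set` raises TypeError.

ignore_keys = ("rope_type", "type")  # stand-in for the real variable

# The unpatched expression fails:
try:
    merged = ignore_keys | {"factor"}
except TypeError as e:
    print(f"unpatched: {e}")

# The patched expression (wrapping in set()) succeeds:
merged = set(ignore_keys) | {"factor"}
print(sorted(merged))  # → ['factor', 'rope_type', 'type']
```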

# Start vLLM on head node
./launch-cluster.sh \
  --nodes "$SPARK_NODES" \
  --name vllm_node \
  -t "$VLLM_IMAGE_TAG" \
  exec "bash -lc '
set -euo pipefail

# NCCL/RDMA bindings
export NCCL_SOCKET_IFNAME="${FABRIC_IF}"
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA="${IB_IF}"
export NCCL_IB_GID_INDEX=${IB_GID_INDEX}

# MPI/UCX bindings
export OMPI_MCA_btl_tcp_if_include="${FABRIC_IF}"
export OMPI_MCA_oob_tcp_if_include="${FABRIC_IF}"
export UCX_NET_DEVICES="${FABRIC_IF}"

# vLLM optimizations
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

vllm serve \"\$MODEL\" \
  --served-model-name "Qwen/Qwen3.5-397B-A17B-FP8" \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --enable-expert-parallel \
  --tensor-parallel-size 4 \
  --max-num-seqs 32 \
  --compilation-config.cudagraph_mode none \
  --trust-remote-code \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --max-model-len auto \
  --enable-auto-tool-choice \
  --mm-encoder-tp-mode data \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --host 0.0.0.0 \
  --port 8000
'"

I did a quick test query that averaged ~18 t/s generation after four successful, sequential tool calls. Will run llama-benchy tomorrow.

Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 30.6%.

Thanks for trying that out


This patch is already part of my repo; you can greatly simplify your launch by just using:

./launch-cluster.sh -t vllm-node-tf5 --apply-mod mods/fix-qwen3.5-autoround exec vllm ....

I recently ran the released Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 on 4 DGX Sparks; here are the command line and benchmark results.

nohup ./launch-cluster.sh \
  -t vllm-node-tf5 \
  --apply-mod mods/fix-qwen3.5-autoround \
  --nodes "169.254.71.59,169.254.93.49,169.254.100.145,169.254.46.240" \
  exec vllm serve \
  Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 \
  --port 8000 --host 0.0.0.0 \
  --gpu-memory-utilization 0.8 \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray \
  --attention-backend flashinfer \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --max-model-len auto \
  --chat-template /root/chat-templates/qwen3.5-openclaw-fixed-chat-template.jinja \
  --load-format fastsafetensors \
  --mm-encoder-tp-mode data \
  --enable-prefix-caching \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 32

| model                            |              test |             t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:---------------------------------|------------------:|----------------:|-------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |            pp2048 |  1634.75 ± 8.12 |              |    1255.11 ± 6.22 |    1253.43 ± 6.22 |    1255.16 ± 6.23 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |              tg32 |    24.52 ± 0.02 | 25.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   pp2048 @ d4096  |  2540.64 ± 4.46 |              |    2420.50 ± 4.06 |    2418.82 ± 4.06 |    2420.55 ± 4.06 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |     tg32 @ d4096  |    24.30 ± 0.08 | 25.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   pp2048 @ d8192  |  2659.70 ± 2.16 |              |    3851.86 ± 2.97 |    3850.18 ± 2.97 |    3851.91 ± 2.95 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |     tg32 @ d8192  |    24.05 ± 0.11 | 25.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  pp2048 @ d16384  |  2737.07 ± 4.33 |              |   6736.14 ± 10.51 |   6734.46 ± 10.51 |   6736.20 ± 10.52 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |    tg32 @ d16384  |    23.75 ± 0.04 | 24.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  pp2048 @ d65536  |  2513.67 ± 6.63 |              |  26888.88 ± 71.05 |  26887.20 ± 71.05 |  26888.95 ± 71.04 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |    tg32 @ d65536  |    22.79 ± 0.05 | 24.00 ± 0.00 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 @ d100000  | 2316.80 ± 27.04 |              | 44055.16 ± 515.58 | 44053.48 ± 515.58 | 44055.22 ± 515.56 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 @ d100000  |    21.78 ± 0.09 | 22.67 ± 0.47 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp2048 @ d200000  |  1929.04 ± 1.53 |              | 104742.64 ± 82.81 | 104740.96 ± 82.81 | 104742.72 ± 82.83 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 @ d200000  |    19.33 ± 0.11 | 20.00 ± 0.00 |                   |                   |                   |

llama-benchy (0.3.4)
date: 2026-03-05 13:10:53 | latency mode: api
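One way to read the decode columns above is relative slowdown as context depth grows. A quick back-of-the-envelope using the tg32 averages from the table (numbers copied straight from it):

```python
# tg32 averages (t/s) from the table above, keyed by context depth.
tg32 = {0: 24.52, 4096: 24.30, 8192: 24.05, 16384: 23.75,
        65536: 22.79, 100000: 21.78, 200000: 19.33}

base = tg32[0]
for depth, tps in tg32.items():
    slowdown = 100 * (1 - tps / base)
    print(f"d{depth:>6}: {tps:5.2f} t/s ({slowdown:4.1f}% below d0)")
```

Roughly a 7% drop at 64k of context and about 21% at 200k, which is fairly graceful for decode at that depth.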

I tried running the Intel version of Qwen3.5 397B on dual DGX Sparks: Intel/Qwen3.5-397B-A17B-int4-AutoRound · Hugging Face
Here’s what I’m seeing:

| model                            |   test |             t/s |     peak t/s |      ttfr (ms) |   est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------|-------:|----------------:|-------------:|---------------:|---------------:|----------------:|
| Qwen3.5-397B-A17B-int4-AutoRound | pp2048 | 1646.45 ± 11.40 |              | 1245.75 ± 8.64 | 1244.55 ± 8.64 |  1245.79 ± 8.65 |
| Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |    24.94 ± 0.29 | 25.67 ± 0.47 |                |                |                 |

So, almost 26 t/s. Not bad.
