Compiling thanks
Here I go testing again:
please don't get my hopes up again jimmy, I've been crashing on nvfp4 for seemingly a hundred years
I'll report back in the morning if it crashed overnight.
Update: still crashing; found this issue, which also tracks the nvfp4 flashinfer crashes:
Hi everyone,
I am currently benchmarking a Dual DGX Spark cluster using the amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 model with vLLM.
Despite the high-end hardware, I am experiencing very low performance, averaging only about 1 token per second (t/s). I suspect there is a bottleneck in my configuration or the multi-node setup.
Below is the recipe and configuration I am using:
Configuration Details:
- Model: amd/Qwen3-235B-A22B-Instruct-2507-MXFP4
- Quantization: MXFP4
- Backend: vLLM with FlashInfer
- Tensor Parallelism (TP): 2
- Hardware: Dual DGX Spark (connected via ConnectX-7 200Gb/s)
Recipe:

```yaml
recipe_version: '1'
name: Qwen3 235B A22B Instruct 2507 MXFP4
description: vLLM serving amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 with MXFP4 quantization and FlashInfer
model: amd/Qwen3-235B-A22B-Instruct-2507-MXFP4
container: vllm-node-mxfp4
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_num_batched_tokens: 8192
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: '1'
command: |
  vllm serve amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size {tensor_parallel} \
    --distributed-executor-backend ray \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --host {host} \
    --port {port}
```
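For reference, here is a quick sketch of how the recipe's `{placeholder}` fields could expand into the final serve command. This assumes the recipe runner does plain Python `str.format`-style substitution (an assumption about the tooling, not confirmed), and the template below is abbreviated to just the tunable flags:

```python
# Hypothetical expansion of the recipe's {placeholders} using the `defaults`
# section; the real recipe runner may substitute differently.
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "tensor_parallel": 2,
    "gpu_memory_utilization": 0.7,
    "max_num_batched_tokens": 8192,
}

# Abbreviated command template (only the parameterized flags shown).
template = (
    "vllm serve amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 "
    "--tensor-parallel-size {tensor_parallel} "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--max-num-batched-tokens {max_num_batched_tokens} "
    "--host {host} --port {port}"
)

print(template.format(**defaults))
```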
Questions:
- With a dual-node setup connected via a single 200Gb/s link, is 1 t/s expected for a model of this size (235B) when using tensor_parallel: 2?
- Are there any specific vLLM flags or environment variables I should tune to minimize inter-node communication latency for this single-cable interconnect?
- Given the hardware constraints (2 nodes, 200Gb/s interconnect), are there other high-parameter models known to be better optimized for this type of multi-node distribution?
- Would adjusting max_num_batched_tokens or other memory-related settings help improve throughput without changing the tensor parallel size?
Any guidance on how to optimize this for better performance would be greatly appreciated!
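To sanity-check the interconnect question, here is a back-of-envelope estimate of the per-token TP=2 all-reduce cost at batch size 1. The model dimensions and per-operation latency below are assumptions for illustration (check the checkpoint's config.json and your actual RDMA latency), not measured values:

```python
# Back-of-envelope: time per decoded token spent in tensor-parallel
# all-reduces across a 200Gb/s link, batch size 1.
# All model dims below are assumed placeholders -- verify against config.json.
HIDDEN = 4096               # hidden_size (assumed)
LAYERS = 94                 # num_hidden_layers (assumed)
DTYPE_BYTES = 2             # bf16 activations
ALLREDUCES_PER_LAYER = 2    # typically one after attention, one after the MLP/MoE block

def decode_step_cost(link_gbps=200.0, per_op_latency_us=30.0):
    """Return (bandwidth-bound, latency-bound) seconds per decoded token."""
    ops = LAYERS * ALLREDUCES_PER_LAYER
    payload = HIDDEN * DTYPE_BYTES          # bytes per all-reduce at batch 1
    # A 2-rank all-reduce moves roughly 1x the payload across the link.
    wire_time = ops * payload / (link_gbps * 1e9 / 8)
    latency_time = ops * per_op_latency_us * 1e-6
    return wire_time, latency_time

wire, lat = decode_step_cost()
print(f"bandwidth-bound: {wire*1e3:.3f} ms/token, latency-bound: {lat*1e3:.1f} ms/token")
```

Under these assumptions both terms come out to a few milliseconds per token at most, far below the ~1.6 s/token observed, which hints that the bottleneck may be in the kernel/quantization path rather than the 200Gb/s cable itself.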
Here are my benchmark results:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 | 1207.23 ± 68.86 | 3694.78 ± 101.16 | 1702.21 ± 101.16 | 3694.83 ± 101.16 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d4096 | 1338.31 ± 4.75 | 5053.18 ± 10.89 | 3060.60 ± 10.89 | 5053.21 ± 10.89 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d4096 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d4096 | 1108.37 ± 5.21 | 3840.37 ± 8.66 | 1847.80 ± 8.66 | 3840.40 ± 8.65 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d4096 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d8192 | 1356.39 ± 4.06 | 8032.20 ± 18.08 | 6039.62 ± 18.08 | 8032.24 ± 18.07 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d8192 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d8192 | 1014.40 ± 12.06 | 4011.79 ± 24.03 | 2019.22 ± 24.03 | 4011.83 ± 24.03 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d8192 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d16384 | 1043.30 ± 1.05 | 17696.62 ± 15.86 | 15704.04 ± 15.86 | 17696.66 ± 15.85 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d16384 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d16384 | 817.49 ± 11.89 | 4498.33 ± 36.80 | 2505.76 ± 36.80 | 4498.37 ± 36.81 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d16384 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d32768 | 808.63 ± 0.54 | 42515.50 ± 26.97 | 40522.93 ± 26.97 | 42515.54 ± 26.97 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d32768 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d32768 | 603.59 ± 3.46 | 5385.73 ± 19.47 | 3393.16 ± 19.47 | 5385.77 ± 19.47 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d32768 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d65535 | 590.79 ± 2.29 | 112921.29 ± 430.42 | 110928.72 ± 430.42 | 112921.34 ± 430.42 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d65535 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d65535 | 396.64 ± 2.60 | 7156.22 ± 33.75 | 5163.65 ± 33.75 | 7156.26 ± 33.75 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d65535 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d100000 | 463.92 ± 0.85 | 217546.68 ± 397.46 | 215554.11 ± 397.46 | 217546.72 ± 397.46 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d100000 | 0.61 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d100000 | 288.37 ± 2.29 | 9094.98 ± 56.62 | 7102.40 ± 56.62 | 9095.02 ± 56.63 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d100000 | 0.61 ± 0.00 | 1.00 ± 0.00 | | | |
MXFP4 container has been tuned for gpt-oss-120b only and is not guaranteed to work with other models. I would recommend AWQ or INT4-Autoround (if that quant exists).
I haven't tested NVFP4 with the most recent build yet - there are some improvements in flashinfer and vllm, so it may be a viable candidate too.
With QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ I was getting 26 t/s on dual Sparks.
Thank you so much for the clarification! That explains why I was struggling with MXFP4. 26 t/s is impressive; I will try the QuantTrio AWQ model and update the vLLM build as you suggested. Appreciate the help!
Hey Eugr, thanks again for the previous tips!
I'm now trying to run Qwen3.5 397B (MoE) using the AWQ version. However, I'm running into version-mismatch issues between vLLM and Transformers in my current environment.
Should I attempt to manually rebuild/overwrite the Dockerfile with the latest versions of vLLM and Transformers? Or do you have a recommendation for a more stable quantization format or a specific vLLM build/branch that is better optimized for this 397B scale on a dual-node setup?
I'd love to hear your thoughts on the best stack to benchmark this beast. Thanks!
You need to run ./build-and-copy.sh with the --tf5 flag. You should also use the -t argument to give it a tag.
Hi again! Thanks for the tip on ./build-and-copy.sh --tf5. It worked perfectly, and I can now load the model.
However, I've hit a new bottleneck: OOM (out of memory) during the cache_block allocation phase for the Qwen3.5 397B AWQ model.
My setup is a Dual DGX Spark (256GB total VRAM). With the model weights taking up a huge chunk of memory, there isn't much left for the KV cache.
What is the best way to optimize the memory footprint at this 397B scale on 2 nodes? Should I lower gpu_memory_utilization, or is there a specific max_model_len or --kv-cache-dtype fp8 setting you recommend to fit this into 256GB VRAM? Thanks for your support!
Check out my thread on the OOM and Qwen3.5-397B - I had been fighting this for the last 3 days.
Happy to help. I'm not sure what else you are running on the boxes, but here is what I used to start the model on my units:

```shell
./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-autoround \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
    --chat-template unsloth.jinja \
    --max-model-len 128000 \
    --gpu-memory-utilization 0.85 \
    --port 8555 \
    --host 0.0.0.0 \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-prefix-caching \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    -tp 2 \
    --distributed-executor-backend ray
```
A couple of things to try:
- Set --kv-cache-dtype fp8.
- Try a lower max-model-len and work your way up - I know I can start with 128K tokens with Open WebUI and a couple of MCP servers running, but not much else.
- Don't use --load-format fastsafetensors at this GPU memory utilization. Good luck.
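The fp8 KV-cache and max-model-len tips above can be sanity-checked with a rough sizing sketch: per-token KV-cache cost scales with layers x KV heads x head dim x dtype width, so fp8 roughly doubles the context that fits in a given free-memory budget. All model dimensions below are hypothetical placeholders - read the real values from the checkpoint's config.json before trusting the numbers:

```python
# Rough KV-cache budget check. The model dims are PLACEHOLDERS, not the real
# Qwen3.5-397B values -- substitute the ones from config.json.
LAYERS   = 64    # num_hidden_layers (hypothetical)
KV_HEADS = 8     # num_key_value_heads, i.e. GQA groups (hypothetical)
HEAD_DIM = 128   # head_dim (hypothetical)

def kv_bytes_per_token(dtype_bytes):
    # K and V each store (kv_heads * head_dim) values per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * dtype_bytes

def max_len_that_fits(free_gb, dtype_bytes):
    # Largest context (in tokens) whose KV cache fits in `free_gb` of memory.
    return int(free_gb * 1e9 // kv_bytes_per_token(dtype_bytes))

for name, nbytes in [("bf16", 2), ("fp8", 1)]:
    per_tok = kv_bytes_per_token(nbytes) / 1024
    print(f"{name}: {per_tok:.0f} KiB/token, "
          f"~{max_len_that_fits(20, nbytes):,} tokens in 20 GB free")
```

The same arithmetic shows why lowering max-model-len helps: vLLM pre-allocates KV blocks for the configured context, so halving the length roughly halves the cache reservation.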