Compiling, thanks.
Here I go testing again:
Please don't get my hopes up again jimmy, I've been crashing on nvfp4 for seemingly a hundred years.
I'll report back in the morning if it crashed overnight.
Update: still crashing. I also found this issue tracking the nvfp4 FlashInfer crashes:
Hi everyone,
I am currently benchmarking a Dual DGX Spark cluster using the amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 model with vLLM.
Despite the high-end hardware, I am experiencing very low performance, averaging only about 1 token per second (t/s). I suspect there is a bottleneck in my configuration or the multi-node setup.
Below is the recipe and configuration I am using:
Configuration Details:

- Model: amd/Qwen3-235B-A22B-Instruct-2507-MXFP4
- Quantization: MXFP4
- Backend: vLLM with FlashInfer
- Tensor Parallelism (TP): 2
- Hardware: Dual DGX Spark (connected via ConnectX-7 200Gb/s)
Recipe:
recipe_version: '1'
name: Qwen3 235B A22B Instruct 2507 MXFP4
description: vLLM serving amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 with MXFP4 quantization and FlashInfer
model: amd/Qwen3-235B-A22B-Instruct-2507-MXFP4
container: vllm-node-mxfp4
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_num_batched_tokens: 8192
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: '1'
command: |
  vllm serve amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size {tensor_parallel} \
    --distributed-executor-backend ray \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --host {host} \
    --port {port}
Questions:

- With a dual-node setup connected via a single 200Gb/s link, is 1 t/s expected for a model of this size (235B) when using tensor_parallel: 2?
- Are there any specific vLLM flags or environment variables I should tune to minimize inter-node communication latency for this single-cable interconnect?
- Given the hardware constraints (2 nodes, 200Gb/s interconnect), are there other high-parameter models that are known to be better optimized for this type of multi-node distribution?
- Would adjusting max_num_batched_tokens or other memory-related settings help improve throughput without changing the tensor parallel size?

Any guidance on how to optimize this for better performance would be greatly appreciated!
Here are my benchmark results:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 | 1207.23 ± 68.86 | 3694.78 ± 101.16 | 1702.21 ± 101.16 | 3694.83 ± 101.16 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d4096 | 1338.31 ± 4.75 | 5053.18 ± 10.89 | 3060.60 ± 10.89 | 5053.21 ± 10.89 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d4096 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d4096 | 1108.37 ± 5.21 | 3840.37 ± 8.66 | 1847.80 ± 8.66 | 3840.40 ± 8.65 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d4096 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d8192 | 1356.39 ± 4.06 | 8032.20 ± 18.08 | 6039.62 ± 18.08 | 8032.24 ± 18.07 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d8192 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d8192 | 1014.40 ± 12.06 | 4011.79 ± 24.03 | 2019.22 ± 24.03 | 4011.83 ± 24.03 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d8192 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d16384 | 1043.30 ± 1.05 | 17696.62 ± 15.86 | 15704.04 ± 15.86 | 17696.66 ± 15.85 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d16384 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d16384 | 817.49 ± 11.89 | 4498.33 ± 36.80 | 2505.76 ± 36.80 | 4498.37 ± 36.81 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d16384 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d32768 | 808.63 ± 0.54 | 42515.50 ± 26.97 | 40522.93 ± 26.97 | 42515.54 ± 26.97 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d32768 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d32768 | 603.59 ± 3.46 | 5385.73 ± 19.47 | 3393.16 ± 19.47 | 5385.77 ± 19.47 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d32768 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d65535 | 590.79 ± 2.29 | 112921.29 ± 430.42 | 110928.72 ± 430.42 | 112921.34 ± 430.42 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d65535 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d65535 | 396.64 ± 2.60 | 7156.22 ± 33.75 | 5163.65 ± 33.75 | 7156.26 ± 33.75 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d65535 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d100000 | 463.92 ± 0.85 | 217546.68 ± 397.46 | 215554.11 ± 397.46 | 217546.72 ± 397.46 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d100000 | 0.61 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d100000 | 288.37 ± 2.29 | 9094.98 ± 56.62 | 7102.40 ± 56.62 | 9095.02 ± 56.63 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d100000 | 0.61 ± 0.00 | 1.00 ± 0.00 | | | |
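As a sanity check on whether the single 200Gb/s link alone could explain ~0.6 t/s decode, here is a rough back-of-envelope sketch. Every number in it is an assumption rather than a measurement: the layer count and hidden size are ballpark figures for Qwen3-235B-A22B, and the all-reduce latency is a guess for one ConnectX-7 hop.

```python
# Back-of-envelope: can the single 200Gb/s link alone explain ~0.6 t/s?
# All numbers below are assumptions, not measurements.

LAYERS = 94                 # assumed transformer layer count
ALLREDUCES_PER_LAYER = 2    # typical TP: one after attention, one after the MLP
HIDDEN = 4096               # assumed hidden size
BYTES_PER_ELEM = 2          # bf16 activations

link_bw = 200e9 / 8         # 200 Gb/s -> bytes per second
per_msg_latency = 50e-6     # assumed small-message all-reduce latency (s)

msg_bytes = HIDDEN * BYTES_PER_ELEM  # one decoded token's activations
per_token_comm = LAYERS * ALLREDUCES_PER_LAYER * (
    per_msg_latency + msg_bytes / link_bw)

print(f"comm time per decoded token: {per_token_comm * 1e3:.2f} ms")
print(f"interconnect-only ceiling:   {1 / per_token_comm:.0f} t/s")
```

Even with these pessimistic latency assumptions, the predicted ceiling is roughly two orders of magnitude above the observed 0.63 t/s, which would point at the kernel/quantization path rather than the interconnect.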
The MXFP4 container has been tuned for gpt-oss-120b only and is not guaranteed to work with other models. I would recommend AWQ or INT4 AutoRound (if that quant exists).
I haven't tested NVFP4 with the most recent build yet - there are some improvements in flashinfer and vllm, so it may be a viable candidate too.
With QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ I was getting 26 t/s on dual Sparks.
Thank you so much for the clarification! That explains why I was struggling with MXFP4. 26 t/s is impressive! I will try the QuantTrio AWQ model and update the vLLM build as you suggested. Appreciate the help!
Hey Eugr, thanks again for the previous tips!
I'm now trying to run Qwen3.5 397B (MoE) using the AWQ version. However, I'm running into version mismatch issues between vLLM and Transformers in my current environment.
Should I attempt to manually rebuild/overwrite the Dockerfile with the latest versions of vLLM and Transformers? Or do you have a recommendation for a more stable quantization format or a specific vLLM build/branch that is better optimized for this 397B scale on a dual-node setup?
Iād love to hear your thoughts on the best stack to benchmark this beast. Thanks!
You need to run ./build-and-copy.sh with the --tf5 flag. You should also use the -t argument to give it a tag.
Hi again! Thanks for the tip on ./build-and-copy.sh --tf5. It worked perfectly, and I can now load the model.
However, I've hit a new bottleneck: OOM (Out of Memory) during the cache_block allocation phase for the Qwen 3.5 397B AWQ model.
My setup is a Dual DGX Spark (256GB total VRAM). With the model weights taking up a huge chunk of memory, there isn't much left for the KV cache.
What is the best way to optimize the memory footprint for this 397B scale on 2 nodes? Should I lower the gpu_memory_utilization, or is there a specific max_model_len or --kv-cache-dtype fp8 setting you recommend to fit this into 256GB VRAM? Thanks for your support!
Check out my thread on the OOM and Qwen3.5-397B - I had been fighting this for the last 3 days.
Happy to help. I'm not sure what else you are running on the boxes, but here is what I used to start the model on my units:
./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-autoround \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
    --chat-template unsloth.jinja \
    --max-model-len 128000 \
    --gpu-memory-utilization 0.85 \
    --port 8555 \
    --host 0.0.0.0 \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-prefix-caching \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    -tp 2 \
    --distributed-executor-backend ray
A couple of things to try:

- Set --kv-cache-dtype fp8.
- Try a lower max-model-len and work your way up. I know I can start with 128K tokens with Open WebUI and a couple of MCP servers running, but not much else.
- Don't use --load-format fastsafetensors at this GPU memory utilization.

Good luck.
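To see why a lower max-model-len helps, a rough KV-cache sizing sketch is useful. The dimensions below are placeholders, not the real Qwen3.5-397B config: read num_hidden_layers, num_key_value_heads, and head_dim from the model's config.json and substitute them.

```python
# Rough KV-cache sizing to pick a --max-model-len that fits in the memory
# left after the weights. LAYERS / KV_HEADS / HEAD_DIM are PLACEHOLDERS;
# take the real values from the model's config.json.

LAYERS = 94          # placeholder layer count
KV_HEADS = 4         # placeholder GQA key/value head count
HEAD_DIM = 128       # placeholder head dimension
KV_DTYPE_BYTES = 1   # fp8 KV cache (use 2 for fp16/bf16)

# 2x for keys and values
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES

for max_len in (32_000, 64_000, 128_000):
    gib = max_len * bytes_per_token / 2**30
    print(f"--max-model-len {max_len:>7}: ~{gib:.1f} GiB of KV cache")
```

This is the arithmetic behind the "lower max-model-len and work your way up" advice: fp8 halves the per-token footprint, and tensor parallelism additionally splits the cache across the two nodes.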
A few more to watch; these are pretty exciting for a potentially large boost to decode/memory bandwidth.
Now if nvfp4 through flashinfer/cutlass would just stop crashing, we'd have a tangible improvement instead of more community "breakthroughs".
While this would definitely bring a huge boost, I would be hesitant to go below fp8 in KV cache as it is usually much more sensitive to quantization than the model weights.
What made you think it wouldnāt OOM on your 2 Sparks? QuantTrio/Qwen3.5-397B-A17B-AWQ is 244GB according to the HF repo page. Even if that left enough of the 256GB total memory for both OSes and VLLM (very close), it would leave nothing for context size, prompt cache, etc. Use the 199GB model josephbreda linked.
Update:

- NVFP4 marlin backend: working
- NVFP4 vLLM CUTLASS: working
- NVFP4 flashinfer_cutlass: illegal exceptions for days
For those who have started running NVFP4 models: just a heads up that the PR that was just merged currently needs the GPU arch set to 12.0f to compile without "no nvfp4 kernel" runtime errors.
Waiting for mgoin to complete his rip and tear through GitHub before I try my flashinfer_cutlass stress test again.
This will have been the 4th time I've gotten my hopes up.
Thanks, I've just run my nightly build; I'll rerun it again, I guess :)
Hmm... It is supposed to work correctly when built with the 12.1a arch, but gives me "NotImplementedError: No compiled nvfp4 quantization kernel".
EDIT: I guess I misread your post - you were warning exactly about this (facepalm).
My guess is the crashing/illegal instructions persist, but I'm willing to attempt again. For all I know, there are multiple bugs landing on the same result, judging by the amount of complexity in the kernel feature/selection process across CUTLASS, FlashInfer, vLLM, FlashAttention, etc.
But it is nice to see them make progress on actually adding a Spark to their CI/CD.
OK, Johnny submitted a PR that should fix it, compiling now: [NVIDIA] Fix DGX Spark logic by johnnynunez Ā· Pull Request #38126 Ā· vllm-project/vllm Ā· GitHub
This should have all been mainlined before the Spark was even available to purchase.
It's a sad state to see the latest Intel GPUs land with day-0 support when, a year later, we're still in Spark PR hell.