PSA: State of FP4/NVFP4 Support for DGX Spark in vLLM

Compiling, thanks!

Here I go testing again:

fix: reduce smem allocation for tinygemm2 kernel in SM120 by jimmyzho · Pull Request #2670 · flashinfer-ai/flashinfer

Please don’t get my hopes up again, Jimmy; I’ve been crashing on NVFP4 for what feels like a hundred years.

I’ll report back in the morning on whether it crashed overnight.

Update: still crashing. I found this issue, which also tracks the NVFP4 FlashInfer crashes:

[Bug]: Qwen3.5 NVFP4 models crash on ARM64 GB10 DGX Spark (CUDA illegal instruction during generation) · Issue #35519 · vllm-project/vllm

Hi everyone,

I am currently benchmarking a Dual DGX Spark cluster using the amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 model with vLLM.

Despite the high-end hardware, I am experiencing very low performance, averaging only about 1 token per second (t/s). I suspect there is a bottleneck in my configuration or the multi-node setup.

Below is the recipe and configuration I am using:

Configuration Details:

  • Model: amd/Qwen3-235B-A22B-Instruct-2507-MXFP4

  • Quantization: MXFP4

  • Backend: vLLM with FlashInfer

  • Tensor Parallelism (TP): 2

  • Hardware: Dual DGX Spark (connected via ConnectX-7 200Gb/s)

Recipe:

recipe_version: '1'
name: Qwen3 235B A22B Instruct 2507 MXFP4
description: vLLM serving amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 with MXFP4 quantization and FlashInfer
model: amd/Qwen3-235B-A22B-Instruct-2507-MXFP4
container: vllm-node-mxfp4
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_num_batched_tokens: 8192
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: '1'
command: |
  vllm serve amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tensor-parallel-size {tensor_parallel} \
  --distributed-executor-backend ray \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --host {host} \
  --port {port}
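
For reference, substituting the defaults and env values above into the {tensor_parallel}, {gpu_memory_utilization}, {max_num_batched_tokens}, {host}, and {port} placeholders, this recipe launches roughly:

vllm serve amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.7 \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --host 0.0.0.0 \
  --port 8000

with VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 set in the environment.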

Questions:

  1. With a dual-node setup connected via a single 200Gb/s link, is 1 t/s expected for a model of this size (235B) when using tensor_parallel: 2?

  2. Are there any specific vLLM flags or environment variables I should tune to minimize inter-node communication latency for this specific single-cable interconnect?

  3. Given the hardware constraints (2 nodes, 200Gb/s interconnect), are there other high-parameter models that are known to be better optimized for this type of multi-node distribution?

  4. Would adjusting max_num_batched_tokens or other memory-related settings help improve the throughput without changing the tensor parallel size?

Any guidance on how to optimize this for better performance would be greatly appreciated!

Plus, here are my results:

All rows: model = amd/Qwen3-235B-A22B-Instruct-2507-MXFP4. Blank cells were not reported for that test type.

| test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| pp2048 | 1207.23 ± 68.86 | | 3694.78 ± 101.16 | 1702.21 ± 101.16 | 3694.83 ± 101.16 |
| tg32 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| ctx_pp @ d4096 | 1338.31 ± 4.75 | | 5053.18 ± 10.89 | 3060.60 ± 10.89 | 5053.21 ± 10.89 |
| ctx_tg @ d4096 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| pp2048 @ d4096 | 1108.37 ± 5.21 | | 3840.37 ± 8.66 | 1847.80 ± 8.66 | 3840.40 ± 8.65 |
| tg32 @ d4096 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| ctx_pp @ d8192 | 1356.39 ± 4.06 | | 8032.20 ± 18.08 | 6039.62 ± 18.08 | 8032.24 ± 18.07 |
| ctx_tg @ d8192 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| pp2048 @ d8192 | 1014.40 ± 12.06 | | 4011.79 ± 24.03 | 2019.22 ± 24.03 | 4011.83 ± 24.03 |
| tg32 @ d8192 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| ctx_pp @ d16384 | 1043.30 ± 1.05 | | 17696.62 ± 15.86 | 15704.04 ± 15.86 | 17696.66 ± 15.85 |
| ctx_tg @ d16384 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| pp2048 @ d16384 | 817.49 ± 11.89 | | 4498.33 ± 36.80 | 2505.76 ± 36.80 | 4498.37 ± 36.81 |
| tg32 @ d16384 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| ctx_pp @ d32768 | 808.63 ± 0.54 | | 42515.50 ± 26.97 | 40522.93 ± 26.97 | 42515.54 ± 26.97 |
| ctx_tg @ d32768 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| pp2048 @ d32768 | 603.59 ± 3.46 | | 5385.73 ± 19.47 | 3393.16 ± 19.47 | 5385.77 ± 19.47 |
| tg32 @ d32768 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| ctx_pp @ d65535 | 590.79 ± 2.29 | | 112921.29 ± 430.42 | 110928.72 ± 430.42 | 112921.34 ± 430.42 |
| ctx_tg @ d65535 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| pp2048 @ d65535 | 396.64 ± 2.60 | | 7156.22 ± 33.75 | 5163.65 ± 33.75 | 7156.26 ± 33.75 |
| tg32 @ d65535 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| ctx_pp @ d100000 | 463.92 ± 0.85 | | 217546.68 ± 397.46 | 215554.11 ± 397.46 | 217546.72 ± 397.46 |
| ctx_tg @ d100000 | 0.61 ± 0.00 | 1.00 ± 0.00 | | | |
| pp2048 @ d100000 | 288.37 ± 2.29 | | 9094.98 ± 56.62 | 7102.40 ± 56.62 | 9095.02 ± 56.63 |
| tg32 @ d100000 | 0.61 ± 0.00 | 1.00 ± 0.00 | | | |

The MXFP4 container has been tuned for gpt-oss-120b only and is not guaranteed to work with other models. I would recommend AWQ or INT4 AutoRound (if that quant exists).

I haven’t tested NVFP4 with the most recent build yet; there are some improvements in FlashInfer and vLLM, so it may be a viable candidate too.

With QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ I was getting 26 t/s on dual Sparks.
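
For anyone who wants to try the same route, here is a minimal sketch of serving that AWQ quant (flags borrowed from the other configs in this thread; a starting point, not a verified tuned setup):

vllm serve QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000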


Thank you so much for the clarification! That explains why I was struggling with MXFP4. 26 t/s is impressive—I will try the QuantTrio AWQ model and update the vLLM build as you suggested. Appreciate the help!


Hey Eugr, thanks again for the previous tips!

I’m now trying to run Qwen3.5 397B (MoE) using the AWQ version. However, I’m running into version mismatch issues between vLLM and Transformers in my current environment.

Should I attempt to manually rebuild/overwrite the Dockerfile with the latest versions of vLLM and Transformers? Or do you have a recommendation for a more stable quantization format or a specific vLLM build/branch that is better optimized for this 397B scale on a dual-node setup?

I’d love to hear your thoughts on the best stack to benchmark this beast. Thanks!

You need to run ./build-and-copy.sh with the --tf5 flag. You should also use the -t argument to give it a tag.
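
For example (vllm-node-tf5 is the tag used for the cluster launch later in this thread):

./build-and-copy.sh --tf5 -t vllm-node-tf5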


Hi again! Thanks for the tip on ./build-and-copy.sh --tf5. It worked perfectly, and I can now load the model.

However, I’ve hit a new bottleneck: OOM (Out of Memory) during the cache_block allocation phase for the Qwen 3.5 397B AWQ model.

My setup is a Dual DGX Spark (256GB total VRAM). With the model weights taking up a huge chunk of memory, there isn’t much left for the KV Cache.

What is the best way to optimize the memory footprint for this 397B scale on 2 nodes? Should I lower the gpu_memory_utilization, or is there a specific max_model_len or --kv-cache-dtype fp8 setting you recommend to fit this into 256GB VRAM? Thanks for your support!

Check out my thread on the OOM and Qwen3.5-397B; I had been fighting this for the last 3 days.


Happy to help. I’m not sure what else you are running on the boxes, but here is what I used to start the model on my units:

./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-autoround \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
    --chat-template unsloth.jinja \
    --max-model-len 128000 \
    --gpu-memory-utilization 0.85 \
    --port 8555 \
    --host 0.0.0.0 \
    --kv_cache_dtype fp8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-prefix-caching \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    -tp 2 \
    --distributed-executor-backend ray

A couple of things to try:
Set --kv_cache_dtype fp8.

Try a lower --max-model-len and work your way up. I know I can start at 128K tokens with Open WebUI and a couple of MCP servers running, but not much else.
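
For that probing, a rough sketch of a first attempt (run through the same launch-cluster.sh wrapper as above; 65536 is just an illustrative starting point, not a tuned value):

vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
  --max-model-len 65536 \
  --kv_cache_dtype fp8 \
  --gpu-memory-utilization 0.85 \
  -tp 2 \
  --distributed-executor-backend ray

If that allocates the KV cache cleanly, restart with a larger --max-model-len (96K, then 128K) until it no longer fits.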

Don’t use --load-format fastsafetensors at this GPU memory utilization. Good luck.
