Compiling, thanks.
Here I go testing again:
Please don't get my hopes up again jimmy, I've been crashing on nvfp4 for seemingly a hundred years.
I'll report back in the morning if it crashed overnight.
Update: still crashing. I also found this issue tracking the nvfp4 FlashInfer crashes:
Hi everyone,
I am currently benchmarking a Dual DGX Spark cluster using the amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 model with vLLM.
Despite the high-end hardware, I am experiencing very low performance, averaging only about 1 token per second (t/s). I suspect there is a bottleneck in my configuration or the multi-node setup.
Below is the recipe and configuration I am using:
Configuration Details:

- Model: amd/Qwen3-235B-A22B-Instruct-2507-MXFP4
- Quantization: MXFP4
- Backend: vLLM with FlashInfer
- Tensor Parallelism (TP): 2
- Hardware: Dual DGX Spark (connected via ConnectX-7 200Gb/s)
Recipe:
recipe_version: '1'
name: Qwen3 235B A22B Instruct 2507 MXFP4
description: vLLM serving amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 with MXFP4 quantization and FlashInfer
model: amd/Qwen3-235B-A22B-Instruct-2507-MXFP4
container: vllm-node-mxfp4
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_num_batched_tokens: 8192
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: '1'
command: |
  vllm serve amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size {tensor_parallel} \
    --distributed-executor-backend ray \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --host {host} \
    --port {port}
Questions:

- With a dual-node setup connected via a single 200Gb/s link, is 1 t/s expected for a model of this size (235B) when using tensor_parallel: 2?
- Are there any specific vLLM flags or environment variables I should tune to minimize inter-node communication latency for this single-cable interconnect?
- Given the hardware constraints (2 nodes, 200Gb/s interconnect), are there other high-parameter models that are known to be better optimized for this type of multi-node distribution?
- Would adjusting max_num_batched_tokens or other memory-related settings help improve throughput without changing the tensor parallel size?

Any guidance on how to optimize this for better performance would be greatly appreciated!
Here are my benchmark results:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 | 1207.23 ± 68.86 | 3694.78 ± 101.16 | 1702.21 ± 101.16 | 3694.83 ± 101.16 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d4096 | 1338.31 ± 4.75 | 5053.18 ± 10.89 | 3060.60 ± 10.89 | 5053.21 ± 10.89 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d4096 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d4096 | 1108.37 ± 5.21 | 3840.37 ± 8.66 | 1847.80 ± 8.66 | 3840.40 ± 8.65 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d4096 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d8192 | 1356.39 ± 4.06 | 8032.20 ± 18.08 | 6039.62 ± 18.08 | 8032.24 ± 18.07 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d8192 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d8192 | 1014.40 ± 12.06 | 4011.79 ± 24.03 | 2019.22 ± 24.03 | 4011.83 ± 24.03 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d8192 | 0.63 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d16384 | 1043.30 ± 1.05 | 17696.62 ± 15.86 | 15704.04 ± 15.86 | 17696.66 ± 15.85 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d16384 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d16384 | 817.49 ± 11.89 | 4498.33 ± 36.80 | 2505.76 ± 36.80 | 4498.37 ± 36.81 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d16384 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d32768 | 808.63 ± 0.54 | 42515.50 ± 26.97 | 40522.93 ± 26.97 | 42515.54 ± 26.97 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d32768 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d32768 | 603.59 ± 3.46 | 5385.73 ± 19.47 | 3393.16 ± 19.47 | 5385.77 ± 19.47 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d32768 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d65535 | 590.79 ± 2.29 | 112921.29 ± 430.42 | 110928.72 ± 430.42 | 112921.34 ± 430.42 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d65535 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d65535 | 396.64 ± 2.60 | 7156.22 ± 33.75 | 5163.65 ± 33.75 | 7156.26 ± 33.75 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d65535 | 0.62 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_pp @ d100000 | 463.92 ± 0.85 | 217546.68 ± 397.46 | 215554.11 ± 397.46 | 217546.72 ± 397.46 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | ctx_tg @ d100000 | 0.61 ± 0.00 | 1.00 ± 0.00 | | | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | pp2048 @ d100000 | 288.37 ± 2.29 | 9094.98 ± 56.62 | 7102.40 ± 56.62 | 9095.02 ± 56.63 | |
| amd/Qwen3-235B-A22B-Instruct-2507-MXFP4 | tg32 @ d100000 | 0.61 ± 0.00 | 1.00 ± 0.00 | | | |
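As a sanity check on whether the single 200Gb/s link alone could explain ~0.6 t/s decode, here is a rough back-of-envelope sketch. Every number in it is an assumption rather than a measurement: the layer count and hidden size are ballpark figures for Qwen3-235B-A22B, and the all-reduce latency is a guess for one ConnectX-7 hop.

```python
# Back-of-envelope: can the single 200Gb/s link alone explain ~0.6 t/s?
# All numbers below are assumptions, not measurements.

LAYERS = 94                 # assumed transformer layer count
ALLREDUCES_PER_LAYER = 2    # typical TP: one after attention, one after the MLP
HIDDEN = 4096               # assumed hidden size
BYTES_PER_ELEM = 2          # bf16 activations

link_bw = 200e9 / 8         # 200 Gb/s -> bytes per second
per_msg_latency = 50e-6     # assumed small-message all-reduce latency (s)

msg_bytes = HIDDEN * BYTES_PER_ELEM  # one decoded token's activations
per_token_comm = LAYERS * ALLREDUCES_PER_LAYER * (
    per_msg_latency + msg_bytes / link_bw)

print(f"comm time per decoded token: {per_token_comm * 1e3:.2f} ms")
print(f"interconnect-only ceiling:   {1 / per_token_comm:.0f} t/s")
```

Even with these pessimistic latency assumptions, the predicted ceiling is roughly two orders of magnitude above the observed 0.63 t/s, which would point at the kernel/quantization path rather than the interconnect.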
The MXFP4 container has been tuned for gpt-oss-120b only and is not guaranteed to work with other models. I would recommend AWQ or INT4 AutoRound (if that quant exists).
I haven't tested NVFP4 with the most recent build yet - there are some improvements in flashinfer and vllm, so it may be a viable candidate too.
With QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ I was getting 26 t/s on dual Sparks.
Thank you so much for the clarification! That explains why I was struggling with MXFP4. 26 t/s is impressive! I will try the QuantTrio AWQ model and update the vLLM build as you suggested. Appreciate the help!
Hey Eugr, thanks again for the previous tips!
I'm now trying to run Qwen3.5 397B (MoE) using the AWQ version. However, I'm running into version mismatch issues between vLLM and Transformers in my current environment.
Should I attempt to manually rebuild/overwrite the Dockerfile with the latest versions of vLLM and Transformers? Or do you have a recommendation for a more stable quantization format or a specific vLLM build/branch that is better optimized for this 397B scale on a dual-node setup?
Iād love to hear your thoughts on the best stack to benchmark this beast. Thanks!
You need to run ./build-and-copy.sh with the --tf5 flag. You should also use the -t argument to give it a tag.
Hi again! Thanks for the tip on ./build-and-copy.sh --tf5. It worked perfectly, and I can now load the model.
However, I've hit a new bottleneck: OOM (Out of Memory) during the cache_block allocation phase for the Qwen 3.5 397B AWQ model.
My setup is a Dual DGX Spark (256GB total VRAM). With the model weights taking up a huge chunk of memory, there isn't much left for the KV cache.
What is the best way to optimize the memory footprint for this 397B scale on 2 nodes? Should I lower the gpu_memory_utilization, or is there a specific max_model_len or --kv-cache-dtype fp8 setting you recommend to fit this into 256GB VRAM? Thanks for your support!
Check out my thread on the OOM and Qwen3.5-397B - I had been fighting this for the last 3 days.
Happy to help. I'm not sure what else you are running on the boxes, but here is what I used to start the model on my units:
./spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-autoround \
  --apply-mod ./spark-vllm-docker/mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
    --chat-template unsloth.jinja \
    --max-model-len 128000 \
    --gpu-memory-utilization 0.85 \
    --port 8555 \
    --host 0.0.0.0 \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-prefix-caching \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    -tp 2 \
    --distributed-executor-backend ray
A couple of things to try:

- Set --kv-cache-dtype fp8.
- Try a lower max-model-len and work your way up. I know I can start with 128K tokens with Open WebUI and a couple of MCP servers running, but not much else.
- Don't use --load-format fastsafetensors at this GPU memory utilization.

Good luck.
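To see why a lower max-model-len helps, a rough KV-cache sizing sketch is useful. The dimensions below are placeholders, not the real Qwen3.5-397B config: read num_hidden_layers, num_key_value_heads, and head_dim from the model's config.json and substitute them.

```python
# Rough KV-cache sizing to pick a --max-model-len that fits in the memory
# left after the weights. LAYERS / KV_HEADS / HEAD_DIM are PLACEHOLDERS;
# take the real values from the model's config.json.

LAYERS = 94          # placeholder layer count
KV_HEADS = 4         # placeholder GQA key/value head count
HEAD_DIM = 128       # placeholder head dimension
KV_DTYPE_BYTES = 1   # fp8 KV cache (use 2 for fp16/bf16)

# 2x for keys and values
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES

for max_len in (32_000, 64_000, 128_000):
    gib = max_len * bytes_per_token / 2**30
    print(f"--max-model-len {max_len:>7}: ~{gib:.1f} GiB of KV cache")
```

This is the arithmetic behind the "lower max-model-len and work your way up" advice: fp8 halves the per-token footprint, and tensor parallelism additionally splits the cache across the two nodes.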
A few more to watch; these are pretty exciting for a potentially large boost to decode/memory bandwidth.
Now if nvfp4 through flashinfer/cutlass would just stop crashing, we'd have a tangible improvement instead of more community "breakthroughs".
While this would definitely bring a huge boost, I would be hesitant to go below fp8 in KV cache as it is usually much more sensitive to quantization than the model weights.
What made you think it wouldnāt OOM on your 2 Sparks? QuantTrio/Qwen3.5-397B-A17B-AWQ is 244GB according to the HF repo page. Even if that left enough of the 256GB total memory for both OSes and VLLM (very close), it would leave nothing for context size, prompt cache, etc. Use the 199GB model josephbreda linked.
Update:

- NVFP4 marlin backend: working
- NVFP4 vLLM CUTLASS: working
- NVFP4 flashinfer_cutlass: illegal exceptions for days
For those who have started running NVFP4 models: just a heads up that the PR that was just merged currently needs the GPU arch set to 12.0f to compile without "no nvfp4 kernel" runtime errors.
Waiting for mgoin to complete his rip and tear through GitHub before I try my flashinfer_cutlass stress test again.
This will have been the 4th time I've gotten my hopes up.
Thanks, I've just run my nightly build; I'll rerun it again, I guess :)
Hmm... It is supposed to work correctly when built with the 12.1a arch, but gives me "NotImplementedError: No compiled nvfp4 quantization kernel".
EDIT: I guess I misread your post - you were warning exactly about this (facepalm).
My guess is the crashing/illegal instructions persist, but I'm willing to attempt again. For all I know, there are multiple bugs landing on the same result, judging by the amount of complexity in the kernel feature/selection process across CUTLASS, FlashInfer, vLLM, FlashAttention, etc.
But it is nice to see them make progress on actually adding a Spark to their CI/CD.
OK, Johnny submitted a PR that should fix it, compiling now: [NVIDIA] Fix DGX Spark logic by johnnynunez Ā· Pull Request #38126 Ā· vllm-project/vllm Ā· GitHub
This should have all been mainlined before the Spark was even available to purchase.
It's a sad state to see the latest Intel GPUs land with day-0 support when, a year later, we're still in Spark PR hell.