NVFP4/FP4 isn’t being “properly utilized” on DGX Spark (GB10 / sm121) in current vLLM builds, so NVFP4 quants can be slower than AWQ 4-bit on the same workload. The FP4 kernels and NVFP4 code paths are better optimized for sm120 (RTX 50xx / RTX Pro 6000) than for Spark’s sm121. In summary: installs have gotten much smoother (cu130 wheels, better Docker tooling, cluster scripts), but NVFP4 performance on Spark isn’t there yet.
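If you're not sure which of the two targets your box reports, here is a minimal sketch for checking. It assumes PyTorch is installed with CUDA support; `describe_fp4_target` is a hypothetical helper name of my own, not something from vLLM, and the labels it returns only restate the summary above rather than measure anything.

```python
def describe_fp4_target(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the architecture labels used in this thread.

    Hypothetical helper: the wording restates the summary above, it does not benchmark.
    """
    if (major, minor) == (12, 0):
        return "sm120 (RTX 50xx / RTX Pro 6000): the better-optimized FP4 target"
    if (major, minor) == (12, 1):
        return "sm121 (DGX Spark / GB10): NVFP4 may be slower than AWQ 4-bit"
    return f"sm{major}{minor}: not covered by this summary"


if __name__ == "__main__":
    try:
        import torch  # assumption: a CUDA-enabled PyTorch build is installed
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple for the given device
        print(describe_fp4_target(*torch.cuda.get_device_capability(0)))
    else:
        print("No CUDA device visible; cannot query compute capability.")
```

This only tells you which architecture you're on. To see whether NVFP4 or AWQ is actually faster for you, you still need to benchmark both quants on the same workload.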