BTW, the FlashInfer implementation lets you fit more context (actually 2x more than with FLASH_ATTN), at least for the FP8 model.
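For anyone who wants to try it, switching backends in vLLM is just an environment variable. A minimal sketch (the model name and context length are placeholders, and whether FP8 KV cache frees additional memory depends on your setup):

```python
import os

# Select the FlashInfer attention backend; vLLM reads this env var
# at startup, so set it before importing vllm.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(
    model="your-org/your-fp8-model",  # placeholder: any FP8-quantized checkpoint
    max_model_len=65536,              # placeholder: raise this to use the freed memory
    kv_cache_dtype="fp8",             # optional: quantize the KV cache as well
)

print(llm.generate("Hello, Spark!"))
```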