I’m using MXFP4 model weights and I’m specifically looking to ensure vLLM uses GB10’s native FP4 tensor-core kernels rather than slower fallbacks. I would like to get vLLM to the same performance class as the SGLang feature branch or recent llama.cpp improvements. At the moment, vLLM runs, but it …

Announcing: vLLM + native MXFP4 for gpt-oss-120b on DGX Spark (SM121/GB10) - reproducible Docker setup
I’m sharing a repo that packages a working, reproducible Docker environment for running OpenAI’s gpt-oss-120b (MXFP4 weights) on NVIDIA DGX Spark (SM121 / GB10) while moving towards the intended na…

Thanks for the reply! That matches what I’m seeing (NVFP4/MXFP4 underperforming vs AWQ 4-bit on GB10). Do you know which specific kernel path is missing or slow on sm121 (MoE group GEMM vs attention vs packing/padding)? Also, are there recommended vLLM/FlashInfer/Triton/PyTorch/CUDA versions for…
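For anyone comparing notes on versions, a small script like the one below can dump what is installed so results are easier to reproduce. This is only a sketch: the distribution names in `PACKAGES` (in particular `flashinfer-python`) are assumptions and may differ in your container, and the CUDA toolkit version is not covered here.

```python
from importlib import metadata

# Distribution names are assumptions; adjust them to match your environment.
PACKAGES = ["vllm", "torch", "triton", "flashinfer-python"]


def collect_versions(packages=PACKAGES):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed under this distribution name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Reading the metadata instead of importing the packages keeps the check cheap and avoids initializing CUDA just to print a version string.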

If you want to go deeper into the rabbit hole, you can look at the optimizations the SGLang guys made in Triton and in SGLang itself. I built Triton with their changes locally, but that alone didn’t improve performance in vLLM, even after I managed to get the Triton backend working instead of FLASH_ATTN. SGLang…

christopher_owen: "There is also this contribution, new today: feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support by seli-equinix · Pull Request #31740 · vllm-project/vllm"
Well, this contribution doesn’t seem to add anything, as I’m getting the same speeds right now. …

This commit seems to resolve only the ‘gating’ in vLLM (as my earlier attempt tried to do). The PR did say that it leads to TRITON_ATTN, though. I haven’t had a chance to run that patch yet. When you ran it, was it still using the Marlin kernel, or did it start using TRITON_ATTN for you? That…

christopher_owen: "When you ran it, was it still using the Marlin kernel, or did it start using TRITON_ATTN for you?"
Yes, it used the Triton backend with pretty much the same performance, so I abandoned this route for now.

Following up on this discussion with an update on my vLLM work for DGX Spark / SM121:
What I implemented
- Removed some Spark-related feature gating in vLLM.
- Wrote a CUTLASS-based attention kernel.
- Wrote a block-scaled FP8×FP4 MoE GEMM in CUTLASS (FP8 activations × MXFP4 weights).
- Integra…
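For readers less familiar with what "block-scaled" means for MXFP4 weights, here is a minimal reference sketch of the number format as I understand it from the OCP Microscaling (MX) spec: 4-bit E2M1 elements in blocks of 32, each block sharing one 8-bit E8M0 power-of-two scale. This is not the CUTLASS kernel described above and says nothing about how it lays out data; it works on already-unpacked 4-bit codes because the nibble packing order is an assumption I did not want to bake in. The function names are made up for illustration.

```python
# Magnitudes of the eight E2M1 codes (low 3 bits); bit 3 is the sign bit.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements sharing one scale in MXFP4


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_VALUES[code & 0x7]


def decode_e8m0(scale_byte):
    """Decode the shared E8M0 scale: 2**(byte - 127), with 0xFF meaning NaN."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def dequantize_block(codes, scale_byte):
    """Dequantize one block of 32 unpacked E2M1 codes with its shared scale."""
    if len(codes) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} codes, got {len(codes)}")
    scale = decode_e8m0(scale_byte)
    return [decode_e2m1(code) * scale for code in codes]
```

A native FP4 path consumes the codes and scales directly in the tensor cores; a fallback path has to do something equivalent to this dequantization first, which is one plausible source of the gap discussed in this thread.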

Have you had a chance to look into the changes the SGLang guys made for the Spark marketing campaign? See GitHub - yvbbrjdr/sglang at spark; those look like the key changes (plus enabling the Triton kernels).

Thank you for the pointer. Below is my analysis of what’s driving SGLang’s success. There’s quite a bit to learn from this. One key takeaway is that it’s not only which engine is used (in this case, they’re using Triton), but also how it’s used. Their implementation has useful ideas on both fronts. …

christopher_owen: "If anyone is interested in real-time collaboration, I can make myself available on the vLLM Slack to show what I have. At some point I plan to make a nice Dockerfile (inspired by eugr) to tie this all together, but what I have now isn’t there."
Interesting… There are …

vLLM on GB10: gpt-oss-120b MXFP4 slower than SGLang/llama.cpp... what’s missing?


raphael.amorim January 6, 2026, 3:13pm 2

NVFP4/FP4 isn’t being “properly utilized” on DGX Spark (GB10 / sm121) in current vLLM builds, so NVFP4 quants can be slower than AWQ 4-bit on the same workload. FP4 kernels / NVFP4 paths are better optimized for sm120 (RTX 50xx / RTX Pro 6000) than for Spark’s sm121. So, the summary is: installs got way smoother (cu130 wheels + better Docker tooling + cluster scripts), but NVFP4 performance on Spark still isn’t quite there yet.
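Since the sm120 vs sm121 distinction matters here, a quick way to confirm which one a process actually sees is to query the compute capability through PyTorch. A minimal sketch, assuming PyTorch is installed with CUDA support; `sm_tag` and `detect_sm_tag` are illustrative helper names, not part of vLLM or PyTorch.

```python
def sm_tag(capability):
    """Format a (major, minor) compute capability as an 'smXY' tag."""
    major, minor = capability
    return f"sm{major}{minor}"


def detect_sm_tag(device_index=0):
    """Return the tag for a visible GPU, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # get_device_capability returns a (major, minor) tuple, e.g. (12, 1).
    return sm_tag(torch.cuda.get_device_capability(device_index))


if __name__ == "__main__":
    print(detect_sm_tag() or "no CUDA device visible")
```

On a Spark this should report sm121, while RTX 50xx / RTX Pro 6000 cards report sm120, so it is worth checking before assuming a kernel tuned for one is being selected on the other.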
