I’m using MXFP4 model weights and I’m specifically looking to ensure vLLM uses GB10’s native FP4 tensor-core kernels rather than slower fallbacks. I would like to get vLLM into the same performance class as the SGLang feature branch or recent llama.cpp improvements.
At the moment vLLM runs, but it looks like we’re not hitting the native FP4 (NVFP4) / Blackwell-optimized MoE GEMM path. My working hypothesis is that vLLM is either:

- falling back to a slower Marlin/weight-only FP4 path, or
- not enabling the intended FlashInfer/CUTLASS grouped-GEMM backend for MXFP4 MoE on sm_121a, or
- losing performance to GB10-specific toolchain/backend gating (Triton/PTXAS, attention backend selection, async scheduling / CUDA graphs, etc.).
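To check which of these paths actually gets selected, the most direct thing I’ve found is to run the server with verbose logging and grep the startup output for backend names. A minimal sketch, assuming the chosen backend shows up in the logs under a recognizable name (the hint substrings below are guesses, not confirmed vLLM log messages, and the model id is a placeholder):

```python
import os
import subprocess

# Run `vllm serve` with debug logging and surface any startup lines that
# mention a quantization / MoE kernel backend. Interrupt once the engine
# finishes loading; the interesting lines all appear during startup.
env = dict(os.environ, VLLM_LOGGING_LEVEL="DEBUG")
proc = subprocess.Popen(
    ["vllm", "serve", "openai/gpt-oss-120b"],  # placeholder model id
    env=env,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)

# Guesses at substrings the selected backend might log -- not confirmed names.
HINTS = ("marlin", "flashinfer", "cutlass", "mxfp4", "fp4")
for line in proc.stdout:
    if any(hint in line.lower() for hint in HINTS):
        print(line, end="")
```

If the output is dominated by Marlin-flavored lines rather than FlashInfer/CUTLASS ones, that would point at the weight-only fallback in the first bullet above.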
What I’m looking for:
- What is the intended “fast path” on GB10 for gpt-oss-120b MXFP4 in vLLM (MoE GEMM + attention)?
- Which versions of vLLM / FlashInfer / Triton / PyTorch / CUDA are currently recommended on GB10 to get that fast path?
- Are there known backend-gating or shape/padding/packing constraints on GB10 that prevent MXFP4 MoE from selecting the fastest kernels?
- If I want to contribute, what’s the highest-impact area:
  - enabling/validating FlashInfer/CUTLASS MXFP4 MoE on sm_121a,
  - fixing Triton toolchain/ptxas issues for sm_121a,
  - or vLLM runtime/scheduler issues (async scheduling, batch queue, CUDA graphs)?
- I can run A/B benchmarks, collect logs, and provide Nsight traces. If there’s a specific checklist to confirm we’re on the “native FP4” path (expected log messages, env vars, kernels to look for), I’ll follow it and report back; sketches of the version report and kernel scan I’d attach are below.
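To make the version question concrete, this is roughly the environment report I’d attach to every A/B run; just a sketch built on standard `__version__` attributes:

```python
"""Collect the toolchain versions relevant to the GB10 fast-path question."""
import importlib

import torch

print(f"torch       : {torch.__version__}")
print(f"cuda (torch): {torch.version.cuda}")
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"device      : {torch.cuda.get_device_name(0)} (sm_{major}{minor})")

# Optional packages: report the version if installed, otherwise note absence.
for pkg in ("vllm", "triton", "flashinfer"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg:<12}: {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg:<12}: not installed")
```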
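And for the kernel-level confirmation, I’d capture a short serving run with `nsys profile -o gb10_fp4 --trace=cuda,nvtx vllm serve …` and then scan the kernel summary for FP4/Marlin/CUTLASS names. A sketch, with the caveat that the report name and the kernel-name substrings are assumptions that may differ across nsys and vLLM versions:

```python
"""Summarize which GEMM kernels an nsys trace actually executed.

Assumes a trace named gb10_fp4.nsys-rep was captured first (see the
`nsys profile` command above).
"""
import subprocess

result = subprocess.run(
    ["nsys", "stats", "--report", "cuda_gpu_kern_sum", "gb10_fp4.nsys-rep"],
    capture_output=True,
    text=True,
    check=True,
)

# Guesses at substrings that would identify the MoE GEMM kernels of interest.
HINTS = ("marlin", "cutlass", "fp4", "mxfp4", "grouped")
for line in result.stdout.splitlines():
    if any(hint in line.lower() for hint in HINTS):
        print(line)
```

Seeing grouped CUTLASS FP4 GEMMs at the top of that summary, rather than Marlin dequant kernels, is what I’d read as the “native FP4” signal.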
Thanks! I am happy to help test or upstream fixes.