vLLM on GB10: gpt-oss-120b MXFP4 slower than SGLang/llama.cpp... what’s missing?

Thanks for the reply! That matches what I’m seeing (NVFP4/MXFP4 underperforming vs AWQ 4-bit on GB10).

Do you know which specific kernel path is missing/slow on sm121 (MoE group GEMM vs attention vs packing/padding)? Also, are there recommended vLLM/FlashInfer/Triton/PyTorch/CUDA versions for Spark right now, and any upstream issues/PRs to track?
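To make version comparisons easier, here is a minimal sketch of how I would collect the relevant versions and the reported compute capability in one place. The distribution names in `PACKAGES` are my assumptions (in particular `flashinfer-python`) and may not match how the packages were installed on a given Spark image, so adjust them to whatever `pip list` shows.

```python
import importlib.metadata as md

# Assumed distribution names; adjust to match `pip list` on your system.
PACKAGES = ["vllm", "flashinfer-python", "triton", "torch"]


def collect_versions(packages=PACKAGES):
    """Return {distribution name: version string or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def gpu_info():
    """Describe GPU 0 via PyTorch, or explain why that is not possible."""
    try:
        import torch
    except ImportError:
        return "torch not importable"
    if not torch.cuda.is_available():
        return "CUDA not available"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    # torch.version.cuda is the CUDA version PyTorch was built against.
    return f"{name}, sm_{major}{minor}, torch built with CUDA {torch.version.cuda}"


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
    print(gpu_info())
```

Pasting this output alongside a trace should make it easier to tell whether two people are actually running the same stack.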

I can test patches and provide Nsight traces.

I’m also willing to try contributing, but I lack a little direction. I attempted a patch to address sm121 gating, but that’s clearly not enough: Run VLLM in Spark - #118 by christopher_owen

There is also this contribution, new today: feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support by seli-equinix · Pull Request #31740 · vllm-project/vllm · GitHub

There is also this contribution, which doesn’t appear to have been well received: [Bugfix] Add SM 12.1 support by ohsono · Pull Request #31607 · vllm-project/vllm · GitHub

On the FlashInfer side, we see:

(For example, I don’t understand why there are different build strings for CUDA ‘< 13.0’ and the rest, or what the difference is between 12.0a and 12.0f.)