vLLM on GB10: gpt-oss-120b MXFP4 slower than SGLang/llama.cpp... what’s missing?

christopher_owen · January 6, 2026, 4:38pm

Thanks for the reply! That matches what I’m seeing (NVFP4/MXFP4 underperforming vs AWQ 4-bit on GB10).

Do you know which specific kernel path is missing/slow on sm121 (MoE group GEMM vs attention vs packing/padding)? Also, are there recommended vLLM/FlashInfer/Triton/PyTorch/CUDA versions for Spark right now, and any upstream issues/PRs to track?

I can test patches and provide Nsight traces.
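To make version comparisons easier, here is a minimal sketch of how the installed versions of the relevant packages could be collected and pasted alongside a trace. The distribution names in `PACKAGES` are my assumptions (for example, FlashInfer may be installed under a different name in your image), so adjust them as needed.

```python
from importlib import metadata

# Distribution names are assumptions; adjust them to match your environment.
PACKAGES = ("vllm", "flashinfer-python", "triton", "torch")


def collect_versions(packages=PACKAGES):
    """Return a dict mapping each distribution name to its version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

This only reads package metadata, so it runs even when the GPU stack itself fails to import.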

I’m also willing to try contributing, but I could use some direction. I wrote a patch to address the sm121 gating, but that’s clearly not enough: Run VLLM in Spark - #118 by christopher_owen

There is also this contribution, new today: feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support by seli-equinix · Pull Request #31740 · vllm-project/vllm · GitHub

There is also this contribution, which didn’t appear to be well received: [Bugfix] Add SM 12.1 support by ohsono · Pull Request #31607 · vllm-project/vllm · GitHub

On the FlashInfer side, we see:

github.com/flashinfer-ai/flashinfer

feat: initial support for SM103, SM110, SM120, SM121 (#1608)

committed 08:12PM - 02 Sep 25 UTC

aleozlx

+1719 -914

Co-authored-by: Vincent Huang <vincenth@nvidia.com> Co-authored-by: Yong Wu <yowu@nvidia.com> Co-authored-by: Sunghyun Park <sunghyunp@nvidia.com> Co-authored-by: Yunzhe Qiu <yunzheq@nvidia.com> Co-authored-by: Brian Ryu <bryu@nvidia.com> Co-authored-by: Ka-Hyun Nam <knam@nvidia.com> Co-authored-by: yzh119 <zihaoy@nvidia.com> Co-authored-by: Zihao Ye <expye@outlook.com>

github.com/flashinfer-ai/flashinfer

[NVIDIA] Thor & Spark Support (#2028)

committed 09:54AM - 13 Nov 25 UTC

johnnynunez

+17 -7

## 📌 Description
Thor and Spark support when wheels are generating

## 🔍 Related Issues
Output says that is not compatible. Only with JIT is working.

## Summary by CodeRabbit
* **New Features**
  * Broadened GPU architecture support to include additional newer architectures.
* **Documentation**
  * Updated README and installation docs to show the revised CUDA architecture example list.
* **Chores**
  * Adjusted release/nightly workflows and build scripts to select architectures using an expanded CUDA-version threshold and branching logic.
* **Performance**
  * Extended architecture-specific build/runtime handling to cover an additional GPU architecture affecting memory-related behavior.

---------
Co-authored-by: Zihao Ye <expye@outlook.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>

(For example, I don’t understand why there are different build strings for CUDA ‘< 13.0’ and the rest, or what the difference is between 12.0a and 12.0f.)

github.com/flashinfer-ai/flashinfer

.github/workflows/release.yml

5a8bcf663


      
          
          - name: Checkout code
            uses: actions/checkout@v4
            with:
              ref: ${{ github.event_name == 'pull_request' && github.head_ref || inputs.tag }}
              submodules: true
          
          - name: Build wheel in container
            env:
              DOCKER_IMAGE: ${{ matrix.arch == 'aarch64' && format('pytorch/manylinuxaarch64-builder:cuda{0}', matrix.cuda) || format('pytorch/manylinux2_28-builder:cuda{0}', matrix.cuda) }}
              FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda < '13.0' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f' }}
            run: |
              # Extract CUDA major and minor versions
              CUDA_MAJOR=$(echo "${{ matrix.cuda }}" | cut -d'.' -f1)
              CUDA_MINOR=$(echo "${{ matrix.cuda }}" | cut -d'.' -f2)
              export CUDA_MAJOR
              export CUDA_MINOR
              export FLASHINFER_LOCAL_VERSION="cu${CUDA_MAJOR}${CUDA_MINOR}"
          
              chown -R $(id -u):$(id -g) ${{ github.workspace }}
              mkdir -p ${{ github.workspace }}/ci-cache
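To see exactly what changes across the CUDA threshold, here is a small sketch that mirrors the `FLASHINFER_CUDA_ARCH_LIST` selection from the workflow excerpt and diffs the two lists. The function names are hypothetical, and comparing versions as numeric `(major, minor)` tuples is my assumption about the intent of the `matrix.cuda < '13.0'` expression.

```python
# Architecture lists copied from the FLASHINFER_CUDA_ARCH_LIST expression above.
ARCHS_BEFORE_CUDA_13 = "7.5 8.0 8.9 9.0a 10.0a 12.0a"
ARCHS_FROM_CUDA_13 = "7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"


def arch_list_for(cuda_version: str) -> list[str]:
    """Pick the arch list for a CUDA version string such as '12.8' or '13.0'."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    # Assumption: the workflow intends a numeric comparison against 13.0.
    if (major, minor) < (13, 0):
        return ARCHS_BEFORE_CUDA_13.split()
    return ARCHS_FROM_CUDA_13.split()


def diff_arch_lists(old: list[str], new: list[str]) -> tuple[list[str], list[str]]:
    """Return (entries only in new, entries only in old), preserving order."""
    added = [arch for arch in new if arch not in old]
    dropped = [arch for arch in old if arch not in new]
    return added, dropped


if __name__ == "__main__":
    added, dropped = diff_arch_lists(arch_list_for("12.8"), arch_list_for("13.0"))
    print("added:", added)
    print("dropped:", dropped)
```

So the CUDA 13 builds add 10.3a and 11.0a, and swap 12.0a for 12.0f. Neither list has an explicit 12.1 entry, which is why the meaning of the `a` and `f` suffixes matters for sm121.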
