GB10 (SM12.1) vLLM FP8 inference — any progress on native SM12.1 kernels?

Running Nemotron 3 Super 120B (NVFP4/modelopt_mixed) on a DGX Spark (GB10, SM12.1) with vllm/vllm-openai:v0.17.1. Got it working, but had to work around a nasty silent corruption issue: FlashInferFP8ScaledMMLinearKernel and CutlassFP8ScaledMMLinearKernel are compiled for SM12.0 at most — on SM12.1 they pass the is_supported() check but produce all-NaN logits at runtime (no Python exception, just garbage tokens). The fix was to patch those two is_supported() methods to reject SM12.1, forcing fallback to PerTensorTorchFP8ScaledMMLinearKernel (torch._scaled_mm / cuBLAS), which works correctly but only gives ~15 tokens/s rather than the 24+ you’d expect from native FP8.
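For anyone wanting to reproduce the workaround, the guard logic boils down to something like this (a standalone sketch, not the actual vLLM patch — the real change edits `is_supported()` inside `FlashInferFP8ScaledMMLinearKernel` and `CutlassFP8ScaledMMLinearKernel`):

```python
def fp8_kernel_is_supported(capability: tuple[int, int]) -> bool:
    """Sketch of the extra check patched into both is_supported() methods.

    `capability` is (major, minor) as returned by
    torch.cuda.get_device_capability().
    """
    if capability == (12, 1):
        # The FlashInfer/CUTLASS FP8 kernels are built for SM12.0 at most;
        # on SM12.1 they run but emit all-NaN logits, so reject the device
        # here and let vLLM fall back to PerTensorTorchFP8ScaledMMLinearKernel.
        return False
    # FP8 hardware support starts with Ada (SM8.9) / Hopper (SM9.0).
    return capability >= (8, 9)
```

With this in place the model produces correct tokens again, just via the slower cuBLAS `torch._scaled_mm` path.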

My gut says there may be a special build or a recent update somewhere that properly targets SM12.1 — either a flashinfer wheel built against SM12.1, a custom vLLM image, or a pending upstream PR. Has anyone found a path to native SM12.1 FP8 kernels? Would love to know if there’s a build I’m missing before I go down the route of compiling flashinfer from source against SM_121.

I did scour the forum but wasn't able to find anything. Maybe I didn't look in the right places?

Thanks for any help!



You can check out NVIDIA's official vLLM image on NGC, or the community container at https://github.com/eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks).


I did check it back then; my understanding is that we're still not running NVFP4 with hardware acceleration.


If you’re running vLLM on a DGX Spark (GB10, SM12.1) and hitting this during model warmup:

This kernel only supports sm120.
cudaErrorLaunchFailure: unspecified launch failure

We found the root cause and fixed it.


What’s happening

vLLM’s Blackwell FP8 CUTLASS kernel wrappers use a guard struct called enable_sm120_only in csrc/cutlass_extensions/common.hpp:

#if __CUDA_ARCH__ == 1200
    // runs on SM12.0
#else
    asm("trap;");   // kills SM12.1 (__CUDA_ARCH__ == 1210)
#endif
GB10 is SM12.1, not SM12.0. When you add 12.1f to SCALED_MM_ARCHS (which you must, for NVFP4 acceleration), nvcc compiles sm_121f cubins with __CUDA_ARCH__ = 1210. Every FP8 GEMM call hits the trap. The crash fires ~40 times during Torch Inductor warmup and the model never starts.

The irony: enable_sm120_family already exists in the same file with the correct guard (>= 1200 && < 1300). It just wasn’t being used.
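To make the difference concrete, here are the two guards' conditions restated as plain Python over the `__CUDA_ARCH__` value (illustrative only; the real guards are C++ wrapper structs in common.hpp):

```python
def sm120_only(cuda_arch: int) -> bool:
    # enable_sm120_only: exact match, so SM12.1 (1210) hits the trap path
    return cuda_arch == 1200

def sm120_family(cuda_arch: int) -> bool:
    # enable_sm120_family: accepts the whole SM12.x family
    return 1200 <= cuda_arch < 1300
```

On GB10 (`__CUDA_ARCH__ == 1210`) the first check fails and the kernel traps; the second passes.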


The fix

Two files in vLLM source need one substitution each:

sed -i 's/enable_sm120_only/enable_sm120_family/g' \
    csrc/quantization/w8a8/cutlass/c3x/scaled_mm.cuh \
    csrc/quantization/w8a8/cutlass/c3x/scaled_mm_sm120_fp8_dispatch.cuh

Or use the patch script if you’re building in Docker.


Result

Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod  ✅
Using NvFp4LinearBackend.VLLM_CUTLASS for NVFP4 GEMM               ✅
Application startup complete.                                         ✅

Tested on Nemotron-3-Super-120B-A12B-NVFP4, 10-minute soak at 32 concurrent requests, zero errors.


Why did it ever work before?

Earlier vLLM builds (including v0.17.1) didn’t have 12.1f in SCALED_MM_ARCHS, so no sm_121f cubins were compiled. CUDA silently fell back to PTX JIT from the sm_120f binary, which was compiled with __CUDA_ARCH__ = 1200 — the guard passed. Once you add 12.1f (needed for NVFP4), the bug surfaces.
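A simplified mental model of why the bug stayed hidden (a toy sketch of cubin selection and PTX JIT fallback, not the actual CUDA driver logic):

```python
def select_binary(device_arch: int, embedded: dict[str, list[int]]) -> str:
    """Pick an exact-match cubin if one exists, else JIT the best PTX.

    `embedded` maps "cubin"/"ptx" to the __CUDA_ARCH__ values the
    binaries were compiled for.
    """
    if device_arch in embedded.get("cubin", []):
        return f"cubin compiled with __CUDA_ARCH__={device_arch}"
    ptx = [a for a in embedded.get("ptx", []) if a <= device_arch]
    if ptx:
        return f"ptx-jit compiled with __CUDA_ARCH__={max(ptx)}"
    raise RuntimeError("no compatible binary for this device")

# Before the build change: only sm_120 binaries exist, so SM12.1 JITs
# from PTX compiled with __CUDA_ARCH__ = 1200 and the `== 1200` guard
# passes. After adding 12.1f: an exact 1210 cubin exists, the guard
# sees 1210, and every FP8 GEMM traps.
```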


Patches and full context

👉 https://github.com/saifgithub/vllm-gb10-sm121

Huge credit to Avarok Cybersecurity and eugr for the foundational GB10 bring-up work. This fix is the last piece that was blocking native FP8 on SM12.1.
