GB10 (SM12.1) vLLM FP8 inference — any progress on native SM12.1 kernels?

Running Nemotron 3 Super 120B (NVFP4/modelopt_mixed) on a DGX Spark (GB10, SM12.1) with vllm/vllm-openai:v0.17.1. Got it working, but had to work around a nasty silent corruption issue: FlashInferFP8ScaledMMLinearKernel and CutlassFP8ScaledMMLinearKernel are compiled for SM12.0 at most — on SM12.1 they pass the is_supported() check but produce all-NaN logits at runtime (no Python exception, just garbage tokens). The fix was to patch those two is_supported() methods to reject SM12.1, forcing fallback to PerTensorTorchFP8ScaledMMLinearKernel (torch._scaled_mm / cuBLAS), which works correctly but only gives ~15 tokens/s rather than the 24+ you’d expect from native FP8.
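For anyone wanting to reproduce the workaround, the guard logic boils down to something like this (a standalone sketch, not the actual vLLM patch — the real change edits `is_supported()` inside `FlashInferFP8ScaledMMLinearKernel` and `CutlassFP8ScaledMMLinearKernel`):

```python
def fp8_kernel_is_supported(capability: tuple[int, int]) -> bool:
    """Sketch of the extra check patched into both is_supported() methods.

    `capability` is (major, minor) as returned by
    torch.cuda.get_device_capability().
    """
    if capability == (12, 1):
        # The FlashInfer/CUTLASS FP8 kernels are built for SM12.0 at most;
        # on SM12.1 they run but emit all-NaN logits, so reject the device
        # here and let vLLM fall back to PerTensorTorchFP8ScaledMMLinearKernel.
        return False
    # FP8 hardware support starts with Ada (SM8.9) / Hopper (SM9.0).
    return capability >= (8, 9)
```

With this in place the model produces correct tokens again, just via the slower cuBLAS `torch._scaled_mm` path.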

My gut says there may be a special build or a recent update somewhere that properly targets SM12.1 — either a flashinfer wheel built against SM12.1, a custom vLLM image, or a pending upstream PR. Has anyone found a path to native SM12.1 FP8 kernels? Would love to know if there’s a build I’m missing before I go down the route of compiling flashinfer from source against SM_121.

I did scour the forum but wasn't able to find anything. Maybe I didn't look in the right places?

Thanks for any help!



You can check out NVIDIA's official vLLM image on NGC, or the community container at https://github.com/eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks).


I did check it back then; my understanding is that we're still not running NVFP4 with hardware acceleration.


If you’re running vLLM on a DGX Spark (GB10, SM12.1) and hitting this during model warmup:

This kernel only supports sm120.
cudaErrorLaunchFailure: unspecified launch failure

We found the root cause and fixed it.


What’s happening

vLLM’s Blackwell FP8 CUTLASS kernel wrappers use a guard struct called enable_sm120_only in csrc/cutlass_extensions/common.hpp:

#if __CUDA_ARCH__ == 1200
    // runs on SM12.0
#else
    asm("trap;");   // kills SM12.1 (__CUDA_ARCH__ == 1210)
#endif
GB10 is SM12.1, not SM12.0. When you add 12.1f to SCALED_MM_ARCHS (which you must, for NVFP4 acceleration), nvcc compiles sm_121f cubins with __CUDA_ARCH__ = 1210. Every FP8 GEMM call hits the trap. The crash fires ~40 times during Torch Inductor warmup and the model never starts.

The irony: enable_sm120_family already exists in the same file with the correct guard (>= 1200 && < 1300). It just wasn’t being used.
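To make the difference concrete, here are the two guards' conditions restated as plain Python over the `__CUDA_ARCH__` value (illustrative only; the real guards are C++ wrapper structs in common.hpp):

```python
def sm120_only(cuda_arch: int) -> bool:
    # enable_sm120_only: exact match, so SM12.1 (1210) hits the trap path
    return cuda_arch == 1200

def sm120_family(cuda_arch: int) -> bool:
    # enable_sm120_family: accepts the whole SM12.x family
    return 1200 <= cuda_arch < 1300
```

On GB10 (`__CUDA_ARCH__ == 1210`) the first check fails and the kernel traps; the second passes.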


The fix

Two files in vLLM source need one substitution each:

sed -i 's/enable_sm120_only/enable_sm120_family/g' \
    csrc/quantization/w8a8/cutlass/c3x/scaled_mm.cuh \
    csrc/quantization/w8a8/cutlass/c3x/scaled_mm_sm120_fp8_dispatch.cuh

Or use the patch script if you’re building in Docker.


Result

Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod  ✅
Using NvFp4LinearBackend.VLLM_CUTLASS for NVFP4 GEMM               ✅
Application startup complete.                                         ✅

Tested on Nemotron-3-Super-120B-A12B-NVFP4, 10-minute soak at 32 concurrent requests, zero errors.


Why did it ever work before?

Earlier vLLM builds (including v0.17.1) didn’t have 12.1f in SCALED_MM_ARCHS, so no sm_121f cubins were compiled. CUDA silently fell back to PTX JIT from the sm_120f binary, which was compiled with __CUDA_ARCH__ = 1200 — the guard passed. Once you add 12.1f (needed for NVFP4), the bug surfaces.
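A simplified mental model of why the bug stayed hidden (a toy sketch of cubin selection and PTX JIT fallback, not the actual CUDA driver logic):

```python
def select_binary(device_arch: int, embedded: dict[str, list[int]]) -> str:
    """Pick an exact-match cubin if one exists, else JIT the best PTX.

    `embedded` maps "cubin"/"ptx" to the __CUDA_ARCH__ values the
    binaries were compiled for.
    """
    if device_arch in embedded.get("cubin", []):
        return f"cubin compiled with __CUDA_ARCH__={device_arch}"
    ptx = [a for a in embedded.get("ptx", []) if a <= device_arch]
    if ptx:
        return f"ptx-jit compiled with __CUDA_ARCH__={max(ptx)}"
    raise RuntimeError("no compatible binary for this device")

# Before the build change: only sm_120 binaries exist, so SM12.1 JITs
# from PTX compiled with __CUDA_ARCH__ = 1200 and the `== 1200` guard
# passes. After adding 12.1f: an exact 1210 cubin exists, the guard
# sees 1210, and every FP8 GEMM traps.
```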


Patches and full context

👉 https://github.com/saifgithub/vllm-gb10-sm121

Huge credit to Avarok Cybersecurity and eugr for the foundational GB10 bring-up work. This fix is the last piece that was blocking native FP8 on SM12.1.
