Running Nemotron 3 Super 120B (NVFP4/modelopt_mixed) on a DGX Spark (GB10, SM12.1) with vllm/vllm-openai:v0.17.1. Got it working, but had to work around a nasty silent corruption issue: FlashInferFP8ScaledMMLinearKernel and CutlassFP8ScaledMMLinearKernel are compiled for SM12.0 at most — on SM12.1 they pass the is_supported() check but produce all-NaN logits at runtime (no Python exception, just garbage tokens). The fix was to patch those two is_supported() methods to reject SM12.1, forcing fallback to PerTensorTorchFP8ScaledMMLinearKernel (torch._scaled_mm / cuBLAS), which works correctly but only gives ~15 tokens/s rather than the 24+ you’d expect from native FP8.
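For anyone hitting the same issue, here is a minimal sketch of the patch pattern I used. The real classes live inside vLLM (`FlashInferFP8ScaledMMLinearKernel`, `CutlassFP8ScaledMMLinearKernel`) and their module paths and `is_supported()` signatures vary by version, so a stand-in class is used below to keep the snippet self-contained; treat the exact call shape as an assumption and adapt it to the classes in your vLLM build.

```python
# Sketch of the is_supported() override that forces fallback to the
# torch._scaled_mm (cuBLAS) kernel on SM12.1. FakeFP8Kernel is a
# stand-in for the real vLLM kernel classes, NOT the actual API.

class FakeFP8Kernel:
    """Stand-in for a vLLM scaled-MM linear kernel class (assumption)."""

    @classmethod
    def is_supported(cls, compute_capability=None):
        # Upstream check: claims support even on SM12.1, where the
        # kernel silently emits NaN logits at runtime.
        return True


def reject_sm121(kernel_cls):
    """Wrap a kernel class's is_supported() to return False on SM12.1."""
    orig = kernel_cls.is_supported.__func__  # unwrap the classmethod

    def patched(cls, compute_capability=None):
        if compute_capability == (12, 1):
            # Force the kernel selector to skip this class, so vLLM
            # falls back to PerTensorTorchFP8ScaledMMLinearKernel.
            return False
        return orig(cls, compute_capability)

    kernel_cls.is_supported = classmethod(patched)


# Apply to each broken kernel class before model load:
reject_sm121(FakeFP8Kernel)
print(FakeFP8Kernel.is_supported((12, 1)))  # False -> fallback path
print(FakeFP8Kernel.is_supported((12, 0)))  # True  -> unchanged
```

In the real container I applied the equivalent override to both FlashInfer and CUTLASS kernel classes before the engine initializes, which is what routes matmuls through the slower-but-correct cuBLAS path.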
My gut says there may be a special build or a recent update somewhere that properly targets SM12.1 — either a flashinfer wheel built against SM12.1, a custom vLLM image, or a pending upstream PR. Has anyone found a path to native SM12.1 FP8 kernels? Would love to know if there’s a build I’m missing before I go down the route of compiling flashinfer from source against SM_121.
I did scour the forum but wasn't able to find anything. Maybe I wasn't looking in the right places?
Thanks for any help!