NVFP4 issue root cause?

amcintyre · April 4, 2026, 12:01am

From what I understand after talking to Opus, this is the underlying issue, and fixing it would resolve NVFP4 use in TensorRT-LLM, vLLM, and SGLang.

Is that true, or is Opus hallucinating?

github.com/NVIDIA/cutlass

[BUG] StageCountAutoCarveout assumes max family SMEM, breaks SM121 (99 KiB vs SM120 228 KiB)

opened 10:32PM - 02 Apr 26 UTC

mihai-chiorean

### Problem

`StageCountAutoCarveout` computes pipeline stages using the architecture family's maximum `SharedMemoryCapacity`, not the actual device's runtime shared memory limit.

Within the SM12x family:

- **SM120** (B200, RTX PRO 6000): **228 KiB** shared memory per block
- **SM121** (DGX Spark GB10): **99 KiB** shared memory per block

When CUTLASS compiles grouped GEMM kernels for `compute_120f`, `StageCountAutoCarveout` selects stage counts that fit 228 KiB. At runtime on SM121, `gemm.initialize()` fails because `cudaFuncSetAttribute(MaxDynamicSharedMemorySize)` receives a value exceeding 99 KiB.

### Specific failure

MoE grouped GEMM with FP4 (`__nv_fp4_e2m1`), tile `CtaShape128x256x64B`:

```
MoE grouped GEMM requires 102400 bytes shared memory but device supports 101376 (cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:39)
```

102400 bytes (100 KiB) vs 101376 bytes (99 KiB) -- exactly 1 KiB over.

### Impact

In TensorRT-LLM on DGX Spark (SM121):

- 13 of 16 autotuner MoE GEMM tactics fail
- Surviving tactics are small/inefficient: 4.8 tok/s (vs 24 tok/s with llama.cpp software dequant)
- Users must bypass CUTLASS entirely (e.g., Triton MoE backend) to get reasonable performance (32-40 tok/s)

### Root cause

`StageCountAutoCarveout` is a compile-time policy. For a given architecture target (`sm_120f`), it picks the maximum stages that fit the architecture's shared memory spec. But SM12x is not homogeneous -- SM121 has less than half the SMEM of SM120. There's no runtime path to reduce stages based on `cudaDevAttrMaxSharedMemoryPerBlockOptin`.

### Current workaround (in TensorRT-LLM)

We filter candidate tile configs at the TRT-LLM level before they reach CUTLASS, keeping only tiles that fit within the device's actual SMEM. This works but pushes device-awareness to every consumer of CUTLASS.
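The filtering described in the workaround can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not TensorRT-LLM's actual code: `TileCandidate`, `fits_device`, `count_fitting`, and `kExampleTiles` are hypothetical names, and only the 102400-byte request and the 101376-byte limit come from the log above. The other byte counts are made up for illustration.

```cpp
#include <cstddef>

// Hypothetical description of one candidate tile config.
// Not a TensorRT-LLM or CUTLASS type.
struct TileCandidate {
  int m, n, k;             // CTA tile shape
  std::size_t smem_bytes;  // dynamic shared memory the kernel would request
};

// A candidate is usable only if its request fits the device's real limit.
// On a real device the limit would come from
//   cudaDeviceGetAttribute(&v, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
// here it is passed in as a plain number so the sketch needs no GPU.
constexpr bool fits_device(const TileCandidate& tile,
                           std::size_t smem_limit_bytes) {
  return tile.smem_bytes <= smem_limit_bytes;
}

// Counts how many candidates survive the filter.
template <std::size_t N>
constexpr std::size_t count_fitting(const TileCandidate (&tiles)[N],
                                    std::size_t smem_limit_bytes) {
  std::size_t kept = 0;
  for (std::size_t i = 0; i < N; ++i) {
    if (fits_device(tiles[i], smem_limit_bytes)) {
      ++kept;
    }
  }
  return kept;
}

constexpr TileCandidate kExampleTiles[] = {
    {128, 256, 64, 102400},  // the failing request from the log above
    {128, 128, 64, 65536},   // illustrative value, not measured
    {64, 128, 64, 49152},    // illustrative value, not measured
};
```

With the 101376-byte limit reported for SM121, the first candidate is rejected and the two smaller ones are kept, which mirrors the situation described under Impact: only the smaller tiles remain available.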
### Suggested fix

A runtime-aware `StageCount` policy that queries `cudaDevAttrMaxSharedMemoryPerBlockOptin` and clamps stages to fit, or a mechanism to pass max SMEM as a runtime parameter to the grouped GEMM kernel.

### Environment

- Device: DGX Spark GB10 (SM121, `cudaDevAttrMaxSharedMemoryPerBlockOptin` = 101376)
- CUDA: 13.1
- CUTLASS: 4.4.x (via TensorRT-LLM 1.3.0rc10)
- Arch target: `compute_120f` / `sm_120f`

### Related issues

- #3096 -- SM120 NVFP4 MoE garbage output (different root cause)
- #2800 -- BlockScaledMmaOp restricts FP4 to sm_100a only
- #2614 -- Request sm_121 support
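To make the "clamps stages to fit" idea in the suggested fix concrete, here is a minimal sketch of the arithmetic only. `clamp_stages` is a hypothetical helper, not a CUTLASS API, and the 25600-byte stage size is an assumption chosen so that four stages add up to the 102400 bytes in the error message. The real per-stage footprint of that kernel is not stated in the issue.

```cpp
#include <cstddef>

// Hypothetical helper, not part of CUTLASS: the largest pipeline stage
// count that fits in the shared memory the device actually offers.
//   smem_limit_bytes : runtime limit, e.g. the value reported by
//                      cudaDeviceGetAttribute(&v,
//                          cudaDevAttrMaxSharedMemoryPerBlockOptin, dev)
//   carveout_bytes   : shared memory reserved outside the mainloop stages
//   bytes_per_stage  : shared memory consumed by one mainloop stage
//   compiled_stages  : stage count that was picked at compile time
constexpr int clamp_stages(std::size_t smem_limit_bytes,
                           std::size_t carveout_bytes,
                           std::size_t bytes_per_stage,
                           int compiled_stages) {
  if (compiled_stages <= 0 || bytes_per_stage == 0 ||
      smem_limit_bytes <= carveout_bytes) {
    return 0;  // nothing fits, the caller should skip this kernel
  }
  const std::size_t fit =
      (smem_limit_bytes - carveout_bytes) / bytes_per_stage;
  return fit < static_cast<std::size_t>(compiled_stages)
             ? static_cast<int>(fit)
             : compiled_stages;
}

// Limits quoted in the issue.
constexpr std::size_t kSm121LimitBytes = 101376;      // 99 KiB
constexpr std::size_t kFamilyMaxBytes = 228 * 1024;   // 228 KiB

// Assumed stage size: 4 stages * 25600 bytes = 102400 bytes.
constexpr std::size_t kAssumedStageBytes = 25600;
```

Under these assumed numbers, `clamp_stages(kSm121LimitBytes, 0, kAssumedStageBytes, 4)` yields 3 stages, while the 228 KiB figure leaves the compiled 4 stages untouched. Because the stage count in CUTLASS is a template parameter, a real fix would presumably have to pick among precompiled variants (or reject the kernel) rather than change the value inside an already compiled kernel.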

