# Marlin Fix: NVFP4 Actually Works on SM121 (DGX Spark)
Sharing some more results from testing NVFP4 on the Spark — this time digging into *why* the default backend is broken and how the Marlin fix works.
**TL;DR:** Set 3 environment variables and NVFP4 goes from broken/slow to **50 tok/s on DGX Spark**. Marlin is 16% faster and uses 7 GB less memory than the default FlashInfer path.
---
## The Problem
By default, NVFP4 models on DGX Spark (SM121) run on **broken CUTLASS kernels**. You might not even notice: vLLM doesn't crash; it silently falls back to slower codepaths that use more memory and run 16% slower.
Check your vLLM logs. If you see this, you’re affected:
```
[Autotuner]: Skipping tactic … due to failure while profiling:
[TensorRT-LLM][ERROR] Failed to initialize cutlass TMA WS grouped gemm
```
## Why It’s Broken
SM121 (DGX Spark GB10) lacks `tcgen05` tensor core instructions that datacenter Blackwell (SM100/SM110) has. vLLM’s backend auto-selection picks `FLASHINFER_CUTLASS` because SM121 has capability >= 100:
```python
# vLLM source: nvfp4_utils.py lines 59-64
if current_platform.has_device_capability(100) and has_flashinfer():
    backend = NvFp4LinearBackend.FLASHINFER_CUTLASS  # ← BROKEN on sm_121!
```
The CUTLASS FP4 kernels emit PTX `cvt` instructions with the `.e2m1x2` type, which SM121 does not support. The autotuner detects the failure and skips those tactics, falling back to whatever still works, which is slower and uses more memory.
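To make the failure mode concrete, here is a minimal sketch of the capability check in pure Python. The enum and function names are illustrative, not vLLM's actual API; the point is that a bare `>= 100` comparison cannot distinguish datacenter Blackwell from SM121:

```python
# Illustrative sketch (not vLLM's real API): a capability-threshold
# backend choice like vLLM's, showing why SM121 lands on CUTLASS.
from enum import Enum

class NvFp4Backend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"
    MARLIN = "marlin"

def pick_backend(capability: int, has_flashinfer: bool) -> NvFp4Backend:
    # SM121 has capability 121, so it passes the >= 100 test even
    # though it lacks the tcgen05 instructions CUTLASS FP4 needs.
    if capability >= 100 and has_flashinfer:
        return NvFp4Backend.FLASHINFER_CUTLASS
    return NvFp4Backend.MARLIN

print(pick_backend(121, True))  # SM121: gets the broken CUTLASS path
print(pick_backend(89, True))   # Ada (SM89): would get Marlin
```

This is exactly the gap the environment variables below paper over: they bypass the auto-selection entirely.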
## The Fix: 3 Environment Variables
```bash
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
```
That’s it. This forces the **Marlin backend**, which dequantizes FP4 to BF16 on the fly using operations that work correctly on SM121. Marlin only needs capability >= 75 (Turing), so SM121 is well supported.
Your vLLM logs should now show:
```
Using NvFp4LinearBackend.MARLIN for NVFP4 GEMM
Using 'MARLIN' NvFp4 MoE backend out of potential backends: ['VLLM_CUTLASS', 'MARLIN']
```
## What Each Variable Does
| Variable | Value | Purpose |
|----------|-------|---------|
| `VLLM_USE_FLASHINFER_MOE_FP4` | `0` | Disables FlashInfer’s FP4 MoE kernel path |
| `VLLM_NVFP4_GEMM_BACKEND` | `marlin` | Forces Marlin for all NVFP4 linear layers |
| `VLLM_TEST_FORCE_FP8_MARLIN` | `1` | Also routes FP8 operations through Marlin |
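If you launch vLLM from Python rather than the shell, the same overrides can be set via `os.environ`. The only requirement (an assumption worth stating: vLLM reads these at import/engine-init time) is that they are set *before* vLLM is imported:

```python
# Set the backend overrides before vLLM is imported, since vLLM
# reads these environment variables at import/initialization time.
import os

os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_NVFP4_GEMM_BACKEND"] = "marlin"
os.environ["VLLM_TEST_FORCE_FP8_MARLIN"] = "1"

# Only now import and construct the engine:
# from vllm import LLM
# llm = LLM(model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4")
```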
## Benchmark Proof
Tested on DGX Spark GB10 with Nemotron-3-Nano-30B-A3B-NVFP4 (19 GB model), identical settings except backend:
| Backend | Memory | tok/s | Notes |
|---------|:------:|:-----:|-------|
| **Marlin** | **32 GB** | **50.0** | Clean, no errors |
| FlashInfer (default) | 39 GB | 42.6 | CUTLASS errors in log, falls back |
Marlin: **16% faster, 7 GB less memory, zero errors.**
## Full Launch Command
```bash
docker run -d --runtime=nvidia \
  --name nemotron-nvfp4 \
  -v /path/to/hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm-node:latest \
  python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8000 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --enforce-eager \
  --gpu-memory-utilization 0.2 \
  --max-model-len 8192 \
  --kv-cache-dtype fp8 \
  --trust-remote-code
```
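Once the container reports the Marlin backend in its logs, a quick request against the OpenAI-compatible endpoint confirms the server is actually serving (the model name must match the `--model` flag above; adjust host/port if you changed them):

```shell
# Sanity check: requires the server above to be up and loaded.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
        "prompt": "Hello",
        "max_tokens": 16
      }'
```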
## Does This Apply to Other NVFP4 Models?
Yes — any NVFP4/ModelOpt FP4 model running on SM121 (DGX Spark) or SM120 (RTX 5090, RTX PRO 6000) should benefit from the Marlin backend. The CUTLASS FP4 kernel issue affects all consumer Blackwell GPUs that lack `tcgen05`.
Models we’ve seen reported affected:
- Nemotron-3-Nano-30B-A3B-NVFP4
- Nemotron-3-Super-120B-A12B-NVFP4
- Qwen3-VL-235B-A22B-NVFP4
- Qwen3.5-122B-A10B-NVFP4
- GLM-4.7-Flash-NVFP4
## When Will Native FP4 Work on SM121?
No timeline from NVIDIA. Active PRs:
- CUTLASS #3038: SM121-gated MXFP4 kernel wiring
- vLLM #35947: Software E2M1 conversion for SM12x
- vLLM #38126: Architecture suffix preservation (merged)
Until native support lands, Marlin is the recommended path. It’s not using native FP4 tensor cores (it dequantizes to BF16), but it’s still faster than the broken CUTLASS fallback and delivers the full memory savings of the NVFP4 checkpoint format.
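For intuition about what "dequantizes FP4 to BF16 on the fly" means at the format level, here is a pure-Python illustration of E2M1 decoding with a per-block scale. The real Marlin kernel does this in fused CUDA, and the exact block size and scale format shown here (16 values per scale) are assumptions about the NVFP4 layout, not taken from the Marlin source:

```python
# Illustrative only: the 8 positive E2M1 (FP4) code points, decoded
# via lookup table. E2M1 = 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value (top bit is the sign)."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_LUT[nibble & 0b0111]

def dequantize_block(nibbles: list[int], scale: float) -> list[float]:
    """Dequantize one block of FP4 values sharing a single scale."""
    return [decode_e2m1(n) * scale for n in nibbles]

print(decode_e2m1(0b0111))                 # 6.0, the largest positive code
print(decode_e2m1(0b1010))                 # -1.0 (sign bit set)
print(dequantize_block([0b0010] * 4, 0.5)) # [0.5, 0.5, 0.5, 0.5]
```

With only 8 magnitudes per code point, the decode is a trivial table lookup plus a multiply, which is why Marlin can afford to do it inline during the GEMM.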
## Credit
The Marlin backend discovery came from the DGX Spark community:
- "We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ!"
---
*Tested March 26, 2026 — DGX Spark GB10, CUDA 13.2, Driver 580.142, vLLM 0.18.1rc1 (eugr build)*