# Marlin Fix: NVFP4 Actually Works on SM121 (DGX Spark)

Sharing some more results from testing NVFP4 on the Spark — this time digging into *why* the default backend is broken and how the Marlin fix works.

**TL;DR:** Set 3 environment variables and NVFP4 goes from broken/slow to **50 tok/s on DGX Spark**. Marlin is 16% faster and uses 7 GB less memory than the default FlashInfer path.

## The Problem

NVFP4 models on DGX Spark (SM121) default to **broken CUTLASS kernels**. You might not even know it: vLLM doesn't crash, it silently falls back to slower codepaths that use more memory and run 16% slower.

Check your vLLM logs. If you see this, you’re affected:

```
[Autotuner]: Skipping tactic … due to failure while profiling:
[TensorRT-LLM][ERROR] Failed to initialize cutlass TMA WS grouped gemm
```
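
If the log has already scrolled past, here's a small sketch for scanning a saved copy for the failure signature. The log path is just an example; capture it however fits your setup (e.g. `docker logs <container> > vllm.log 2>&1`):

```python
# Scan a saved vLLM log for the broken-CUTLASS autotuner signature.
PATTERNS = (
    "Skipping tactic",
    "Failed to initialize cutlass TMA WS grouped gemm",
)

with open("vllm.log") as log:  # hypothetical path to your captured log
    hits = [line.rstrip() for line in log if any(p in line for p in PATTERNS)]

print(f"{len(hits)} autotuner failure lines found")
print(*hits[:5], sep="\n")
```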

## Why It’s Broken

SM121 (DGX Spark GB10) lacks `tcgen05` tensor core instructions that datacenter Blackwell (SM100/SM110) has. vLLM’s backend auto-selection picks `FLASHINFER_CUTLASS` because SM121 has capability >= 100:

```python
# vLLM source: nvfp4_utils.py lines 59-64
if current_platform.has_device_capability(100) and has_flashinfer():
    backend = NvFp4LinearBackend.FLASHINFER_CUTLASS  # ← BROKEN on sm_121!
```

The CUTLASS FP4 kernels emit `cvt` PTX instructions with the `.e2m1x2` conversion type, which SM121 doesn't support. The autotuner detects this and skips the broken tactics, falling back to whatever works, which is slower and uses more memory.
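
You can confirm what vLLM's capability gate sees with PyTorch's standard device-capability API. On a GB10 it returns `(12, 1)`, i.e. 121, which passes the `>= 100` check even though `tcgen05` is absent:

```python
import torch

# GB10 reports (12, 1) -> "sm_121"; vLLM's >= 100 gate treats it like
# datacenter Blackwell, which is how the broken backend gets selected.
major, minor = torch.cuda.get_device_capability()
print(f"sm_{major}{minor}")  # e.g. sm_121 on DGX Spark
```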

## The Fix: 3 Environment Variables

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
```

That’s it. This forces the **Marlin backend**, which dequantizes FP4 to BF16 on the fly using operations that work correctly on SM121. Marlin only needs capability >= 75 (Turing), so SM121 is well supported.
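
For intuition, here's a minimal sketch of what FP4→BF16 dequantization looks like. This is *not* Marlin's actual kernel (Marlin fuses the conversion into the GEMM, in registers), and the nibble packing order and function name are assumptions for illustration:

```python
import torch

# The 16 representable E2M1 values: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=torch.bfloat16,
)

def dequant_nvfp4(packed: torch.Tensor, block_scales: torch.Tensor) -> torch.Tensor:
    """Sketch: unpack two FP4 codes per byte, look them up, apply block scales.

    packed:       uint8, two 4-bit codes per byte (low-nibble-first assumed)
    block_scales: one scale per 16-element block (NVFP4's block size)
    """
    codes = torch.stack([packed & 0x0F, packed >> 4], dim=-1).flatten().long()
    vals = E2M1_LUT[codes]
    return (vals.view(-1, 16) * block_scales.view(-1, 1).bfloat16()).flatten()
```

The point is that nothing here needs FP4 tensor cores: the conversion is table-lookup and integer work, and the GEMM itself runs on ordinary BF16 hardware.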

Your vLLM logs should now show:

```
Using NvFp4LinearBackend.MARLIN for NVFP4 GEMM
Using 'MARLIN' NvFp4 MoE backend out of potential backends: ['VLLM_CUTLASS', 'MARLIN']
```

## What Each Variable Does

| Variable | Value | Purpose |
|----------|-------|---------|
| `VLLM_USE_FLASHINFER_MOE_FP4` | `0` | Disables FlashInfer's FP4 MoE kernel path |
| `VLLM_NVFP4_GEMM_BACKEND` | `marlin` | Forces Marlin for all NVFP4 linear layers |
| `VLLM_TEST_FORCE_FP8_MARLIN` | `1` | Also routes FP8 operations through Marlin |

## Benchmark Proof

Tested on DGX Spark GB10 with Nemotron-3-Nano-30B-A3B-NVFP4 (19 GB model), identical settings except backend:

| Backend | Memory | tok/s | Notes |
|---------|:------:|:-----:|-------|
| **Marlin** | **32 GB** | **50.0** | Clean, no errors |
| FlashInfer (default) | 39 GB | 42.6 | CUTLASS errors in log, falls back |

Marlin: **16% faster, 7 GB less memory, zero errors.**
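
If you want to reproduce the throughput side yourself, here's a rough single-stream measurement against the server started by the launch command in the next section. The model name and port are taken from this post; counting one token per streamed chunk is an approximation, and the total includes prefill time:

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
n_chunks = 0
stream = client.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    prompt="Explain why MoE models are bandwidth-bound at batch size 1.",
    max_tokens=512,
    temperature=0.0,
    stream=True,
)
for _ in stream:
    n_chunks += 1  # roughly one token per streamed chunk
elapsed = time.perf_counter() - start
print(f"~{n_chunks / elapsed:.1f} tok/s (single stream, includes prefill)")
```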

## Full Launch Command

```bash
docker run -d --runtime=nvidia \
  --name nemotron-nvfp4 \
  -v /path/to/hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm-node:latest \
  python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8000 \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --enforce-eager \
    --gpu-memory-utilization 0.2 \
    --max-model-len 8192 \
    --kv-cache-dtype fp8 \
    --trust-remote-code
```
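
Once the container is up (model load can take a while), a quick smoke test against the OpenAI-compatible endpoint confirms the server is serving. The port is the one mapped above:

```python
import requests

# The server lists the served model once weights have finished loading.
r = requests.get("http://localhost:8000/v1/models", timeout=5)
r.raise_for_status()
print(r.json()["data"][0]["id"])  # expect the Nemotron NVFP4 model ID
```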

## Does This Apply to Other NVFP4 Models?

Yes — any NVFP4/ModelOpt FP4 model running on SM121 (DGX Spark) or SM120 (RTX 5090, RTX PRO 6000) should benefit from the Marlin backend. The CUTLASS FP4 kernel issue affects all consumer Blackwell GPUs that lack `tcgen05`.

Models we've seen reported as affected:

- Nemotron-3-Nano-30B-A3B-NVFP4

- Nemotron-3-Super-120B-A12B-NVFP4

- Qwen3-VL-235B-A22B-NVFP4

- Qwen3.5-122B-A10B-NVFP4

- GLM-4.7-Flash-NVFP4

## When Will Native FP4 Work on SM121?

No timeline from NVIDIA. Active PRs:

- CUTLASS #3038: SM121-gated MXFP4 kernel wiring

- vLLM #35947: Software E2M1 conversion for SM12x

- vLLM #38126: Architecture suffix preservation (merged)

Until native support lands, Marlin is the recommended path. It’s not using native FP4 tensor cores (it dequantizes to BF16), but it’s still faster than the broken CUTLASS fallback and delivers the full memory savings of the NVFP4 checkpoint format.

## Credit

The Marlin backend discovery came from the DGX Spark community:

- Forum thread: *We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ!*

*Tested March 26, 2026 — DGX Spark GB10, CUDA 13.2, Driver 580.142, vLLM 0.18.1rc1 (eugr build)*


I've continued testing since this post and compiled the findings and configs into a GitHub repo for easier reference:

Includes the Marlin fix from this thread, plus TurboQuant KV cache compression (240K context at 64 GB) and a working mamba-ssm build for aarch64.


this is gold. thank you for sharing.

This is interesting info. Why did @eugr's spark-vllm-docker switch the NVFP4 models back from MARLIN to the CUTLASS backend about 3 weeks ago?

I’m going from memory of this but…

(1) There was actually a specific performance regression in Marlin at the time (this stuff changes on a nightly basis), so the immediate switch was made to protect people from that regression.

(2) There is a lot of ongoing recent work to properly support NVFP4 via CUTLASS for SM121, which is the "right way". Marlin is great as a compatibility layer, but it's not the best for performance.

If we're interested in performance, we definitely do not want to dequantize to BF16, since we then have to perform operations on 4x more data, a big penalty given the Spark's limited memory bandwidth. That's why all of the marketing for Spark is built around NVFP4 estimates…

But all of this also ties back to the fact that NVFP4 is currently not the best option on the DGX Spark, which is a shame, and I'm optimistic it's going to get better. In the meantime, other quants like AWQ or Intel AutoRound are better. Once NVFP4 finally gets better (hopefully it'll start to shape up alongside the next few CUDA iterations with the transition towards tiles), it will help a lot. And at that point, you definitely won't want to be using Marlin.


Just a question. I am new, so sorry for bothering; my Spark came just today. What is the best model in terms of simple intelligence/quality for coding, if not NVFP4? Would it be something like the Qwen 3.5 122B int4 autoround from Intel? Or something else? Coder Next on FP8 maybe? Or something completely different? I just want the highest-quality coding model to use with Opencode and OpenWebUI. Sorry that it's completely unrelated to the discussion; I am new and curious…

I'm mostly looking for what the best-performing option is today. NVFP4 is marketed as maintaining better quality than other quantizations; how good is NVFP4 vs. int4 (autoround) in terms of quality? When running with the Marlin backend, what is the token-rate performance of NVFP4 vs. int4 in a typical recipe? I'm looking for the best option to run today; I'm not capable of hacking on CUTLASS to bolt on support on my own. Once better "official" support comes along, I'd happily switch to it.

Varies model by model and quant by quant. NVFP4 is just 4-bit quantization unless the model was natively trained in NVFP4. It's not inherently "better", but it is supposed to be natively supported; that implied performance benefit would be the difference. Quantization quality depends on the quality of calibration and task alignment.

For Nemotron models, I’d stick with the NVFP4.

For Qwen3.5 models, I’d stick with int4 autoround for now.

Unfortunately the answer is both dependent on what you’re doing and constantly changing.

That’s why we publish recipes and why I make sparkrun – because we need to be able to jump around between different models, settings, etc. quickly and easily…


Yeah that doesn’t really have much to do with whether a model is NVFP4 or any other quant, but that’s OK!

  1. MiniMax-M2.5 AWQ is pretty solid if you have 2+ sparks.
  2. Qwen 3.5 122B is strong. int4 autoround for 1-2 sparks. awq4 for 4 sparks.
  3. Qwen3 Coder Next FP8 or int4 intel autoround is also good.

All of those models work pretty well in terms of the quants. The differences between them have more to do with the model designs and their training.

I also do like Gemma4 26B MoE awq quant, but I think it needs some SFT to really use it as a coding model. Anyway I rambled a bit. Hope it helps. One of the best ways to go about this stuff is to just try them. If you use sparkrun (I’m biased), it’s quick enough to just keep swapping models. Try a few.

Indeed, my current choice is to use 397B int4 until something better for my use comes along.

Yeah I was just doing some testing last night and was finding that Qwen3.5 122B wasn’t bad but it’s still kind of weak compared to 397B. (I was using an AWQ quant with MTP on TP4 and ROPEx2 so 512K context).

The 122B was fine enough but was a bit weak on debugging or more complicated architectural understanding. (Probably fine doing grunt work on well-defined tasks.)

397B is still a bit shy of what I want it to be, but it was almost there. It was relatively similar to Opus 4.6 for the task I was playing with as a qualitative eval. (Although a lot slower since even w/ 4 nodes, 397B (A17B) AWQ via marlin kernel wasn’t shockingly fast… 122B was pretty comfortable on speed actually, 397B felt slow… but it wasn’t so slow to be unusable).

I didn’t really do a complicated eval – possible that I need to do my own quant to try to calibrate it to work well for those longer context lengths. (I aim for 512k context because I feel like I need it to be able to “hang out” and be productive in the 200-400k context range).

end rambling.

Quick note: this post is Part 2 of what I originally planned as a 4-part series. Part 1 is "Memory Creep on DGX Spark: Where Your 128 GB Actually Goes (And How to Stop It)", the "where does the RAM go" diagnostic that this Marlin post sits on top of.

The remaining pieces (vLLM single-user tuning, the full benchmark matrix, plus some adjacent work: NVFP4 guide, mamba_dev, TurboQuant KV cache experiments) I've moved to GitHub rather than continuing the forum series: Sggin1/spark-ai-containers (Docker containers for AI models on NVIDIA DGX Spark: GB10, SM121, aarch64; TurboQuant KV cache compression + mamba-ssm aarch64 build). It's more of a side project I can't always engage with.

To be fair, "broken" was my opinion: failing to initialize on SM121 is specific and fixable, not a blanket indictment. The active vLLM PR #35947 will change the right answer over time. The numbers here are one model, one build, one point in time; "Marlin wins on SM121" holds for that window, not forever. As pfnguyen and dbsci both noted, the picture is dynamic.

Appreciate the engagement.

FYI: vLLM PR #35947 ("fix: Software E2M1 conversion for SM12x NVFP4 activation quantization" by blake-snc) was superseded by https://github.com/vllm-project/vllm/pull/37725, which was merged a while ago.


Just catching up after coming back from my travel. The reason was a reported (and confirmed by me) quality loss when using the Marlin NVFP4 backend. It was probably temporary, but CUTLASS is pretty stable, and since a few PRs have now been merged into FlashInfer, I'm probably going to switch from vLLM CUTLASS to FlashInfer CUTLASS (need to run a few tests first).


Thanks for the pointer to #37725, that's the piece I was missing. Marlin was working around that, not the kernel architecture. I've mostly been on other projects the last couple of weeks; I hope to rebuild with a current vLLM and re-check whether CUTLASS now works correctly on SM121 post-fix, when time permits, lol. Appreciate the correction.


Good context, and appreciate the insights, thanks!
