# Marlin Fix: NVFP4 Actually Works on SM121 (DGX Spark)

Sharing some more results from testing NVFP4 on the Spark — this time digging into *why* the default backend is broken and how the Marlin fix works.

**TL;DR:** Set 3 environment variables and NVFP4 goes from broken/slow to **50 tok/s on DGX Spark**. Marlin is 16% faster and uses 7 GB less memory than the default FlashInfer path.

## The Problem

NVFP4 models on DGX Spark (SM121) default to **broken CUTLASS kernels**. You might not even know it: vLLM doesn't crash, it silently falls back to slower codepaths that use more memory and run 16% slower.

Check your vLLM logs. If you see this, you’re affected:

```
[Autotuner]: Skipping tactic … due to failure while profiling:
[TensorRT-LLM][ERROR] Failed to initialize cutlass TMA WS grouped gemm
```
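
If the log has already scrolled past, here's a small sketch for scanning a saved copy for the failure signature. The log path is just an example; capture it however fits your setup (e.g. `docker logs <container> > vllm.log 2>&1`):

```python
# Scan a saved vLLM log for the broken-CUTLASS autotuner signature.
PATTERNS = (
    "Skipping tactic",
    "Failed to initialize cutlass TMA WS grouped gemm",
)

with open("vllm.log") as log:  # hypothetical path to your captured log
    hits = [line.rstrip() for line in log if any(p in line for p in PATTERNS)]

print(f"{len(hits)} autotuner failure lines found")
print(*hits[:5], sep="\n")
```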

## Why It’s Broken

SM121 (DGX Spark GB10) lacks `tcgen05` tensor core instructions that datacenter Blackwell (SM100/SM110) has. vLLM’s backend auto-selection picks `FLASHINFER_CUTLASS` because SM121 has capability >= 100:

```python
# vLLM source: nvfp4_utils.py lines 59-64
if current_platform.has_device_capability(100) and has_flashinfer():
    backend = NvFp4LinearBackend.FLASHINFER_CUTLASS  # ← BROKEN on sm_121!
```

The CUTLASS FP4 kernels emit `cvt` PTX instructions with the `.e2m1x2` conversion type, which SM121 doesn't support. The autotuner detects this and skips the broken tactics, falling back to whatever works, which is slower and uses more memory.
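
You can confirm what vLLM's capability gate sees with PyTorch's standard device-capability API. On a GB10 it returns `(12, 1)`, i.e. 121, which passes the `>= 100` check even though `tcgen05` is absent:

```python
import torch

# GB10 reports (12, 1) -> "sm_121"; vLLM's >= 100 gate treats it like
# datacenter Blackwell, which is how the broken backend gets selected.
major, minor = torch.cuda.get_device_capability()
print(f"sm_{major}{minor}")  # e.g. sm_121 on DGX Spark
```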

## The Fix: 3 Environment Variables

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
```

That’s it. This forces the **Marlin backend**, which dequantizes FP4 to BF16 on the fly using operations that work correctly on SM121. Marlin only needs capability >= 75 (Turing), so SM121 is well supported.
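
For intuition, here's a minimal sketch of what FP4→BF16 dequantization looks like. This is *not* Marlin's actual kernel (Marlin fuses the conversion into the GEMM, in registers), and the nibble packing order and function name are assumptions for illustration:

```python
import torch

# The 16 representable E2M1 values: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=torch.bfloat16,
)

def dequant_nvfp4(packed: torch.Tensor, block_scales: torch.Tensor) -> torch.Tensor:
    """Sketch: unpack two FP4 codes per byte, look them up, apply block scales.

    packed:       uint8, two 4-bit codes per byte (low-nibble-first assumed)
    block_scales: one scale per 16-element block (NVFP4's block size)
    """
    codes = torch.stack([packed & 0x0F, packed >> 4], dim=-1).flatten().long()
    vals = E2M1_LUT[codes]
    return (vals.view(-1, 16) * block_scales.view(-1, 1).bfloat16()).flatten()
```

The point is that nothing here needs FP4 tensor cores: the conversion is table-lookup and integer work, and the GEMM itself runs on ordinary BF16 hardware.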

Your vLLM logs should now show:

```
Using NvFp4LinearBackend.MARLIN for NVFP4 GEMM
Using 'MARLIN' NvFp4 MoE backend out of potential backends: ['VLLM_CUTLASS', 'MARLIN']
```

## What Each Variable Does

| Variable | Value | Purpose |
|----------|-------|---------|
| `VLLM_USE_FLASHINFER_MOE_FP4` | `0` | Disables FlashInfer's FP4 MoE kernel path |
| `VLLM_NVFP4_GEMM_BACKEND` | `marlin` | Forces Marlin for all NVFP4 linear layers |
| `VLLM_TEST_FORCE_FP8_MARLIN` | `1` | Also routes FP8 operations through Marlin |

## Benchmark Proof

Tested on DGX Spark GB10 with Nemotron-3-Nano-30B-A3B-NVFP4 (19 GB model), identical settings except backend:

| Backend | Memory | tok/s | Notes |
|---------|:------:|:-----:|-------|
| **Marlin** | **32 GB** | **50.0** | Clean, no errors |
| FlashInfer (default) | 39 GB | 42.6 | CUTLASS errors in log, falls back |

Marlin: **16% faster, 7 GB less memory, zero errors.**
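
If you want to reproduce the throughput side yourself, here's a rough single-stream measurement against the server started by the launch command in the next section. The model name and port are taken from this post; counting one token per streamed chunk is an approximation, and the total includes prefill time:

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
n_chunks = 0
stream = client.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    prompt="Explain why MoE models are bandwidth-bound at batch size 1.",
    max_tokens=512,
    temperature=0.0,
    stream=True,
)
for _ in stream:
    n_chunks += 1  # roughly one token per streamed chunk
elapsed = time.perf_counter() - start
print(f"~{n_chunks / elapsed:.1f} tok/s (single stream, includes prefill)")
```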

## Full Launch Command

```bash
docker run -d --runtime=nvidia \
  --name nemotron-nvfp4 \
  -v /path/to/hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm-node:latest \
  python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8000 \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --enforce-eager \
    --gpu-memory-utilization 0.2 \
    --max-model-len 8192 \
    --kv-cache-dtype fp8 \
    --trust-remote-code
```
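
Once the container is up (model load can take a while), a quick smoke test against the OpenAI-compatible endpoint confirms the server is serving. The port is the one mapped above:

```python
import requests

# The server lists the served model once weights have finished loading.
r = requests.get("http://localhost:8000/v1/models", timeout=5)
r.raise_for_status()
print(r.json()["data"][0]["id"])  # expect the Nemotron NVFP4 model ID
```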

## Does This Apply to Other NVFP4 Models?

Yes — any NVFP4/ModelOpt FP4 model running on SM121 (DGX Spark) or SM120 (RTX 5090, RTX PRO 6000) should benefit from the Marlin backend. The CUTLASS FP4 kernel issue affects all consumer Blackwell GPUs that lack `tcgen05`.

Models we've seen reported as affected:

- Nemotron-3-Nano-30B-A3B-NVFP4

- Nemotron-3-Super-120B-A12B-NVFP4

- Qwen3-VL-235B-A22B-NVFP4

- Qwen3.5-122B-A10B-NVFP4

- GLM-4.7-Flash-NVFP4

## When Will Native FP4 Work on SM121?

No timeline from NVIDIA. Active PRs:

- CUTLASS #3038: SM121-gated MXFP4 kernel wiring

- vLLM #35947: Software E2M1 conversion for SM12x

- vLLM #38126: Architecture suffix preservation (merged)

Until native support lands, Marlin is the recommended path. It’s not using native FP4 tensor cores (it dequantizes to BF16), but it’s still faster than the broken CUTLASS fallback and delivers the full memory savings of the NVFP4 checkpoint format.

## Credit

The Marlin backend discovery came from the DGX Spark community:

- Forum thread: *We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ!*

*Tested March 26, 2026 — DGX Spark GB10, CUDA 13.2, Driver 580.142, vLLM 0.18.1rc1 (eugr build)*


I've continued testing since this post and compiled the findings and configs into a GitHub repo for easier reference:

Includes the Marlin fix from this thread, plus TurboQuant KV cache compression (240K context at 64 GB) and a working mamba-ssm build for aarch64.


this is gold. thank you for sharing.

This is interesting info. Why did @eugr's spark-vllm-docker switch the NVFP4 models back from MARLIN to the CUTLASS backend about 3 weeks ago?

I’m going from memory of this but…

(1) There was actually a specific performance regression in Marlin at the time (this stuff changes on a nightly basis), so the immediate switch was made to protect people from that regression.

(2) There is a lot of ongoing recent work to properly support NVFP4 via CUTLASS for SM121, which is the "right way". Marlin is great as a compatibility layer, but it's not the best for performance.

If we're interested in performance, we definitely do not want to dequantize to BF16, since we then have to perform operations on 4x more data, a big penalty given the Spark's limited memory bandwidth. That's why all of the marketing for Spark is built around NVFP4 estimates…

But all of this also ties back to the fact that NVFP4 is currently not the best option on the DGX Spark, which is a shame, and I'm optimistic it's going to get better. In the meantime, other quants like AWQ or Intel AutoRound are better. Once NVFP4 finally gets better (hopefully it'll start to shape up alongside the next few CUDA iterations with the transition towards tiles), it will help a lot. And at that point, you definitely won't want to be using Marlin.


Just a question. I am new, so sorry for bothering; my Spark came just today. What is the best model in terms of simple intelligence/quality for coding, if not NVFP4? Would it be something like the Qwen 3.5 122B int4 autoround from Intel? Or something else? Coder Next on FP8 maybe? Or something completely different? I just want the highest-quality coding model to use with Opencode and OpenWebUI. Sorry that it's completely unrelated to the discussion; I am new and curious…

I'm mostly looking for what the best-performing option is today. NVFP4 is marketed as maintaining better quality than other quantizations; how good is NVFP4 vs. int4 (autoround) in terms of quality? When running with the Marlin backend, what is the token-rate performance of NVFP4 vs. int4 in a typical recipe? I'm looking for the best option to run today; I'm not capable of hacking on CUTLASS to bolt on support on my own. Once better "official" support comes along, I'd happily switch to it.

Varies model by model and quant by quant. NVFP4 is just 4-bit quantization unless the model was natively trained in NVFP4. It's not inherently "better", but it is supposed to be natively supported; that implied performance benefit would be the difference. Quantization quality depends on the quality of calibration and task alignment.

For Nemotron models, I’d stick with the NVFP4.

For Qwen3.5 models, I’d stick with int4 autoround for now.

Unfortunately the answer is both dependent on what you’re doing and constantly changing.

That’s why we publish recipes and why I make sparkrun – because we need to be able to jump around between different models, settings, etc. quickly and easily…


Yeah that doesn’t really have much to do with whether a model is NVFP4 or any other quant, but that’s OK!

  1. MiniMax-M2.5 AWQ is pretty solid if you have 2+ sparks.
  2. Qwen 3.5 122B is strong. int4 autoround for 1-2 sparks. awq4 for 4 sparks.
  3. Qwen3 Coder Next FP8 or int4 intel autoround is also good.

All of those models work pretty well in terms of the quants. The differences between them have more to do with the model designs and their training.

I also do like Gemma4 26B MoE awq quant, but I think it needs some SFT to really use it as a coding model. Anyway I rambled a bit. Hope it helps. One of the best ways to go about this stuff is to just try them. If you use sparkrun (I’m biased), it’s quick enough to just keep swapping models. Try a few.

Indeed, my current choice is to use 397B int4 until something better for my use comes along.

Yeah I was just doing some testing last night and was finding that Qwen3.5 122B wasn’t bad but it’s still kind of weak compared to 397B. (I was using an AWQ quant with MTP on TP4 and ROPEx2 so 512K context).

The 122B was fine enough but was a bit weak on debugging or more complicated architectural understanding. (Probably fine doing grunt work on well-defined tasks.)

397B is still a bit shy of what I want it to be, but it was almost there. It was relatively similar to Opus 4.6 for the task I was playing with as a qualitative eval. (Although a lot slower since even w/ 4 nodes, 397B (A17B) AWQ via marlin kernel wasn’t shockingly fast… 122B was pretty comfortable on speed actually, 397B felt slow… but it wasn’t so slow to be unusable).

I didn’t really do a complicated eval – possible that I need to do my own quant to try to calibrate it to work well for those longer context lengths. (I aim for 512k context because I feel like I need it to be able to “hang out” and be productive in the 200-400k context range).

end rambling.

Quick note: this post is Part 2 of what I originally planned as a 4-part series. Part 1 is "Memory Creep on DGX Spark: Where Your 128 GB Actually Goes (And How to Stop It)", the "where does the RAM go" diagnostic that this Marlin post sits on top of.

The remaining pieces (vLLM single-user tuning, the full benchmark matrix, plus some adjacent work: NVFP4 guide, mamba_dev, TurboQuant KV cache experiments) I've moved to GitHub rather than continuing the forum series: Sggin1/spark-ai-containers (Docker containers for AI models on NVIDIA DGX Spark: GB10, SM121, aarch64; TurboQuant KV cache compression + mamba-ssm aarch64 build). It's more of a side project I can't always engage with.

To be fair, "broken" was my opinion: failing to initialize on SM121 is specific and fixable, not a blanket indictment. The active vLLM PR #35947 will change the right answer over time. The numbers here are one model, one build, one point in time; "Marlin wins on SM121" holds for that window, not forever. As pfnguyen and dbsci both noted, the picture is dynamic.

Appreciate the engagement.

FYI: vLLM PR #35947 ("fix: Software E2M1 conversion for SM12x NVFP4 activation quantization" by blake-snc) was superseded by https://github.com/vllm-project/vllm/pull/37725, which was merged a while ago.


Just catching up after coming back from my travel. The reason was a reported (and confirmed by me) quality loss when using the Marlin NVFP4 backend. It was probably temporary, but CUTLASS is pretty stable, and since a few PRs have now been merged into FlashInfer, I'm probably going to switch from vLLM CUTLASS to FlashInfer CUTLASS (need to run a few tests first).


Thanks for the pointer to #37725, that's the piece I was missing. Marlin was working around that, not the kernel architecture. I've mostly been on other projects the last couple of weeks; I hope to rebuild with a current vLLM and re-check whether CUTLASS now works correctly on SM121 post-fix, when time permits, lol. Appreciate the correction.


Good context, and appreciate the insights, thanks!
