FWIW:
=== GPQA Diamond ===
base_url: http://gb10:8000/v1
model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4
questions: 198
repeats: 5
total eval calls: 990
score (all repeats): 0.7545 (75.45%)
correct / total: 747 / 990
failed requests: 0
prompt tokens total: 272,511
completion tokens total: 16,691,185
total tokens: 16,963,696
avg tokens / call: 17135.0
wall time (s): 18433.3
For reference, the full-precision weights score 76.1 according to the model card (nvidia/Nemotron-Cascade-2-30B-A3B on Hugging Face).
Honestly very cool :)
5 hours and 17 million tokens later, stability and accuracy :)
Which PR do we use to build the image? There are several PRs. Thanks.
If you want to build as normal: clone FlashInfer and vLLM from source, then merge these two PRs into their respective main branches:
main ← johnnynunez:main
opened 10:48PM - 29 Mar 26 UTC
### Summary
- Add missing `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` compile flag to all CUTLASS fused MoE JIT modules (SM100/SM103/SM120) and `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to SM90 modules
- Sync nv_internal `grid_dependency_control.h` with upstream CUTLASS to support SM100/SM103/SM110/SM120/SM121 GDC
- Add `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to FP8 blockscale GEMM SM90 module
### Problem
Random `cudaErrorIllegalInstruction` crashes on DGX Spark (SM121) and RTX 50-series (SM120) when running NVFP4 MoE models (e.g., Nemotron, Qwen3.5-122B) under load. The crashes are intermittent and worsen with longer context lengths and higher concurrency.
**Root cause:** PR #2780 fixed the missing GDC compile flags for GEMM modules (`flashinfer/jit/gemm/core.py`), but the **CUTLASS fused MoE modules** in `flashinfer/jit/fused_moe.py` and the **FP8 blockscale GEMM module** were not fixed. This is the exact same class of bug as #2708.
Without `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`, CUTLASS's `grid_dependency_control.h` compiles `wait_on_dependent_grids()` and `launch_dependent_grids()` as **empty no-ops**:
```cpp
CUTLASS_DEVICE void wait_on_dependent_grids() {
#if (defined(CUTLASS_GDC_ENABLED)) // ← not defined without the flag
asm volatile("griddepcontrol.wait;");
#endif
}
```
Meanwhile, the host-side code still sets `programmaticStreamSerializationAllowed = true` (PDL enabled) via `device_support_pdl()`, which returns `True` for all `major >= 9`, including SM12x. This means:
1. **Host enables PDL** → CUDA runtime overlaps consecutive kernels
2. **Device GDC barriers are no-ops** → No synchronization between overlapping kernels
3. **Race condition** → Dependent kernel reads stale global memory → corruption → `cudaErrorIllegalInstruction`
The crash is random because it depends on exact kernel scheduling timing, which varies per request.
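The mechanics above also explain the stopgap people were using until a fixed build was available: serializing kernel launches hides the race entirely (the test plan below confirms the workaround is no longer needed after the fix). A minimal sketch:

```shell
# Stopgap before the GDC fix: CUDA_LAUNCH_BLOCKING=1 makes every kernel
# launch synchronous, so PDL overlap (and hence the race) cannot occur.
# It costs significant throughput; remove it once you run a fixed build.
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
# then launch the server in this environment, e.g.:
# vllm serve <model> ...
```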
### Fix
**`flashinfer/jit/fused_moe.py`** — Added GDC flags to all CUTLASS fused MoE modules:
| Module | Flag | Architectures Covered |
|---|---|---|
| `gen_cutlass_fused_moe_sm120_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` | SM120, SM121 |
| `gen_cutlass_fused_moe_sm103_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` | SM103, SM120, SM121 |
| `gen_cutlass_fused_moe_sm100_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` | SM100, SM110, SM120, SM121 |
| `gen_cutlass_fused_moe_sm90_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` | SM90 |
| `gen_trtllm_gen_fused_moe_sm100_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` | SM100+, SM120, SM121 |
**`flashinfer/jit/gemm/fp8_blockscale.py`** — Added `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to `gen_fp8_blockscale_gemm_sm90_module()`.
**`csrc/nv_internal/.../grid_dependency_control.h`** — Synced with upstream CUTLASS (`3rdparty/cutlass/include/cutlass/arch/grid_dependency_control.h`) to add SM100+ GDC support. Previously only handled SM90, so any nv_internal TensorRT-LLM code compiled for SM12x would have GDC barriers silently compiled as no-ops.
### Why `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` covers SM12x
CUTLASS uses a single flag for the entire Blackwell family. From `grid_dependency_control.h`:
```cpp
#if(CUDA_BARRIER_ENABLED && defined(CUTLASS_ENABLE_GDC_FOR_SM100) && defined(__CUDA_ARCH__) && \
((__CUDA_ARCH__ == 1000 && ...) || // SM100
(__CUDA_ARCH__ == 1030 && ...) || // SM103
(__CUDA_ARCH__ == 1100 && ...) || // SM110
(__CUDA_ARCH__ == 1200 && ...) || // SM120 (RTX 50-series)
(__CUDA_ARCH__ == 1210 && ...))) // SM121 (DGX Spark)
#define CUTLASS_GDC_ENABLED
```
### Why SM90 GDC flag was NOT added to SM100+ modules
PR #2716 attempted to add both `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` and `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` to all modules. It broke AOT builds because `sm120_gemm_tma_warpspecialized_cooperative_asymmetric_dma.hpp` checks `CUTLASS_ENABLE_GDC_FOR_SM90` and calls `scheduler.is_last_tile()` — a method not present on the SM120 scheduler. PR #2780 corrected this by using only the SM100 flag for SM100+ modules. This PR follows the same approach.
### Related
- #2708 — Original issue: missing GDC flags cause PDL race condition
- #2716 — First fix attempt (reverted — broke AOT)
- #2780 — Corrected fix for GEMM modules only
- [vllm-project/vllm#38423](https://github.com/vllm-project/vllm/pull/38423) — NVFP4 bugfix on DGX Spark
- [NVIDIA/cutlass#3121](https://github.com/NVIDIA/cutlass/pull/3121) — K=64 block-scaled GEMM tiles (separate issue)
### Test plan
- [x] Clear JIT cache: `rm -rf ~/.cache/flashinfer/`
- [x] Run NVFP4 MoE model on SM121 (DGX Spark) with 128K context under load — verify no `cudaErrorIllegalInstruction`
- [x] Run NVFP4 MoE model on SM120 (RTX 50-series) with concurrent requests — verify no NaN/garbage output
- [x] Verify `CUDA_LAUNCH_BLOCKING=1` workaround is no longer needed
- [x] AOT build with `FLASHINFER_CUDA_ARCH_LIST="12.1a"` completes without errors
- [x] SM90 (Hopper) fused MoE tests pass: `pytest tests/moe/`
- [x] SM100 GEMM tests still pass (no regression from existing GDC flags)
## Summary by CodeRabbit
* **New Features**
* Expanded GPU kernel compilation support: enabled additional optimizations for NVIDIA SM100 and SM90 GPUs, activating dependency-control optimizations where available.
* Updated JIT/GEMM build configs to include these architecture-specific compile options, improving performance and compatibility on supported hardware.
main ← johnnynunez:main
opened 07:58AM - 28 Mar 26 UTC
## Summary
Fix `cudaErrorIllegalInstruction` when running NVFP4 models (e.g. `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`) on SM12x GPUs (RTX 50 series SM120, DGX Spark SM121).
### Root causes
1. **CUTLASS v4.2.2 lacks SM12x NVFP4 tile constraints** — The bundled CUTLASS was missing SM120f family-level compilation support for NVFP4/MX Grouped GEMM and SM121-specific tile configurations (DGX Spark). This caused `IllegalInstruction` during decode when small-M tile variants were selected. Related upstream: [NVIDIA/cutlass#3038](https://github.com/NVIDIA/cutlass/pull/3038).
2. **FlashInfer 0.6.6 bundles CUTLASS 4.2.1** — The FlashInfer CUTLASS MoE backend failed on SM12x with `Failed to initialize cutlass TMA WS grouped gemm` due to the same missing tile constraints. Fixed upstream in [flashinfer-ai/flashinfer#2798](https://github.com/flashinfer-ai/flashinfer/pull/2798).
3. **`cutlass_scaled_mm_supports_fp4()` reported false availability** — Only checked CUDA runtime version (`>= 12080`), not whether the SM-specific kernel was actually compiled. On a build with only `ENABLE_NVFP4_SM100`, it incorrectly reported CUTLASS as available for SM12x, then failed at dispatch.
4. **Quantization kernels had no SM runtime guard** — The `scaled_fp4_quant`, `silu_and_mul_nvfp4_quant`, and expert quant entry points dispatched to `_sm1xxa` kernels if *any* SM1xx was compiled, with no runtime check. If only SM100 SASS existed, CUDA would JIT-compile SM100 PTX for SM120 (different major arch), producing illegal instructions asynchronously — surfacing later at `synchronize()` as an opaque CUDA error.
5. **FlashInfer CUTLASS backend bypassed quant kernel checks** — `select_nvfp4_linear_backend()` selected FlashInfer CUTLASS solely on `has_device_capability(100)`, without verifying the vLLM quantization kernels (used by all non-Marlin backends) were compiled for the current SM.
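Root cause (4) hinges on a major-architecture mismatch. A toy shell illustration of the guard's logic (all values hypothetical; on a real box the device value comes from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`):

```shell
# Toy illustration of the runtime guard: a kernel compiled only for SM100
# must not be dispatched on an SM12x device (different major architecture),
# even though CUDA would happily JIT the SM100 PTX for it.
device_cap=121        # hypothetical: DGX Spark reports compute capability 12.1
compiled_caps="100"   # hypothetical: the build contains only SM100 kernels
supported=no
for c in $compiled_caps; do
  # native SASS requires a matching major architecture (here 10 vs 12)
  [ $((c / 10)) -eq $((device_cap / 10)) ] && supported=yes
done
echo "native kernel supported: $supported"
```

With these hypothetical values the check reports `no`, so the backend should fall back (e.g. to Marlin) instead of dispatching.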
### Changes
| File | Change |
|---|---|
| `CMakeLists.txt` | Bump CUTLASS from v4.2.2 to **v4.4.2** — enables SM120f (family) compilation for NVFP4/MX Grouped GEMM, covering RTX 50 (SM120) and DGX Spark (SM121) |
| `docker/Dockerfile` | Bump FlashInfer from 0.6.6 to **0.6.7** (includes CUTLASS 4.4.2, fixes TMA grouped GEMM on SM12x) |
| `docker/Dockerfile.nightly_torch` | Same FlashInfer bump (source build) |
| `docker/versions.json` | `FLASHINFER_VERSION`: `0.6.6` → `0.6.7` |
| `nvfp4_scaled_mm_entry.cu` | `cutlass_scaled_mm_supports_fp4()` now checks compile-time `ENABLE_NVFP4_SM100`/`ENABLE_NVFP4_SM120` guards per SM range instead of a blanket `>= 100` check |
| `nvfp4_quant_entry.cu` | Added `nvfp4_quant_sm_supported()` runtime guard to all four quant entry points (`scaled_fp4_quant`, `scaled_fp4_experts_quant`, `silu_and_mul_nvfp4_quant`, `silu_and_mul_scaled_fp4_experts_quant`) |
| `nvfp4_utils.py` | `select_nvfp4_linear_backend()` gates FlashInfer CUTLASS on `cutlass_fp4_supported()` + adds validation assert for all FlashInfer backends |
### What is NOT changed
**Marlin remains a valid fallback on SM12x.** Marlin FP4 uses weight-only dequantization to BF16 — it does not use native FP4 tensor core instructions and works correctly on all Blackwell architectures including DGX Spark. Benchmarks confirm Marlin is stable on SM121 (~558 tok/s, on par with vLLM CUTLASS at ~562 tok/s). The Marlin path (`apply_fp4_marlin_linear`) bypasses the vLLM quant kernels entirely, so the SM guards in `nvfp4_quant_entry.cu` do not affect it.
### Behavior on SM12x after this PR
| Scenario | Before | After |
|---|---|---|
| Build includes `ENABLE_NVFP4_SM120` + CUTLASS v4.4.2 | `IllegalInstruction` | Native CUTLASS backend selected, works correctly |
| Build lacks `ENABLE_NVFP4_SM120` | `IllegalInstruction` (SM100 PTX JIT to SM120) | Native CUTLASS correctly reports unavailable; **Marlin selected as fallback** — works correctly |
| FlashInfer CUTLASS MoE on SM12x | `Failed to initialize cutlass TMA WS grouped gemm` (CUTLASS 4.2.1 in FlashInfer 0.6.6) | Works correctly with FlashInfer 0.6.7 (CUTLASS 4.4.2) |
### Follow-up: FlashInfer 0.6.8
[flashinfer-ai/flashinfer#2738](https://github.com/flashinfer-ai/flashinfer/pull/2738) (merged March 28, 2026) adds native NVFP4 and MXFP4 group GEMM support for SM120/SM121 (RTX 50 / DGX Spark) directly in FlashInfer. This will land in FlashInfer **0.6.8**. Once released, `FLASHINFER_VERSION` should be bumped in `docker/Dockerfile`, `docker/Dockerfile.nightly_torch`, and `docker/versions.json` to unlock FlashInfer's own SM12x NVFP4/MXFP4 kernels (including GDC unguarding and PDL group GEMM fixes). TODO comments have been added to both Dockerfiles tracking this.
## Test plan
- [x] Build with `CUDA_ARCHS="12.0a;12.1a"` on DGX Spark (SM121), verify NVFP4 model serves with vLLM CUTLASS backend (`VLLM_NVFP4_GEMM_BACKEND=cutlass --moe-backend=cutlass`)
- [x] Verify FlashInfer CUTLASS MoE on SM12x no longer hits TMA init error
- [x] Build with `CUDA_ARCHS="12.0a;12.1a"`, verify Marlin fallback still works (`VLLM_NVFP4_GEMM_BACKEND=marlin --moe-backend=marlin`)
- [x] Build with `CUDA_ARCHS="10.0a"` only, verify Marlin fallback on SM12x (no `IllegalInstruction`)
- [x] Verify SM100 (B200) still works with native CUTLASS (no regression from CUTLASS bump)
- [x] Verify SM89/SM90 still works (pre-Blackwell unaffected)
- [x] Run `tests/models/quantization/test_nvfp4.py` on SM120+
- [x] Docker build completes with FlashInfer 0.6.7 for both `Dockerfile` and `Dockerfile.nightly_torch`
from the vllm directory after merging:
uv venv --python 3.12 --seed .venv
source .venv/bin/activate
export TORCH_CUDA_ARCH_LIST=12.1a
uv pip install -ve . --torch-backend=auto --refresh
That'll take a bit. Then:
uv pip uninstall flashinfer-cubin flashinfer-python
from the flashinfer directory after merging:
uv pip install --no-build-isolation -e .
when cloning the source of flashinfer make sure you do it recursively since there are submodules for cutlass.
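For example (URL assumed to be the upstream repo; substitute your fork if you're merging the PRs there):

```shell
# CUTLASS is vendored as git submodules, so clone recursively
git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
# if you already cloned without --recursive, fetch the submodules after the fact:
cd flashinfer && git submodule update --init --recursive
```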
Johnny's guide has a step that `rm -rf`s your entire `~/.cache` folder wholesale, but the only directories you need to purge before the FlashInfer JIT / starting vLLM are:
rm -rf ~/.cache/vllm
rm -rf ~/.cache/flashinfer
rm -rf ~/.triton
rm -rf ~/.config/vllm
If you don’t want to nuke all your models :)
Can you post your vLLM serve settings? Would love to give Cascade a try as well.
~/spark-vllm-docker$ uvx llama-benchy --base-url http://127.0.0.1:8000/v1 --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --pp 2048 --depth 4096 16000 32000
Installed 49 packages in 30ms
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 17:26:46
Benchmarking model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Downloading book from https://www.gutenberg.org/files/1661/1661-0.txt...
Saved text to cache: /home/csolutions_ai/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 162015
Warming up...
Warmup (User only) complete. Delta: 16 tokens (Server: 38, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 2.35 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------------------------------------------|----------------:|------------------:|-------------:|------------------:|------------------:|------------------:|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 5909.41 ± 3864.42 | | 5085.66 ± 6183.45 | 5083.31 ± 6183.45 | 5085.71 ± 6183.45 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d4096 | 56.28 ± 0.04 | 58.10 ± 0.04 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16000 | 7660.58 ± 529.70 | | 2370.16 ± 172.01 | 2367.81 ± 172.01 | 2370.22 ± 172.00 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d16000 | 56.46 ± 0.09 | 58.28 ± 0.10 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32000 | 7263.28 ± 169.66 | | 4692.55 ± 111.45 | 4690.20 ± 111.45 | 4692.61 ± 111.46 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d32000 | 56.54 ± 0.10 | 58.37 ± 0.11 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 17:26:46 | latency mode: api
~/spark-vllm-docker$
Buying my Nvidia cap.
Before you run these with the JIT FlashInfer build, you'll want to set:
export MAX_JOBS=8
to keep compilation from OOMing. Also run:
uv pip install fastsafetensors
and
sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb"
if you want the weight loading to scream.
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --kv-cache-dtype fp8 --trust-remote-code --gpu-memory-utilization 0.85 --max-num-seqs 512 --enable-prefix-caching --max-cudagraph-capture-size 512 --mamba-ssm-cache-dtype float32 --reasoning-parser nemotron_v3 --tool-call-parser qwen3_coder --enable-auto-tool-choice --port 8000 --host 0.0.0.0 --load-format fastsafetensors
That was what was used in the gpqa accuracy run I posted previously.
I'm currently running another pass to compare the accuracy impact of switching to --mamba-ssm-cache-dtype float16, to see if there's a huge accuracy falloff (it comes with a decent performance bump, but time will tell whether the model turns into a potato).
If that goes alright, I'm going to take it a little further and drop --mamba-cache-dtype to float16 as well, so both the SSM state and the conv state get the same treatment, and see if there's an impact.
Considering all the hot models right now are Mamba/causal-conv1d hybrids, I'd really like to see the impact firsthand.
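The comparison boils down to re-running the same serve command with only the cache-dtype flags varied; a sketch (run one serve at a time, all other flags as in the command above):

```shell
MODEL=chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4
# baseline, as used for the GPQA run: SSM state kept in float32
vllm serve "$MODEL" --mamba-ssm-cache-dtype float32   # ...rest of flags as above
# experiment 1: only the SSM state dropped to float16
vllm serve "$MODEL" --mamba-ssm-cache-dtype float16   # ...
# experiment 2: SSM state and conv state both in float16
vllm serve "$MODEL" --mamba-cache-dtype float16       # ...
```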
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
--mamba-ssm-cache-dtype float32 \
--max-model-len 262144 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3 \
--kv-cache-dtype fp8
$ uvx llama-benchy --base-url http://127.0.0.1:8000/v1 --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --pp 2048 --depth 4096 16000 32000 128000
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 17:47:43
Benchmarking model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Loading text from cache: /home/csolutions_ai/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 143827
Warming up...
Warmup (User only) complete. Delta: 33 tokens (Server: 55, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 5.59 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=128000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------------------------|-----------------:|------------------:|----------------:|------------------:|------------------:|------------------:|
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d4096 | 6178.99 ± 1076.38 | | 1028.80 ± 166.52 | 1023.21 ± 166.52 | 1029.00 ± 166.69 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d4096 | 59.47 ± 1.00 | 61.71 ± 0.81 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d16000 | 7368.29 ± 503.82 | | 2467.01 ± 176.24 | 2461.42 ± 176.24 | 2467.08 ± 176.27 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d16000 | 57.85 ± 1.38 | 60.52 ± 0.44 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d32000 | 7334.35 ± 52.13 | | 4648.09 ± 32.89 | 4642.50 ± 32.89 | 4648.15 ± 32.90 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d32000 | 56.24 ± 2.23 | 74.63 ± 21.18 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d128000 | 4709.97 ± 20.59 | | 27455.18 ± 249.64 | 27611.77 ± 120.71 | 27618.46 ± 120.97 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d128000 | 236.89 ± 115.28 | 540.34 ± 282.67 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 17:47:43 | latency mode: api
I haven’t done what @trystan1 recommended; it might explode.
The only thing I’m currently recommending would be the purchase of a high quality nvidia baseball cap.
The day NVIDIA finally gets NVFP4 officially working on this device (if ever), I’ll consider whether they belong on the good-guys list again. NVFP4 is essential for the DGX Spark, and honestly, it should have been ready when the Spark launched.
Until then, I’d much rather wear a community hat. The people here who invested their time and did the hard work to find workarounds when NVIDIA failed to pull its own weight are the ones who deserve the credit. You guys are amazing.
My guess is there will be a big push on NVFP4 performance for GB10 between now and when the N1X systems/laptops ship.
It would make sense for nvidia to make the gb10 the dev platform/laptop platform of choice for cuda.
Pure speculation on my part, but seems reasonable.
eugr
March 30, 2026, 5:24pm
The performance seems to be lower than both Marlin and VLLM_CUTLASS though.
eugr
March 30, 2026, 5:27pm
Looks like vLLM PR #38423 got merged, so only FI #2913 is left. I'll run my build with FI #2913 applied, and if it solves the issue, the next community build will include these changes.
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--max-num-seqs 512 \
--enable-prefix-caching \
--max-cudagraph-capture-size 512 \
--mamba-ssm-cache-dtype float32 \
--reasoning-parser nemotron_v3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--port 8000 \
--host 0.0.0.0 \
--max-model-len 262144 \
--load-format fastsafetensors 2>&1
uvx llama-benchy --base-url http://127.0.0.1:8000/v1 --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --pp 2048 --depth 4096 16000 32000 128000 256000
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 19:22:18
Benchmarking model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Loading text from cache: /root/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 143827
Warming up...
Warmup (User only) complete. Delta: 33 tokens (Server: 55, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 3.37 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=128000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=256000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------------------------|-----------------:|------------------:|--------------------:|-------------------:|-------------------:|-------------------:|
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d4096 | 8578.40 ± 30.83 | | 719.60 ± 2.58 | 716.23 ± 2.58 | 719.69 ± 2.58 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d4096 | 56.84 ± 0.05 | 58.68 ± 0.05 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d16000 | 8011.36 ± 3.26 | | 2256.17 ± 0.92 | 2252.80 ± 0.92 | 2256.27 ± 0.92 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d16000 | 56.87 ± 0.10 | 58.71 ± 0.11 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d32000 | 7346.73 ± 8.63 | | 4637.83 ± 5.45 | 4634.45 ± 5.45 | 4637.90 ± 5.47 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d32000 | 55.52 ± 1.58 | 61.30 ± 4.00 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d128000 | 4757.62 ± 103.09 | | 27350.86 ± 590.90 | 27347.49 ± 590.90 | 27350.92 ± 590.91 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d128000 | 8095.44 ± 6004.28 | 45761.69 ± 46158.55 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d256000 | 3056.12 ± 70.90 | | 84485.71 ± 1977.42 | 84482.34 ± 1977.42 | 84485.76 ± 1977.42 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d256000 | 3200.18 ± 3680.35 | 8432.24 ± 10133.48 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 19:22:18 | latency mode: api
I was just setting up the same, but using @eugr's spark-vllm-docker system.
In OpenClaw the initial response was a little slow, but the next few commands with tools were about twice as fast as the Nano version. I did have to dial back --gpu-memory-utilization to 0.80; 124.2 GB of system memory was cutting it a little close.
eugr
March 30, 2026, 6:31pm
OK, looks like it's not crashing with these two PRs, but at least for Nemotron-3-Super the performance is:
- higher for PP, e.g. ~2140 t/s vs. ~1700 t/s with Marlin/VLLM_CUTLASS at 8192-token context
- slightly lower for TG, e.g. 14.5 vs. 15.5 t/s
I think I'll keep the recipes pinned to Marlin/VLLM_CUTLASS for now, at least until the autotuner errors are gone, but I will update the build to include these PRs (actually, just the FlashInfer one for now, as the vLLM one is merged).
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 | 2205.39 ± 13.67 | | 934.82 ± 5.78 | 928.67 ± 5.78 | 934.99 ± 5.79 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 | 14.42 ± 0.07 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 | 2195.56 ± 7.67 | | 2804.56 ± 9.76 | 2798.40 ± 9.76 | 2804.75 ± 9.75 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d4096 | 14.36 ± 0.01 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d8192 | 2159.83 ± 19.23 | | 4747.63 ± 42.48 | 4741.48 ± 42.48 | 4747.77 ± 42.55 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d8192 | 14.47 ± 0.11 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 | 2122.38 ± 5.93 | | 8690.80 ± 24.31 | 8684.65 ± 24.31 | 8690.93 ± 24.28 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d16384 | 14.50 ± 0.19 | 15.33 ± 0.47 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32078 | 2003.24 ± 7.67 | | 17041.63 ± 65.53 | 17035.48 ± 65.53 | 17041.90 ± 65.76 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d32078 | 14.33 ± 0.04 | 15.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 11:29:12 | latency mode: api | pp basis: ttfr
There are open PRs that improve nvfp4 and nemotron super. No worries!
eugr
March 30, 2026, 9:44pm
Rebuilt from main again and restarted my Spark (because it crashed due to the shutdown issue), and I'm getting better performance now. Not sure which change did it, but I'm definitely including this PR in the next run.
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 | 1868.93 ± 551.73 | | 1240.96 ± 459.45 | 1231.45 ± 459.45 | 1241.11 ± 459.45 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 | 15.27 ± 0.04 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 | 1552.78 ± 814.54 | | 6943.61 ± 5704.38 | 6934.10 ± 5704.38 | 6943.71 ± 5704.39 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d4096 | 15.17 ± 0.02 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d8192 | 2220.81 ± 6.41 | | 4620.48 ± 13.30 | 4610.97 ± 13.30 | 4620.59 ± 13.30 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d8192 | 15.21 ± 0.08 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 | 2179.33 ± 6.15 | | 8467.23 ± 23.89 | 8457.72 ± 23.89 | 8467.32 ± 23.88 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d16384 | 15.27 ± 0.17 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32078 | 2051.58 ± 6.14 | | 16643.66 ± 49.85 | 16634.15 ± 49.85 | 16643.78 ± 49.88 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d32078 | 15.20 ± 0.06 | 16.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 14:42:37 | latency mode: api | pp basis: ttfr
eugr
March 30, 2026, 10:08pm
BTW, the FlashInfer autotuner can be disabled with --kernel_config '{"enable_flashinfer_autotune": false}'. Since it's failing now anyway, disabling it doesn't affect performance in any meaningful way, but it eliminates the annoying error traces.
There are several PRs coming down the pipe to boost the CUTLASS/FlashInfer kernels. That seems to be where all the development attention is going.
They all depend on one another, so my hat goes off to the courageous optimizers of all things CUDA.