PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM

FWIW:

=== GPQA Diamond ===
base_url: http://gb10:8000/v1
model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4
questions: 198
repeats: 5
total eval calls: 990
score (all repeats): 0.7545 (75.45%)
correct / total: 747 / 990
failed requests: 0
prompt tokens total: 272,511
completion tokens total: 16,691,185
total tokens: 16,963,696
avg tokens / call: 17135.0
wall time (s): 18433.3
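The derived numbers check out against the raw counts above; a quick awk sanity check (the ~920 tok/s aggregate throughput figure is my own derivation from the totals, not something the harness prints):

```shell
# Re-derive the summary stats from the raw counts reported above.
score=$(awk 'BEGIN { printf "%.4f", 747 / 990 }')         # correct / total eval calls
avg=$(awk 'BEGIN { printf "%.1f", 16963696 / 990 }')      # total tokens / total calls
tps=$(awk 'BEGIN { printf "%.0f", 16963696 / 18433.3 }')  # total tokens / wall time
echo "score=$score avg_tokens=$avg throughput_tps=$tps"
```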

The full-precision weights score 76.1 according to the model card: nvidia/Nemotron-Cascade-2-30B-A3B · Hugging Face

Honestly very cool :)

5 hours and 17 million tokens later, stability and accuracy :)

5 Likes

Which PR do we use to build the image? There are several PRs. Thanks.

Johnny’s guide here works

3 Likes

If you want to build as normal: clone flashinfer and vllm from source, then merge these two PRs into their respective main branches.

from the vllm directory after merging:

uv venv --python 3.12 --seed .venv

source .venv/bin/activate

export TORCH_CUDA_ARCH_LIST=12.1a

uv pip install -ve .  --torch-backend=auto --refresh

that’ll take a bit, then:

uv pip uninstall flashinfer-cubin flashinfer-python

from the flashinfer directory after merging:

uv pip install --no-build-isolation -e .

When cloning the flashinfer source, make sure you clone recursively (git clone --recursive), since it has submodules for CUTLASS.
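If you already cloned without --recursive, you don’t have to re-clone; the submodules can be pulled in after the fact:

```shell
# From inside an existing flashinfer checkout that was cloned without
# --recursive: fetch the CUTLASS (and other) submodules the build needs.
git submodule update --init --recursive

# Verify the submodules are now populated:
git submodule status
```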

Johnny’s guide has a step that rm -rfs your .cache folder wholesale, but the only directories you need to purge before the flashinfer JIT / starting vllm are:

rm -rf ~/.cache/vllm

rm -rf ~/.cache/flashinfer

rm -rf ~/.triton

rm -rf ~/.config/vllm

If you don’t want to nuke all your models :)
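If you want to script that, here’s a sketch of a purge helper. It’s demonstrated against a throwaway directory standing in for $HOME, so it’s safe to run anywhere; on the real machine you’d point it at your actual home directory.

```shell
# Purge only the vllm/flashinfer/triton JIT + config caches, leaving
# everything else (e.g. downloaded models under .cache/huggingface) alone.
# The temp dir here is a stand-in for $HOME.
home=$(mktemp -d)
mkdir -p "$home/.cache/vllm" "$home/.cache/flashinfer" \
         "$home/.triton" "$home/.config/vllm" \
         "$home/.cache/huggingface"                 # models: must survive

for d in .cache/vllm .cache/flashinfer .triton .config/vllm; do
    rm -rf "$home/$d"
done

ls -A "$home/.cache"   # only huggingface should remain
```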

4 Likes

Can you post your vLLM serve settings? Would love to give Cascade a try as well.

~/spark-vllm-docker$ uvx llama-benchy   --base-url http://127.0.0.1:8000/v1   --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --pp 2048   --depth 4096 16000 32000
Installed 49 packages in 30ms
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 17:26:46
Benchmarking model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Downloading book from https://www.gutenberg.org/files/1661/1661-0.txt...
Saved text to cache: /home/csolutions_ai/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 162015
Warming up...
Warmup (User only) complete. Delta: 16 tokens (Server: 38, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)

Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 2.35 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Printing results in MD format:



| model                                       |            test |               t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:--------------------------------------------|----------------:|------------------:|-------------:|------------------:|------------------:|------------------:|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 |  pp2048 @ d4096 | 5909.41 ± 3864.42 |              | 5085.66 ± 6183.45 | 5083.31 ± 6183.45 | 5085.71 ± 6183.45 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 |    tg32 @ d4096 |      56.28 ± 0.04 | 58.10 ± 0.04 |                   |                   |                   |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16000 |  7660.58 ± 529.70 |              |  2370.16 ± 172.01 |  2367.81 ± 172.01 |  2370.22 ± 172.00 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 |   tg32 @ d16000 |      56.46 ± 0.09 | 58.28 ± 0.10 |                   |                   |                   |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32000 |  7263.28 ± 169.66 |              |  4692.55 ± 111.45 |  4690.20 ± 111.45 |  4692.61 ± 111.46 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 |   tg32 @ d32000 |      56.54 ± 0.10 | 58.37 ± 0.11 |                   |                   |                   |

llama-benchy (0.3.5)
date: 2026-03-30 17:26:46 | latency mode: api
~/spark-vllm-docker$ 

Buying my Nvidia cap.

3 Likes

Before you run these with the JIT flashinfer build, you’ll want to set:

export MAX_JOBS=8

to keep the compilation from OOMing.

also uv pip install fastsafetensors

and

sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb"

if you want the weight loading to scream

vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --kv-cache-dtype fp8 --trust-remote-code --gpu-memory-utilization 0.85 --max-num-seqs 512 --enable-prefix-caching --max-cudagraph-capture-size 512 --mamba-ssm-cache-dtype float32 --reasoning-parser nemotron_v3 --tool-call-parser qwen3_coder --enable-auto-tool-choice --port 8000 --host 0.0.0.0 --load-format fastsafetensors

That was what was used in the gpqa accuracy run I posted previously.

I’m currently running another comparison to see how much accuracy falls off with --mamba-ssm-cache-dtype float16 (it’s a good performance bump, but time will tell whether the model turns into a potato).

If that goes alright, I’m going to take it a little further and drop --mamba-cache-dtype to float16 as well, so both the SSM state and the conv state get the same treatment, and see if there’s an impact.

Considering all the hot models right now are mamba/causal-conv1d hybrids, I’d really like to see the impact firsthand.
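For reference, the A/B here is just the same serve command with the cache-dtype flags swapped. A sketch (MODEL and COMMON are my own shorthand; substitute the full flag set from the command above, and run one variant at a time):

```shell
MODEL=chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4
COMMON="--kv-cache-dtype fp8 --trust-remote-code --gpu-memory-utilization 0.85"

# Baseline (the GPQA run above): SSM state kept in float32
vllm serve "$MODEL" $COMMON --mamba-ssm-cache-dtype float32

# Experiment 1: SSM state cache down to float16
vllm serve "$MODEL" $COMMON --mamba-ssm-cache-dtype float16

# Experiment 2: conv state too, via the broader mamba cache flag
vllm serve "$MODEL" $COMMON --mamba-cache-dtype float16
```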

2 Likes
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
    --mamba-ssm-cache-dtype float32 \
    --max-model-len 262144 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nemotron_v3 \
    --kv-cache-dtype fp8

$ uvx llama-benchy   --base-url http://127.0.0.1:8000/v1   --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --pp 2048   --depth 4096 16000 32000 128000
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 17:47:43
Benchmarking model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Loading text from cache: /home/csolutions_ai/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 143827
Warming up...
Warmup (User only) complete. Delta: 33 tokens (Server: 55, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)

Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 5.59 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=128000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Printing results in MD format:



| model                                      |             test |               t/s |        peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:-------------------------------------------|-----------------:|------------------:|----------------:|------------------:|------------------:|------------------:|
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |   pp2048 @ d4096 | 6178.99 ± 1076.38 |                 |  1028.80 ± 166.52 |  1023.21 ± 166.52 |  1029.00 ± 166.69 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |     tg32 @ d4096 |      59.47 ± 1.00 |    61.71 ± 0.81 |                   |                   |                   |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |  pp2048 @ d16000 |  7368.29 ± 503.82 |                 |  2467.01 ± 176.24 |  2461.42 ± 176.24 |  2467.08 ± 176.27 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |    tg32 @ d16000 |      57.85 ± 1.38 |    60.52 ± 0.44 |                   |                   |                   |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |  pp2048 @ d32000 |   7334.35 ± 52.13 |                 |   4648.09 ± 32.89 |   4642.50 ± 32.89 |   4648.15 ± 32.90 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |    tg32 @ d32000 |      56.24 ± 2.23 |   74.63 ± 21.18 |                   |                   |                   |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d128000 |   4709.97 ± 20.59 |                 | 27455.18 ± 249.64 | 27611.77 ± 120.71 | 27618.46 ± 120.97 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |   tg32 @ d128000 |   236.89 ± 115.28 | 540.34 ± 282.67 |                   |                   |                   |

llama-benchy (0.3.5)
date: 2026-03-30 17:47:43 | latency mode: api

I haven’t done what @trystan1 recommended; it might explode.

2 Likes

The only thing I’m currently recommending would be the purchase of a high quality nvidia baseball cap.

1 Like

The day NVIDIA finally gets NVFP4 officially working on this device (if ever), I’ll consider whether they belong on the good-guys list again. NVFP4 is essential for the DGX Spark, and honestly, it should have been ready when the Spark launched.

Until then, I’d much rather wear a community hat. The people here who invested their time and did the hard work to find workarounds when NVIDIA failed to pull its own weight are the ones who deserve the credit. You guys are amazing.

4 Likes

My guess is there will be a big push to get NVFP4 performance up on GB10 between now and the shipping of the N1X systems/laptops later.

It would make sense for nvidia to make the gb10 the dev platform/laptop platform of choice for cuda.

Pure speculation on my part, but seems reasonable.

1 Like

The performance seems to be lower than both Marlin and VLLM_CUTLASS, though.

1 Like

Looks like vLLM PR #38423 got merged, so only FI #2913 is left to merge. I’ll run my build with FI #2913 applied and if it solves the issue, then the next community build will include these changes.

6 Likes
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 512 \
  --enable-prefix-caching \
  --max-cudagraph-capture-size 512 \
  --mamba-ssm-cache-dtype float32 \
  --reasoning-parser nemotron_v3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --port 8000 \
  --host 0.0.0.0 \
  --max-model-len 262144 \
  --load-format fastsafetensors 2>&1
uvx llama-benchy --base-url http://127.0.0.1:8000/v1 --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --pp 2048 --depth 4096 16000 32000 128000 256000
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 19:22:18
Benchmarking model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Loading text from cache: /root/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 143827
Warming up...
Warmup (User only) complete. Delta: 33 tokens (Server: 55, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)

Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 3.37 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=128000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=256000, concurrency=1
  Run 1/3 (batch size 1)...
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Printing results in MD format:



| model                                      |             test |               t/s |            peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:-------------------------------------------|-----------------:|------------------:|--------------------:|-------------------:|-------------------:|-------------------:|
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |   pp2048 @ d4096 |   8578.40 ± 30.83 |                     |      719.60 ± 2.58 |      716.23 ± 2.58 |      719.69 ± 2.58 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |     tg32 @ d4096 |      56.84 ± 0.05 |        58.68 ± 0.05 |                    |                    |                    |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |  pp2048 @ d16000 |    8011.36 ± 3.26 |                     |     2256.17 ± 0.92 |     2252.80 ± 0.92 |     2256.27 ± 0.92 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |    tg32 @ d16000 |      56.87 ± 0.10 |        58.71 ± 0.11 |                    |                    |                    |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |  pp2048 @ d32000 |    7346.73 ± 8.63 |                     |     4637.83 ± 5.45 |     4634.45 ± 5.45 |     4637.90 ± 5.47 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |    tg32 @ d32000 |      55.52 ± 1.58 |        61.30 ± 4.00 |                    |                    |                    |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d128000 |  4757.62 ± 103.09 |                     |  27350.86 ± 590.90 |  27347.49 ± 590.90 |  27350.92 ± 590.91 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |   tg32 @ d128000 | 8095.44 ± 6004.28 | 45761.69 ± 46158.55 |                    |                    |                    |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d256000 |   3056.12 ± 70.90 |                     | 84485.71 ± 1977.42 | 84482.34 ± 1977.42 | 84485.76 ± 1977.42 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 |   tg32 @ d256000 | 3200.18 ± 3680.35 |  8432.24 ± 10133.48 |                    |                    |                    |

llama-benchy (0.3.5)
date: 2026-03-30 19:22:18 | latency mode: api
1 Like

I was just setting up the same, but using @eugr’s spark-vllm-docker system.

In OpenClaw the initial response was a little slow, but the next few commands with tools were about twice as fast as the Nano version. I did have to dial back --gpu-memory-utilization to 0.80; 124.2 GB of system memory was cutting it a little close.

OK, looks like it’s not crashing with these two PRs, but at least for Nemotron-3-super, the performance is:

  • Higher for PP, e.g.: ~2140 t/s vs. ~1700 t/s with Marlin/VLLM_CUTLASS at 8192 token context
  • Slightly lower for TG, e.g. 14.5 vs. 15.5

I think I’ll keep the recipes pinned to Marlin/VLLM_CUTLASS for now, maybe at least until autotuner errors are gone, but will update the build to include these PRs (actually, just Flashinfer one for now, as vLLM one is merged).

| model                                          |            test |             t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-----------------------------------------------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |          pp2048 | 2205.39 ± 13.67 |              |    934.82 ± 5.78 |    928.67 ± 5.78 |    934.99 ± 5.79 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |            tg32 |    14.42 ± 0.07 | 15.00 ± 0.00 |                  |                  |                  |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |  pp2048 @ d4096 |  2195.56 ± 7.67 |              |   2804.56 ± 9.76 |   2798.40 ± 9.76 |   2804.75 ± 9.75 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |    tg32 @ d4096 |    14.36 ± 0.01 | 15.00 ± 0.00 |                  |                  |                  |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |  pp2048 @ d8192 | 2159.83 ± 19.23 |              |  4747.63 ± 42.48 |  4741.48 ± 42.48 |  4747.77 ± 42.55 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |    tg32 @ d8192 |    14.47 ± 0.11 | 15.00 ± 0.00 |                  |                  |                  |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 |  2122.38 ± 5.93 |              |  8690.80 ± 24.31 |  8684.65 ± 24.31 |  8690.93 ± 24.28 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |   tg32 @ d16384 |    14.50 ± 0.19 | 15.33 ± 0.47 |                  |                  |                  |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32078 |  2003.24 ± 7.67 |              | 17041.63 ± 65.53 | 17035.48 ± 65.53 | 17041.90 ± 65.76 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |   tg32 @ d32078 |    14.33 ± 0.04 | 15.00 ± 0.00 |                  |                  |                  |

llama-benchy (0.3.5)
date: 2026-03-30 11:29:12 | latency mode: api | pp basis: ttfr

There are open PRs that improve NVFP4 and Nemotron Super. No worries!

4 Likes

Rebuilt from main again and restarted my Spark (because it crashed due to the shutdown issue), and I’m getting better performance now. Not sure exactly what did it, but I’m definitely including this PR in the next run.

| model                                          |            test |              t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:-----------------------------------------------|----------------:|-----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |          pp2048 | 1868.93 ± 551.73 |              |   1240.96 ± 459.45 |   1231.45 ± 459.45 |   1241.11 ± 459.45 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |            tg32 |     15.27 ± 0.04 | 16.00 ± 0.00 |                    |                    |                    |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |  pp2048 @ d4096 | 1552.78 ± 814.54 |              |  6943.61 ± 5704.38 |  6934.10 ± 5704.38 |  6943.71 ± 5704.39 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |    tg32 @ d4096 |     15.17 ± 0.02 | 16.00 ± 0.00 |                    |                    |                    |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |  pp2048 @ d8192 |   2220.81 ± 6.41 |              |    4620.48 ± 13.30 |    4610.97 ± 13.30 |    4620.59 ± 13.30 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |    tg32 @ d8192 |     15.21 ± 0.08 | 16.00 ± 0.00 |                    |                    |                    |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 |   2179.33 ± 6.15 |              |    8467.23 ± 23.89 |    8457.72 ± 23.89 |    8467.32 ± 23.88 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |   tg32 @ d16384 |     15.27 ± 0.17 | 16.00 ± 0.00 |                    |                    |                    |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32078 |   2051.58 ± 6.14 |              |   16643.66 ± 49.85 |   16634.15 ± 49.85 |   16643.78 ± 49.88 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |   tg32 @ d32078 |     15.20 ± 0.06 | 16.00 ± 0.00 |                    |                    |                    |

llama-benchy (0.3.5)
date: 2026-03-30 14:42:37 | latency mode: api | pp basis: ttfr

5 Likes

BTW, the flashinfer autotuner can be disabled with --kernel_config '{"enable_flashinfer_autotune": false}'. Since it’s failing right now anyway, disabling it doesn’t affect performance in any meaningful way, but it eliminates the annoying error traces.
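In context, that just gets tacked onto the serve command (a sketch using the flag exactly as quoted; model name per the earlier posts):

```shell
# Silence the (currently failing) flashinfer autotune pass at startup.
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
    --kernel_config '{"enable_flashinfer_autotune": false}'
```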

1 Like

There are several PRs coming down the pipe to boost the cutlass/flashinfer kernels; that seems to be where all the development attention is going.

They all depend on one another, so my hat goes off to the courageous optimizers of all things CUDA.

3 Likes