PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM

Yeah, the stack is vLLM → FlashInfer → CUTLASS → TRT-LLM. To add to the complexity, it looks like NVIDIA works on CUTLASS in internal repositories, so what we see on GitHub doesn't tell the whole story.

Exciting stuff! Any idea how close we are to --speculative_config working with nemotron_h_mtp?

@eugr So far this recipe has been stable as nemotron-3-cascade-nvfp4.yaml in your spark-vllm-docker structure:

# Recipe: nemotron-3-cascade-nvfp4
# Nemotron-3-Nano model with NVFP4 quantization support
# Currently runs only in solo mode; cluster mode fails with an error

recipe_version: "1"
name: nemotron-3-cascade-nvfp4
description: vLLM serving nemotron-3-cascade-nvfp4 on a SINGLE NODE ONLY!

# HuggingFace model to download (optional, for --download-model)
model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4

# Container image to use
container: vllm-node-cas

# This model can only run on single node (solo)
solo_only: true

# Mods to apply
mods:
  - mods/nemotron-nano

# The vLLM serve command template
command: |
 vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 512 \
  --enable-prefix-caching \
  --max-cudagraph-capture-size 512 \
  --mamba-ssm-cache-dtype float32 \
  --reasoning-parser nemotron_v3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --port 8000 \
  --host 0.0.0.0 \
  --max-model-len 262144 \
  --load-format fastsafetensors 2>&1
1 Like

BTW, you can remove the mod - nemotron_v3 parser is included in vLLM.

2 Likes

Adding some llama.cpp KV cache quantization data as a comparison point for the vLLM/NVFP4 discussion.

I benchmarked --cache-type-k q4_0 --cache-type-v q4_0 vs q8_0 vs f16 on my Spark running Nemotron-3-Nano-30B-A3B at 128K context via llama.cpp b8399. Key finding: basic q4_0 KV cache is unusable at scale - prompt processing collapses 92.5% at 64K context (282.7 to 21.3 tok/s) due to software dequantization overhead.
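The collapse figure is straightforward to reproduce from the two throughput numbers; a minimal sketch (values copied from the benchmark above):

```python
# Prompt-processing collapse at 64K context with a basic q4_0 KV cache.
# 282.7 tok/s is the unquantized baseline, 21.3 tok/s is with
# --cache-type-k q4_0 --cache-type-v q4_0 (figures from the benchmark above).
baseline_pp = 282.7   # tok/s
q4_pp = 21.3          # tok/s

collapse_pct = (baseline_pp - q4_pp) / baseline_pp * 100
print(f"prompt processing drop: {collapse_pct:.1f}%")  # ~92.5%
```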

This reinforces why hardware-accelerated quantization (NVFP4 via TRT-LLM, or TurboQuant’s dequant-free approach) matters so much on Spark. The software dequantization path simply doesn’t scale with context length.

For anyone still on llama.cpp: --cache-type-k q8_0 --cache-type-v q8_0 is the only KV cache quantization worth running - stays within 5% of f16 speed at all context lengths.
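For reference, a hypothetical llama-server invocation with those flags; the model path and context size are placeholders, adjust for your setup:

```shell
# q8_0 KV cache: stays within ~5% of f16 speed per the benchmark above.
# Model filename and --ctx-size are examples, not from the original post.
llama-server \
  -m ./Nemotron-3-Nano-30B-A3B.gguf \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```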

Happy to share the full benchmark writeup with tables and methodology if anyone is interested.

1 Like

We have it internally. It will be a public PR soon.

11 Likes

Noob question: if I am running an LLM that is 30B parameters, why is it filling up 118 GiB of memory? What flag or combination of flags in the recipe above causes this?

To my knowledge, vLLM pre-allocates memory, but you can limit it with --gpu-memory-utilization.

Try dialing it down to --gpu-memory-utilization 0.50. I can't do the math on exactly how much that model needs, but if you see insufficient-memory errors, dial it back up a bit until it works.

1 Like

gpu-memory-utilization is 0.9 by default, which corresponds roughly to the 118 GiB target budget you're seeing. Lower it and the budget should drop accordingly.

The other posters are correct. vLLM will allocate all memory up to gpu-memory-utilization setting which is 0.9 by default (and 0.7 in most of our recipes). It does it to maximize KV cache for better concurrency.
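A rough sketch of the arithmetic (illustrative only; the 118 GiB figure is the one reported above, and the total-memory estimate is simply back-derived from it):

```python
# How --gpu-memory-utilization scales vLLM's pre-allocated budget.
# 118 GiB at the 0.9 default is the figure from this thread; the
# total GPU-visible memory below is back-derived from it, not a spec.
target_at_090 = 118.0                 # GiB observed at utilization 0.9
total_visible = target_at_090 / 0.9   # ~131 GiB implied GPU-visible memory

def budget(utilization: float) -> float:
    """Memory vLLM pre-allocates (weights + activations + KV cache)."""
    return total_visible * utilization

print(f"at 0.9: {budget(0.9):.1f} GiB")  # ~118.0 GiB
print(f"at 0.5: {budget(0.5):.1f} GiB")  # ~65.6 GiB
```

Everything left over after weights and activations goes to the KV cache, which is why a 30B model can still "fill" most of the machine.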

2 Likes

Hi @johnny_nv ,

Many thanks for all the good work on the NVFP4 implementation and supporting the DGX Spark community.
I am looking forward to the improvement PR.

I wonder, can you tell us when about or if we will see the following NVFP4 functionalities also in the DGX Spark?

1. Functionality: NVFP4 KV Cache
Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

2. Functionality: NVFP4 Training and Inference
3 Ways NVFP4 Accelerates AI Training and Inference

Many thanks for your update in advance!

2 Likes

Regarding the KV cache, there is an open PR on vLLM; the same goes for TurboQuant.
Regarding training: right now I am focused on solving issues and getting feedback from the community on the inference side.

9 Likes

OK, with the latest PRs, autotuner finally skips kernels that cause errors on Spark:

(EngineCore pid=94) INFO 04-01 17:56:55 [monitor.py:48] torch.compile took 18.89 s in total
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=512: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=256 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=512: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=256 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=512: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=256 N=128 K_elem=256: exceeds SMEM limit
(EngineCore pid=94) INFO 04-01 17:56:58 [monitor.py:76] Initial profiling/warmup run took 3.26 s
(EngineCore pid=94) INFO 04-01 17:56:59 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/d06b05e56b/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=94) INFO 04-01 17:56:59 [backends.py:1111] Dynamo bytecode transform time: 0.57 s
(EngineCore pid=94) INFO 04-01 17:57:03 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 3.64 s
(EngineCore pid=94) INFO 04-01 17:57:03 [backends.py:895] collected artifacts: 2 entries, 2 artifacts, 6139512 bytes total
(EngineCore pid=94) INFO 04-01 17:57:03 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/16a2addbe5f6da1f676427646696043a9dfc341085983d55745e70ed286b56b2/rank_0_0/model
(EngineCore pid=94) INFO 04-01 17:57:03 [monitor.py:48] torch.compile took 4.32 s in total
[TensorRT-LLM][DEBUG] SM121: skipping tile M=128 N=128 K=64 stages=4: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping tile M=128 N=128 K=64 stages=4: exceeds SMEM limit
(EngineCore pid=94) INFO 04-01 17:57:03 [monitor.py:76] Initial profiling/warmup run took 0.53 s
(EngineCore pid=94) WARNING 04-01 17:57:08 [kv_cache_utils.py:1059] Add 5 padding layers, may waste at most 12.50% KV cache memory
(EngineCore pid=94) INFO 04-01 17:57:08 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=120
(EngineCore pid=94) WARNING 04-01 17:57:08 [gpu_model_runner.py:6377] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
(EngineCore pid=94) INFO 04-01 17:57:08 [gpu_model_runner.py:5881] Profiling CUDA graph memory: PIECEWISE=18 (largest=120)
(EngineCore pid=94) INFO 04-01 17:57:11 [gpu_model_runner.py:5960] Estimated CUDA graph memory: 0.33 GiB total
(EngineCore pid=94) INFO 04-01 17:57:11 [gpu_worker.py:436] Available KV cache memory: 18.04 GiB
(EngineCore pid=94) INFO 04-01 17:57:11 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.8000 to 0.8027 to maintain the same effective KV cache size.
(EngineCore pid=94) WARNING 04-01 17:57:11 [kv_cache_utils.py:1059] Add 5 padding layers, may waste at most 12.50% KV cache memory
(EngineCore pid=94) INFO 04-01 17:57:11 [kv_cache_utils.py:1319] GPU KV cache size: 694,656 tokens
(EngineCore pid=94) INFO 04-01 17:57:11 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 2.63x
(EngineCore pid=94) 2026-04-01 17:57:12,303 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=94) cudnn_handle created for device_id = 0
(EngineCore pid=94)
(EngineCore pid=94) 2026-04-01 17:57:14,490 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:14,555 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:14,661 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:14,838 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:15,120 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:15,538 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:16,048 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:25,174 - INFO - autotuner.py:464 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████| 18/18 [00:03<00:00,  5.38it/s]
(EngineCore pid=94) INFO 04-01 17:57:29 [gpu_model_runner.py:6051] Graph capturing finished in 4 secs, took -0.10 GiB
(EngineCore pid=94) INFO 04-01 17:57:29 [gpu_worker.py:597] CUDA graph pool memory: -0.1 GiB (actual), 0.33 GiB (estimated), difference: 0.43 GiB (46398668800.0%).
(EngineCore pid=94) INFO 04-01 17:57:29 [core.py:283] init engine (profile, create kv cache, warmup model) took 54.93 seconds

This is a part of my latest nightly build, btw.

6 Likes

@johnny_nv Usually when someone says my name it's to yell at me (lol). My mom just had surgery, so I've been taking care of her out of town and haven't been on the forums much. I've updated the mentioned PR and stripped out the k=64 tiles, since I couldn't show that they improve performance at all. There's just a basic fix that I think resolves the issue on k=128 tiles.

1 Like

Built Madreag’s turbo3-cuda fork (release/cuda-optimized branch) on my DGX Spark. First SM 121 turbo3/turbo4 data.

Finding: turbo3/turbo4 are slower than f16 on GB10 unified memory.

Token generation (tg32) with Nemotron-3-Nano-30B-A3B Q4_K_XL (tok/s):

| Depth | f16 | turbo4 | turbo3 |
|------:|------:|------:|------:|
| 0 | 45.21 | 44.06 | 43.66 |
| 8192 | 43.37 | 39.49 | 40.60 |
| 32768 | 41.61 | 31.81 | 32.09 |

Up to -23.6% at 32K context. Dequantization overhead dominates when you have 128GB unified memory and no VRAM pressure. Consistent with eugr’s earlier KV cache findings on GB10.
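A quick sanity check on that -23.6% figure, using the 32768-depth tg32 numbers above:

```python
# Relative slowdown of turbo4 vs f16 KV cache at 32K context depth
# (tg32 figures from the benchmark above).
f16_32k = 41.61      # tok/s, f16 at depth 32768
turbo4_32k = 31.81   # tok/s, turbo4 at depth 32768

delta_pct = (turbo4_32k - f16_32k) / f16_32k * 100
print(f"turbo4 vs f16 at 32K: {delta_pct:.1f}%")  # ~-23.6%
```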

Recommendation for Spark users: stick with f16 KV cache.

Full data and reproduction steps available - happy to share.

OpenClaw / Nemotron-3-Super is why I purchased the DGX Spark, but the 3-minute initial load time, 1-3 minute response times, 15 t/s, and the crashes every few hours are infuriating at this point.

I do hope the team working on this sees it as unacceptable, as it is a far cry from the advertised 1 PFLOP & 100 tokens/sec NVFP4 performance … @aniculescu

Here, let me fix that for you, NVIDIA: "We think it can do 1 PFLOP & 100 tokens/sec NVFP4 performance, but you have to figure that out for yourself, and if you do, let us know and we'll happily take the credit for it."

I nominate that the folks who have stuck around, like @eugr, and earned the Spark Expert badge should be given full access and a stack of Sparks, as they are essentially doing NVIDIA's job for them :-(

ok, now back to my coffee.

8 Likes

The 1 PFLOP is more about the prefill stage than decode. I've put more hope into boosting prefill rather than significantly speeding up decode/token generation, since decode depends mostly on memory bandwidth (the Spark has enough horsepower to dequantize/compute with at most a ~20% penalty at decode). So with 12B active parameters for Nemotron 3 Super, the most we should get is 220 GB/s memory bandwidth / 12 GB (12B weights @ 8-bit) = 18.3 TPS. I know the spec says it should be 278 GB/s and that we should be pushing only 4 bits per weight through memory, but that is not the case without the newest drivers/CUDA plus crazy software approximation of the missing hardware instruction.

OpenClaw and other long-lived sessions tend to accumulate 100k+ tokens, and if you only get 2k TPS prefill you have to wait 50 s for it to pass through when you don't have enough KV cache for the prefix. Actual output for agents is rarely more than 1k tokens (which is a lot: 66 seconds @ 15 TPS), but mostly in the range of 100 tokens.
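The back-of-envelope arithmetic in the two paragraphs above can be sketched as follows (all inputs are the poster's assumed figures, not measured specs):

```python
# Decode ceiling from memory bandwidth, and prefill wait for a long
# agent context. All figures are assumptions quoted from the post above.
bandwidth_gbps = 220.0    # assumed effective memory bandwidth, GB/s
active_weights_gb = 12.0  # 12B active params at 8-bit ~= 12 GB read per token

decode_ceiling = bandwidth_gbps / active_weights_gb
print(f"max decode: {decode_ceiling:.1f} tok/s")  # ~18.3

context_tokens = 100_000  # typical long OpenClaw session
prefill_tps = 2_000       # assumed prefill throughput
print(f"prefill wait: {context_tokens / prefill_tps:.0f} s")  # 50 s
```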

I run MiniMax on 2 Sparks with CUTLASS + a RoCE 200G link, and at the moment get ~380 TPS prefill:
nvidia/MiniMax-M2.5-NVFP4 | pp128000 | 378.87 ± 1.58
nvidia/MiniMax-M2.5-NVFP4 | tg8096 | 12.48 ± 0.03
which is really painful :(

2 Likes

Thank you, and I understand most of that.

Would be great if there were a way to toggle output in OpenClaw so it only uses prefill tokens to execute tasks and doesn't generate display tokens until a final outcome is needed. Or possibly run tasks in parallel, since up to 32 concurrent requests seem to lose no tg/s generation speed.

Since this is dedicated hardware for OpenClaw on my end, and the OS seems very locked down (restricted sudo, custom OS), would it be faster to just install from source and strip out all of the container layers and restrictions?

I don't fully understand the overall bottleneck yet, but there has to be a way to get past it somehow.

vLLM handles prefix caching very well, so incremental contexts usually aren't a huge pain for a single session.

I don’t know how OpenClaw manages the context (compaction vs rolling-window), but with the right setup and right model, you shouldn’t suffer too much.

Have you tried the Qwen 3.5 122B int4-autoround in that context? I’m pretty confident it would be a good model for OpenClaw, since it’s quite competent for agentic coding workflows.

4 Likes

I'm a little impatient and prefer responses at least at human reading speed (>39 tg/s), and Qwen3.5-35B-A3B-FP8 is in the 40s. The larger 120-ish-B models all seem to land around 14-28 tg/s, with a massive prefill/load hit as well, i.e. hit send and then take a break to come back minutes later…

Would be great if the model makers started a new standard in the 70B range for a nice balance.

Will give it a try though, as well as the new Gemma models.

1 Like