PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM

Yeah, the stack is vLLM → FlashInfer → CUTLASS → TRT-LLM. To add to the complexity, it looks like NVIDIA works on CUTLASS in internal repositories, so what we see on GitHub doesn't tell the whole story.

Exciting stuff! Any idea how close we are to --speculative_config working with nemotron_h_mtp?

@eugr So far this recipe has been stable as nemotron-3-cascade-nvfp4.yaml in your spark-vllm-docker structure:

# Recipe: nemotron-3-cascade-nvfp4
# Nemotron-3-Nano model with NVFP4 quantization support
# Currently runs only in solo mode; cluster mode fails with an error

recipe_version: "1"
name: nemotron-3-cascade-nvfp4
description: vLLM serving nemotron-3-cascade-nvfp4 on a SINGLE NODE ONLY!

# HuggingFace model to download (optional, for --download-model)
model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4

# Container image to use
container: vllm-node-cas

# This model can only run on single node (solo)
solo_only: true

# Mods to apply
mods:
  - mods/nemotron-nano

# The vLLM serve command template
command: |
 vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 512 \
  --enable-prefix-caching \
  --max-cudagraph-capture-size 512 \
  --mamba-ssm-cache-dtype float32 \
  --reasoning-parser nemotron_v3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --port 8000 \
  --host 0.0.0.0 \
  --max-model-len 262144 \
  --load-format fastsafetensors 2>&1
1 Like

BTW, you can remove the mod - nemotron_v3 parser is included in vLLM.

2 Likes

Adding some llama.cpp KV cache quantization data as a comparison point for the vLLM/NVFP4 discussion.

I benchmarked --cache-type-k q4_0 --cache-type-v q4_0 vs q8_0 vs f16 on my Spark running Nemotron-3-Nano-30B-A3B at 128K context via llama.cpp b8399. Key finding: basic q4_0 KV cache is unusable at scale - prompt processing collapses 92.5% at 64K context (282.7 to 21.3 tok/s) due to software dequantization overhead.
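The collapse figure is straightforward to reproduce from the two throughput numbers; a minimal sketch (values copied from the benchmark above):

```python
# Prompt-processing collapse at 64K context with a basic q4_0 KV cache.
# 282.7 tok/s is the unquantized baseline, 21.3 tok/s is with
# --cache-type-k q4_0 --cache-type-v q4_0 (figures from the benchmark above).
baseline_pp = 282.7   # tok/s
q4_pp = 21.3          # tok/s

collapse_pct = (baseline_pp - q4_pp) / baseline_pp * 100
print(f"prompt processing drop: {collapse_pct:.1f}%")  # ~92.5%
```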

This reinforces why hardware-accelerated quantization (NVFP4 via TRT-LLM, or TurboQuant’s dequant-free approach) matters so much on Spark. The software dequantization path simply doesn’t scale with context length.

For anyone still on llama.cpp: --cache-type-k q8_0 --cache-type-v q8_0 is the only KV cache quantization worth running - stays within 5% of f16 speed at all context lengths.
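For reference, a hypothetical llama-server invocation with those flags; the model path and context size are placeholders, adjust for your setup:

```shell
# q8_0 KV cache: stays within ~5% of f16 speed per the benchmark above.
# Model filename and --ctx-size are examples, not from the original post.
llama-server \
  -m ./Nemotron-3-Nano-30B-A3B.gguf \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```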

Happy to share the full benchmark writeup with tables and methodology if anyone is interested.

1 Like

We have it internally. It will be a public PR soon.

11 Likes

Noob question: if I am running an LLM that is 30B parameters, why is it filling up 118 GiB of memory? What flag or combination of flags in the recipe above causes this?

To my knowledge, vLLM pre-allocates memory, but you can limit it with --gpu-memory-utilization.

Try dialing it down to --gpu-memory-utilization 0.50. I can't do the math on exactly how much that model needs, but if you see insufficient-memory errors, dial it back up a bit until it works.

1 Like

gpu-memory-utilization is 0.9 by default, which corresponds roughly to the 118 GiB target budget you're seeing. Lower it and the budget should drop accordingly.

The other posters are correct. vLLM will allocate all memory up to gpu-memory-utilization setting which is 0.9 by default (and 0.7 in most of our recipes). It does it to maximize KV cache for better concurrency.
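A rough sketch of the arithmetic (illustrative only; the 118 GiB figure is the one reported above, and the total-memory estimate is simply back-derived from it):

```python
# How --gpu-memory-utilization scales vLLM's pre-allocated budget.
# 118 GiB at the 0.9 default is the figure from this thread; the
# total GPU-visible memory below is back-derived from it, not a spec.
target_at_090 = 118.0                 # GiB observed at utilization 0.9
total_visible = target_at_090 / 0.9   # ~131 GiB implied GPU-visible memory

def budget(utilization: float) -> float:
    """Memory vLLM pre-allocates (weights + activations + KV cache)."""
    return total_visible * utilization

print(f"at 0.9: {budget(0.9):.1f} GiB")  # ~118.0 GiB
print(f"at 0.5: {budget(0.5):.1f} GiB")  # ~65.6 GiB
```

Everything left over after weights and activations goes to the KV cache, which is why a 30B model can still "fill" most of the machine.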

2 Likes

Hi @johnny_nv ,

Many thanks for all the good work on the NVFP4 implementation and supporting the DGX Spark community.
I am looking forward to the improvement PR.

I wonder, can you tell us when about or if we will see the following NVFP4 functionalities also in the DGX Spark?

1. Functionality: NVFP4 KV Cache
Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

2. Functionality: NVFP4 Training and Inference
3 Ways NVFP4 Accelerates AI Training and Inference

Many thanks for your update in advance!

2 Likes

Regarding the KV cache, there is an open PR on vLLM; the same goes for TurboQuant.
Regarding training: right now I am focused on solving issues and getting feedback from the community on the inference side.

9 Likes

OK, with the latest PRs, autotuner finally skips kernels that cause errors on Spark:

(EngineCore pid=94) INFO 04-01 17:56:55 [monitor.py:48] torch.compile took 18.89 s in total
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=512: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=256 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=512: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=256 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=256: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=128 N=128 K_elem=512: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping SM120 FP4 tile M=256 N=128 K_elem=256: exceeds SMEM limit
(EngineCore pid=94) INFO 04-01 17:56:58 [monitor.py:76] Initial profiling/warmup run took 3.26 s
(EngineCore pid=94) INFO 04-01 17:56:59 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/d06b05e56b/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=94) INFO 04-01 17:56:59 [backends.py:1111] Dynamo bytecode transform time: 0.57 s
(EngineCore pid=94) INFO 04-01 17:57:03 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 3.64 s
(EngineCore pid=94) INFO 04-01 17:57:03 [backends.py:895] collected artifacts: 2 entries, 2 artifacts, 6139512 bytes total
(EngineCore pid=94) INFO 04-01 17:57:03 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/16a2addbe5f6da1f676427646696043a9dfc341085983d55745e70ed286b56b2/rank_0_0/model
(EngineCore pid=94) INFO 04-01 17:57:03 [monitor.py:48] torch.compile took 4.32 s in total
[TensorRT-LLM][DEBUG] SM121: skipping tile M=128 N=128 K=64 stages=4: exceeds SMEM limit
[TensorRT-LLM][DEBUG] SM121: skipping tile M=128 N=128 K=64 stages=4: exceeds SMEM limit
(EngineCore pid=94) INFO 04-01 17:57:03 [monitor.py:76] Initial profiling/warmup run took 0.53 s
(EngineCore pid=94) WARNING 04-01 17:57:08 [kv_cache_utils.py:1059] Add 5 padding layers, may waste at most 12.50% KV cache memory
(EngineCore pid=94) INFO 04-01 17:57:08 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=120
(EngineCore pid=94) WARNING 04-01 17:57:08 [gpu_model_runner.py:6377] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
(EngineCore pid=94) INFO 04-01 17:57:08 [gpu_model_runner.py:5881] Profiling CUDA graph memory: PIECEWISE=18 (largest=120)
(EngineCore pid=94) INFO 04-01 17:57:11 [gpu_model_runner.py:5960] Estimated CUDA graph memory: 0.33 GiB total
(EngineCore pid=94) INFO 04-01 17:57:11 [gpu_worker.py:436] Available KV cache memory: 18.04 GiB
(EngineCore pid=94) INFO 04-01 17:57:11 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.8000 to 0.8027 to maintain the same effective KV cache size.
(EngineCore pid=94) WARNING 04-01 17:57:11 [kv_cache_utils.py:1059] Add 5 padding layers, may waste at most 12.50% KV cache memory
(EngineCore pid=94) INFO 04-01 17:57:11 [kv_cache_utils.py:1319] GPU KV cache size: 694,656 tokens
(EngineCore pid=94) INFO 04-01 17:57:11 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 2.63x
(EngineCore pid=94) 2026-04-01 17:57:12,303 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=94) cudnn_handle created for device_id = 0
(EngineCore pid=94)
(EngineCore pid=94) 2026-04-01 17:57:14,490 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:14,555 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:14,661 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:14,838 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:15,120 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:15,538 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:16,048 - INFO - autotuner.py:817 - flashinfer.jit: [Autotuner]: Skipped 2 unsupported tactic(s) for trtllm::fused_moe::gemm2 (enable debug logs to see details)
(EngineCore pid=94) 2026-04-01 17:57:25,174 - INFO - autotuner.py:464 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████| 18/18 [00:03<00:00,  5.38it/s]
(EngineCore pid=94) INFO 04-01 17:57:29 [gpu_model_runner.py:6051] Graph capturing finished in 4 secs, took -0.10 GiB
(EngineCore pid=94) INFO 04-01 17:57:29 [gpu_worker.py:597] CUDA graph pool memory: -0.1 GiB (actual), 0.33 GiB (estimated), difference: 0.43 GiB (46398668800.0%).
(EngineCore pid=94) INFO 04-01 17:57:29 [core.py:283] init engine (profile, create kv cache, warmup model) took 54.93 seconds

This is a part of my latest nightly build, btw.

6 Likes

@johnny_nv Usually when someone says my name it's to yell at me (lol). My mom just had surgery, so I've been taking care of her out of town and haven't been on the forums much. I've updated the mentioned PR and stripped out the k=64 tiles, since I couldn't show that they improve performance at all. There's just a basic fix that I think resolves the issue on k=128 tiles.

1 Like

Built Madreag’s turbo3-cuda fork (release/cuda-optimized branch) on my DGX Spark. First SM 121 turbo3/turbo4 data.

Finding: turbo3/turbo4 are slower than f16 on GB10 unified memory.

Token generation (tg32) with Nemotron-3-Nano-30B-A3B Q4_K_XL (tok/s):

| Depth | f16 | turbo4 | turbo3 |
|------:|------:|------:|------:|
| 0 | 45.21 | 44.06 | 43.66 |
| 8192 | 43.37 | 39.49 | 40.60 |
| 32768 | 41.61 | 31.81 | 32.09 |

Up to -23.6% at 32K context. Dequantization overhead dominates when you have 128GB unified memory and no VRAM pressure. Consistent with eugr’s earlier KV cache findings on GB10.
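A quick sanity check on that -23.6% figure, using the 32768-depth tg32 numbers above:

```python
# Relative slowdown of turbo4 vs f16 KV cache at 32K context depth
# (tg32 figures from the benchmark above).
f16_32k = 41.61      # tok/s, f16 at depth 32768
turbo4_32k = 31.81   # tok/s, turbo4 at depth 32768

delta_pct = (turbo4_32k - f16_32k) / f16_32k * 100
print(f"turbo4 vs f16 at 32K: {delta_pct:.1f}%")  # ~-23.6%
```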

Recommendation for Spark users: stick with f16 KV cache.

Full data and reproduction steps available - happy to share.

OpenClaw / Nemotron-3-Super is why I purchased the DGX Spark, but the 3-minute initial load time, 1-3 minute response times, 15 t/s, and the crashes every few hours are infuriating at this point.

I do hope the team working on this sees it as unacceptable, as it is a far cry from the advertised 1 PFLOP & 100 tokens/sec NVFP4 performance … @aniculescu

Here, let me fix that for you, NVIDIA: "We think it can do 1 PFLOP & 100 tokens/sec NVFP4 performance, but you have to figure that out for yourself, and if you do, let us know and we'll happily take the credit for it."

I nominate that the folks who have stuck around, like @eugr, and earned the Spark Expert badge should be given full access and a stack of Sparks, as they are essentially doing NVIDIA's job for them :-(

ok, now back to my coffee.

8 Likes

The 1 PFLOP is more about the prefill stage than decode. I've put more hope into boosting prefill rather than significantly speeding up decode/token generation, since decode depends mostly on memory bandwidth (the Spark has enough horsepower to dequantize/compute with at most a ~20% penalty at decode). So with 12B active parameters for Nemotron 3 Super, the most we should get is 220 GB/s memory bandwidth / 12 GB (12B weights @ 8-bit) = 18.3 TPS. I know the spec says it should be 278 GB/s and that we should be pushing only 4 bits per weight through memory, but that is not the case without the newest drivers/CUDA plus crazy software approximation of the missing hardware instruction.

OpenClaw and other long-lived sessions tend to accumulate 100k+ tokens, and if you only get 2k TPS prefill you have to wait 50 s for it to pass through when you don't have enough KV cache for the prefix. Actual output for agents is rarely more than 1k tokens (which is a lot: 66 seconds @ 15 TPS), but mostly in the range of 100 tokens.
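The back-of-envelope arithmetic in the two paragraphs above can be sketched as follows (all inputs are the poster's assumed figures, not measured specs):

```python
# Decode ceiling from memory bandwidth, and prefill wait for a long
# agent context. All figures are assumptions quoted from the post above.
bandwidth_gbps = 220.0    # assumed effective memory bandwidth, GB/s
active_weights_gb = 12.0  # 12B active params at 8-bit ~= 12 GB read per token

decode_ceiling = bandwidth_gbps / active_weights_gb
print(f"max decode: {decode_ceiling:.1f} tok/s")  # ~18.3

context_tokens = 100_000  # typical long OpenClaw session
prefill_tps = 2_000       # assumed prefill throughput
print(f"prefill wait: {context_tokens / prefill_tps:.0f} s")  # 50 s
```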

I run MiniMax on 2 Sparks with CUTLASS + a RoCE 200G link, and at the moment get ~380 TPS prefill:
nvidia/MiniMax-M2.5-NVFP4 | pp128000 | 378.87 ± 1.58
nvidia/MiniMax-M2.5-NVFP4 | tg8096 | 12.48 ± 0.03
which is really painful :(

2 Likes

Thank you, and I understand most of that.

Would be great if there were a way to toggle output in OpenClaw so it only uses prefill tokens to execute tasks and doesn't generate display tokens until a final outcome is needed. Or possibly run tasks in parallel, since up to 32 concurrent requests seem to lose no tg/s generation speed.

Since this is dedicated hardware for OpenClaw on my end, and the OS seems very locked down (restricted sudo, custom OS), would it be faster to just install from source and strip out all of the container layers and restrictions?

I don't fully understand the overall bottleneck yet, but there has to be a way to get past it somehow.

vLLM handles prefix caching very well, so incremental contexts usually aren't a huge pain for a single session.

I don’t know how OpenClaw manages the context (compaction vs rolling-window), but with the right setup and right model, you shouldn’t suffer too much.

Have you tried the Qwen 3.5 122B int4-autoround in that context? I’m pretty confident it would be a good model for OpenClaw, since it’s quite competent for agentic coding workflows.

4 Likes

I'm a little impatient and prefer responses at least at human reading speed (>39 tg/s), and Qwen3.5-35B-A3B-FP8 is in the 40s. The larger 120-ish-B models all seem to land around 14-28 tg/s, with a massive prefill/load hit as well, i.e. hit send and then take a break to come back minutes later…

Would be great if the model makers started a new standard in the 70B range for a nice balance.

Will give it a try though, as well as the new Gemma models.

1 Like