FWIW:
=== GPQA Diamond ===
base_url: http://gb10:8000/v1
model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4
questions: 198
repeats: 5
total eval calls: 990
score (all repeats): 0.7545 (75.45%)
correct / total: 747 / 990
failed requests: 0
prompt tokens total: 272,511
completion tokens total: 16,691,185
total tokens: 16,963,696
avg tokens / call: 17135.0
wall time (s): 18433.3
For reference, the full-precision weights score 76.1 according to the model card (nvidia/Nemotron-Cascade-2-30B-A3B on Hugging Face).
Honestly very cool :)
5 hours and 17 million tokens later, stability and accuracy :)
Which PR do we use to build the image? There are several PRs. Thanks.
If you want to build as normal: clone FlashInfer and vLLM from source, then merge these two PRs into their respective main branches:
main ← johnnynunez:main
opened 10:48PM - 29 Mar 26 UTC
### Summary
- Add missing `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` compile flag to all CUTLASS fused MoE JIT modules (SM100/SM103/SM120) and `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to SM90 modules
- Sync nv_internal `grid_dependency_control.h` with upstream CUTLASS to support SM100/SM103/SM110/SM120/SM121 GDC
- Add `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to FP8 blockscale GEMM SM90 module
### Problem
Random `cudaErrorIllegalInstruction` crashes on DGX Spark (SM121) and RTX 50-series (SM120) when running NVFP4 MoE models (e.g., Nemotron, Qwen3.5-122B) under load. The crashes are intermittent and worsen with longer context lengths and higher concurrency.
**Root cause:** PR #2780 fixed the missing GDC compile flags for GEMM modules (`flashinfer/jit/gemm/core.py`), but the **CUTLASS fused MoE modules** in `flashinfer/jit/fused_moe.py` and the **FP8 blockscale GEMM module** were not fixed. This is the exact same class of bug as #2708.
Without `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`, CUTLASS's `grid_dependency_control.h` compiles `wait_on_dependent_grids()` and `launch_dependent_grids()` as **empty no-ops**:
```cpp
CUTLASS_DEVICE void wait_on_dependent_grids() {
#if (defined(CUTLASS_GDC_ENABLED)) // ← not defined without the flag
asm volatile("griddepcontrol.wait;");
#endif
}
```
Meanwhile, the host-side code still sets `programmaticStreamSerializationAllowed = true` (PDL enabled) via `device_support_pdl()`, which returns `True` for all `major >= 9`, including SM12x. This means:
1. **Host enables PDL** → CUDA runtime overlaps consecutive kernels
2. **Device GDC barriers are no-ops** → No synchronization between overlapping kernels
3. **Race condition** → Dependent kernel reads stale global memory → corruption → `cudaErrorIllegalInstruction`
The crash is random because it depends on exact kernel scheduling timing, which varies per request.
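The mechanics above also explain the stopgap people were using until a fixed build was available: serializing kernel launches hides the race entirely (the test plan below confirms the workaround is no longer needed after the fix). A minimal sketch:

```shell
# Stopgap before the GDC fix: CUDA_LAUNCH_BLOCKING=1 makes every kernel
# launch synchronous, so PDL overlap (and hence the race) cannot occur.
# It costs significant throughput; remove it once you run a fixed build.
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
# then launch the server in this environment, e.g.:
# vllm serve <model> ...
```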
### Fix
**`flashinfer/jit/fused_moe.py`** — Added GDC flags to all CUTLASS fused MoE modules:
| Module | Flag | Architectures Covered |
|---|---|---|
| `gen_cutlass_fused_moe_sm120_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` | SM120, SM121 |
| `gen_cutlass_fused_moe_sm103_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` | SM103, SM120, SM121 |
| `gen_cutlass_fused_moe_sm100_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` | SM100, SM110, SM120, SM121 |
| `gen_cutlass_fused_moe_sm90_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` | SM90 |
| `gen_trtllm_gen_fused_moe_sm100_module()` | `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` | SM100+, SM120, SM121 |
**`flashinfer/jit/gemm/fp8_blockscale.py`** — Added `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to `gen_fp8_blockscale_gemm_sm90_module()`.
**`csrc/nv_internal/.../grid_dependency_control.h`** — Synced with upstream CUTLASS (`3rdparty/cutlass/include/cutlass/arch/grid_dependency_control.h`) to add SM100+ GDC support. Previously only handled SM90, so any nv_internal TensorRT-LLM code compiled for SM12x would have GDC barriers silently compiled as no-ops.
### Why `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` covers SM12x
CUTLASS uses a single flag for the entire Blackwell family. From `grid_dependency_control.h`:
```cpp
#if(CUDA_BARRIER_ENABLED && defined(CUTLASS_ENABLE_GDC_FOR_SM100) && defined(__CUDA_ARCH__) && \
((__CUDA_ARCH__ == 1000 && ...) || // SM100
(__CUDA_ARCH__ == 1030 && ...) || // SM103
(__CUDA_ARCH__ == 1100 && ...) || // SM110
(__CUDA_ARCH__ == 1200 && ...) || // SM120 (RTX 50-series)
(__CUDA_ARCH__ == 1210 && ...))) // SM121 (DGX Spark)
#define CUTLASS_GDC_ENABLED
```
### Why SM90 GDC flag was NOT added to SM100+ modules
PR #2716 attempted to add both `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` and `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` to all modules. It broke AOT builds because `sm120_gemm_tma_warpspecialized_cooperative_asymmetric_dma.hpp` checks `CUTLASS_ENABLE_GDC_FOR_SM90` and calls `scheduler.is_last_tile()` — a method not present on the SM120 scheduler. PR #2780 corrected this by using only the SM100 flag for SM100+ modules. This PR follows the same approach.
### Related
- #2708 — Original issue: missing GDC flags cause PDL race condition
- #2716 — First fix attempt (reverted — broke AOT)
- #2780 — Corrected fix for GEMM modules only
- [vllm-project/vllm#38423](https://github.com/vllm-project/vllm/pull/38423) — NVFP4 bugfix on DGX Spark
- [NVIDIA/cutlass#3121](https://github.com/NVIDIA/cutlass/pull/3121) — K=64 block-scaled GEMM tiles (separate issue)
### Test plan
- [x] Clear JIT cache: `rm -rf ~/.cache/flashinfer/`
- [x] Run NVFP4 MoE model on SM121 (DGX Spark) with 128K context under load — verify no `cudaErrorIllegalInstruction`
- [x] Run NVFP4 MoE model on SM120 (RTX 50-series) with concurrent requests — verify no NaN/garbage output
- [x] Verify `CUDA_LAUNCH_BLOCKING=1` workaround is no longer needed
- [x] AOT build with `FLASHINFER_CUDA_ARCH_LIST="12.1a"` completes without errors
- [x] SM90 (Hopper) fused MoE tests pass: `pytest tests/moe/`
- [x] SM100 GEMM tests still pass (no regression from existing GDC flags)
## Summary by CodeRabbit
* **New Features**
* Expanded GPU kernel compilation support: enabled additional optimizations for NVIDIA SM100 and SM90 GPUs, activating dependency-control optimizations where available.
* Updated JIT/GEMM build configs to include these architecture-specific compile options, improving performance and compatibility on supported hardware.
main ← johnnynunez:main
opened 07:58AM - 28 Mar 26 UTC
## Summary
Fix `cudaErrorIllegalInstruction` when running NVFP4 models (e.g. `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`) on SM12x GPUs (RTX 50 series SM120, DGX Spark SM121).
### Root causes
1. **CUTLASS v4.2.2 lacks SM12x NVFP4 tile constraints** — The bundled CUTLASS was missing SM120f family-level compilation support for NVFP4/MX Grouped GEMM and SM121-specific tile configurations (DGX Spark). This caused `IllegalInstruction` during decode when small-M tile variants were selected. Related upstream: [NVIDIA/cutlass#3038](https://github.com/NVIDIA/cutlass/pull/3038).
2. **FlashInfer 0.6.6 bundles CUTLASS 4.2.1** — The FlashInfer CUTLASS MoE backend failed on SM12x with `Failed to initialize cutlass TMA WS grouped gemm` due to the same missing tile constraints. Fixed upstream in [flashinfer-ai/flashinfer#2798](https://github.com/flashinfer-ai/flashinfer/pull/2798).
3. **`cutlass_scaled_mm_supports_fp4()` reported false availability** — Only checked CUDA runtime version (`>= 12080`), not whether the SM-specific kernel was actually compiled. On a build with only `ENABLE_NVFP4_SM100`, it incorrectly reported CUTLASS as available for SM12x, then failed at dispatch.
4. **Quantization kernels had no SM runtime guard** — The `scaled_fp4_quant`, `silu_and_mul_nvfp4_quant`, and expert quant entry points dispatched to `_sm1xxa` kernels if *any* SM1xx was compiled, with no runtime check. If only SM100 SASS existed, CUDA would JIT-compile SM100 PTX for SM120 (different major arch), producing illegal instructions asynchronously — surfacing later at `synchronize()` as an opaque CUDA error.
5. **FlashInfer CUTLASS backend bypassed quant kernel checks** — `select_nvfp4_linear_backend()` selected FlashInfer CUTLASS solely on `has_device_capability(100)`, without verifying the vLLM quantization kernels (used by all non-Marlin backends) were compiled for the current SM.
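Root cause (4) hinges on a major-architecture mismatch. A toy shell illustration of the guard's logic (all values hypothetical; on a real box the device value comes from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`):

```shell
# Toy illustration of the runtime guard: a kernel compiled only for SM100
# must not be dispatched on an SM12x device (different major architecture),
# even though CUDA would happily JIT the SM100 PTX for it.
device_cap=121        # hypothetical: DGX Spark reports compute capability 12.1
compiled_caps="100"   # hypothetical: the build contains only SM100 kernels
supported=no
for c in $compiled_caps; do
  # native SASS requires a matching major architecture (here 10 vs 12)
  [ $((c / 10)) -eq $((device_cap / 10)) ] && supported=yes
done
echo "native kernel supported: $supported"
```

With these hypothetical values the check reports `no`, so the backend should fall back (e.g. to Marlin) instead of dispatching.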
### Changes
| File | Change |
|---|---|
| `CMakeLists.txt` | Bump CUTLASS from v4.2.2 to **v4.4.2** — enables SM120f (family) compilation for NVFP4/MX Grouped GEMM, covering RTX 50 (SM120) and DGX Spark (SM121) |
| `docker/Dockerfile` | Bump FlashInfer from 0.6.6 to **0.6.7** (includes CUTLASS 4.4.2, fixes TMA grouped GEMM on SM12x) |
| `docker/Dockerfile.nightly_torch` | Same FlashInfer bump (source build) |
| `docker/versions.json` | `FLASHINFER_VERSION`: `0.6.6` → `0.6.7` |
| `nvfp4_scaled_mm_entry.cu` | `cutlass_scaled_mm_supports_fp4()` now checks compile-time `ENABLE_NVFP4_SM100`/`ENABLE_NVFP4_SM120` guards per SM range instead of a blanket `>= 100` check |
| `nvfp4_quant_entry.cu` | Added `nvfp4_quant_sm_supported()` runtime guard to all four quant entry points (`scaled_fp4_quant`, `scaled_fp4_experts_quant`, `silu_and_mul_nvfp4_quant`, `silu_and_mul_scaled_fp4_experts_quant`) |
| `nvfp4_utils.py` | `select_nvfp4_linear_backend()` gates FlashInfer CUTLASS on `cutlass_fp4_supported()` + adds validation assert for all FlashInfer backends |
### What is NOT changed
**Marlin remains a valid fallback on SM12x.** Marlin FP4 uses weight-only dequantization to BF16 — it does not use native FP4 tensor core instructions and works correctly on all Blackwell architectures including DGX Spark. Benchmarks confirm Marlin is stable on SM121 (~558 tok/s, on par with vLLM CUTLASS at ~562 tok/s). The Marlin path (`apply_fp4_marlin_linear`) bypasses the vLLM quant kernels entirely, so the SM guards in `nvfp4_quant_entry.cu` do not affect it.
### Behavior on SM12x after this PR
| Scenario | Before | After |
|---|---|---|
| Build includes `ENABLE_NVFP4_SM120` + CUTLASS v4.4.2 | `IllegalInstruction` | Native CUTLASS backend selected, works correctly |
| Build lacks `ENABLE_NVFP4_SM120` | `IllegalInstruction` (SM100 PTX JIT to SM120) | Native CUTLASS correctly reports unavailable; **Marlin selected as fallback** — works correctly |
| FlashInfer CUTLASS MoE on SM12x | `Failed to initialize cutlass TMA WS grouped gemm` (CUTLASS 4.2.1 in FlashInfer 0.6.6) | Works correctly with FlashInfer 0.6.7 (CUTLASS 4.4.2) |
### Follow-up: FlashInfer 0.6.8
[flashinfer-ai/flashinfer#2738](https://github.com/flashinfer-ai/flashinfer/pull/2738) (merged March 28, 2026) adds native NVFP4 and MXFP4 group GEMM support for SM120/SM121 (RTX 50 / DGX Spark) directly in FlashInfer. This will land in FlashInfer **0.6.8**. Once released, `FLASHINFER_VERSION` should be bumped in `docker/Dockerfile`, `docker/Dockerfile.nightly_torch`, and `docker/versions.json` to unlock FlashInfer's own SM12x NVFP4/MXFP4 kernels (including GDC unguarding and PDL group GEMM fixes). TODO comments have been added to both Dockerfiles tracking this.
## Test plan
- [x] Build with `CUDA_ARCHS="12.0a;12.1a"` on DGX Spark (SM121), verify NVFP4 model serves with vLLM CUTLASS backend (`VLLM_NVFP4_GEMM_BACKEND=cutlass --moe-backend=cutlass`)
- [x] Verify FlashInfer CUTLASS MoE on SM12x no longer hits TMA init error
- [x] Build with `CUDA_ARCHS="12.0a;12.1a"`, verify Marlin fallback still works (`VLLM_NVFP4_GEMM_BACKEND=marlin --moe-backend=marlin`)
- [x] Build with `CUDA_ARCHS="10.0a"` only, verify Marlin fallback on SM12x (no `IllegalInstruction`)
- [x] Verify SM100 (B200) still works with native CUTLASS (no regression from CUTLASS bump)
- [x] Verify SM89/SM90 still works (pre-Blackwell unaffected)
- [x] Run `tests/models/quantization/test_nvfp4.py` on SM120+
- [x] Docker build completes with FlashInfer 0.6.7 for both `Dockerfile` and `Dockerfile.nightly_torch`
from the vllm directory after merging:
uv venv --python 3.12 --seed .venv
source .venv/bin/activate
export TORCH_CUDA_ARCH_LIST=12.1a
uv pip install -ve . --torch-backend=auto --refresh
That'll take a bit. Then:
uv pip uninstall flashinfer-cubin flashinfer-python
from the flashinfer directory after merging:
uv pip install --no-build-isolation -e .
when cloning the source of flashinfer make sure you do it recursively since there are submodules for cutlass.
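For example (URL assumed to be the upstream repo; substitute your fork if you're merging the PRs there):

```shell
# CUTLASS is vendored as git submodules, so clone recursively
git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
# if you already cloned without --recursive, fetch the submodules after the fact:
cd flashinfer && git submodule update --init --recursive
```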
Johnny's guide has a step that `rm -rf`s your entire `~/.cache` folder wholesale, but the only directories you need to purge before the FlashInfer JIT / starting vLLM are:
rm -rf ~/.cache/vllm
rm -rf ~/.cache/flashinfer
rm -rf ~/.triton
rm -rf ~/.config/vllm
If you don’t want to nuke all your models :)
Can you post your vLLM serve settings? Would love to give Cascade a try as well.
~/spark-vllm-docker$ uvx llama-benchy --base-url http://127.0.0.1:8000/v1 --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --pp 2048 --depth 4096 16000 32000
Installed 49 packages in 30ms
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 17:26:46
Benchmarking model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Downloading book from https://www.gutenberg.org/files/1661/1661-0.txt...
Saved text to cache: /home/csolutions_ai/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 162015
Warming up...
Warmup (User only) complete. Delta: 16 tokens (Server: 38, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 2.35 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------------------------------------------|----------------:|------------------:|-------------:|------------------:|------------------:|------------------:|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 5909.41 ± 3864.42 | | 5085.66 ± 6183.45 | 5083.31 ± 6183.45 | 5085.71 ± 6183.45 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d4096 | 56.28 ± 0.04 | 58.10 ± 0.04 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16000 | 7660.58 ± 529.70 | | 2370.16 ± 172.01 | 2367.81 ± 172.01 | 2370.22 ± 172.00 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d16000 | 56.46 ± 0.09 | 58.28 ± 0.10 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32000 | 7263.28 ± 169.66 | | 4692.55 ± 111.45 | 4690.20 ± 111.45 | 4692.61 ± 111.46 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d32000 | 56.54 ± 0.10 | 58.37 ± 0.11 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 17:26:46 | latency mode: api
~/spark-vllm-docker$
Buying my Nvidia cap.
Before you run these with the JIT FlashInfer build, you'll want to set:
export MAX_JOBS=8
to keep compilation from OOMing. Also run:
uv pip install fastsafetensors
and
sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb"
if you want the weight loading to scream.
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --kv-cache-dtype fp8 --trust-remote-code --gpu-memory-utilization 0.85 --max-num-seqs 512 --enable-prefix-caching --max-cudagraph-capture-size 512 --mamba-ssm-cache-dtype float32 --reasoning-parser nemotron_v3 --tool-call-parser qwen3_coder --enable-auto-tool-choice --port 8000 --host 0.0.0.0 --load-format fastsafetensors
That was what was used in the gpqa accuracy run I posted previously.
I'm currently running another pass to compare the accuracy impact of switching to --mamba-ssm-cache-dtype float16, to see if there's a huge accuracy falloff (it comes with a decent performance bump, but time will tell whether the model turns into a potato).
If that goes alright, I'm going to take it a little further and drop --mamba-cache-dtype to float16 as well, so both the SSM state and the conv state get the same treatment, and see if there's an impact.
Considering all the hot models right now are Mamba/causal-conv1d hybrids, I'd really like to see the impact firsthand.
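The comparison boils down to re-running the same serve command with only the cache-dtype flags varied; a sketch (run one serve at a time, all other flags as in the command above):

```shell
MODEL=chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4
# baseline, as used for the GPQA run: SSM state kept in float32
vllm serve "$MODEL" --mamba-ssm-cache-dtype float32   # ...rest of flags as above
# experiment 1: only the SSM state dropped to float16
vllm serve "$MODEL" --mamba-ssm-cache-dtype float16   # ...
# experiment 2: SSM state and conv state both in float16
vllm serve "$MODEL" --mamba-cache-dtype float16       # ...
```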
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
--mamba-ssm-cache-dtype float32 \
--max-model-len 262144 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3 \
--kv-cache-dtype fp8
$ uvx llama-benchy --base-url http://127.0.0.1:8000/v1 --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --pp 2048 --depth 4096 16000 32000 128000
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 17:47:43
Benchmarking model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Loading text from cache: /home/csolutions_ai/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 143827
Warming up...
Warmup (User only) complete. Delta: 33 tokens (Server: 55, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 5.59 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=128000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------------------------|-----------------:|------------------:|----------------:|------------------:|------------------:|------------------:|
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d4096 | 6178.99 ± 1076.38 | | 1028.80 ± 166.52 | 1023.21 ± 166.52 | 1029.00 ± 166.69 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d4096 | 59.47 ± 1.00 | 61.71 ± 0.81 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d16000 | 7368.29 ± 503.82 | | 2467.01 ± 176.24 | 2461.42 ± 176.24 | 2467.08 ± 176.27 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d16000 | 57.85 ± 1.38 | 60.52 ± 0.44 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d32000 | 7334.35 ± 52.13 | | 4648.09 ± 32.89 | 4642.50 ± 32.89 | 4648.15 ± 32.90 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d32000 | 56.24 ± 2.23 | 74.63 ± 21.18 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d128000 | 4709.97 ± 20.59 | | 27455.18 ± 249.64 | 27611.77 ± 120.71 | 27618.46 ± 120.97 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d128000 | 236.89 ± 115.28 | 540.34 ± 282.67 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 17:47:43 | latency mode: api
I haven’t done what @trystan1 recommended; it might explode.
The only thing I’m currently recommending would be the purchase of a high quality nvidia baseball cap.
The day NVIDIA finally gets NVFP4 officially working on this device (if ever), I’ll consider whether they belong on the good-guys list again. NVFP4 is essential for the DGX Spark, and honestly, it should have been ready when the Spark launched.
Until then, I’d much rather wear a community hat. The people here who invested their time and did the hard work to find workarounds when NVIDIA failed to pull its own weight are the ones who deserve the credit. You guys are amazing.
My guess is there will be a big push on NVFP4 performance for GB10 between now and when the N1X systems/laptops ship.
It would make sense for nvidia to make the gb10 the dev platform/laptop platform of choice for cuda.
Pure speculation on my part, but seems reasonable.
eugr
March 30, 2026, 5:24pm
The performance seems to be lower than both Marlin and VLLM_CUTLASS though.
eugr
March 30, 2026, 5:27pm
Looks like vLLM PR #38423 got merged, so only FI #2913 is left. I'll run my build with FI #2913 applied, and if it solves the issue, the next community build will include these changes.
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--max-num-seqs 512 \
--enable-prefix-caching \
--max-cudagraph-capture-size 512 \
--mamba-ssm-cache-dtype float32 \
--reasoning-parser nemotron_v3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--port 8000 \
--host 0.0.0.0 \
--max-model-len 262144 \
--load-format fastsafetensors 2>&1
uvx llama-benchy --base-url http://127.0.0.1:8000/v1 --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 --pp 2048 --depth 4096 16000 32000 128000 256000
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.5)
Date: 2026-03-30 19:22:18
Benchmarking model: chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 at http://127.0.0.1:8000/v1
Concurrency levels: [1]
Loading text from cache: /root/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 143827
Warming up...
Warmup (User only) complete. Delta: 33 tokens (Server: 55, Local: 22)
Warmup (System+Empty) complete. Delta: 16 tokens (Server: 38, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 3.37 ms
Running test: pp=2048, tg=32, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=16000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=32000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=128000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=2048, tg=32, depth=256000, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-------------------------------------------|-----------------:|------------------:|--------------------:|-------------------:|-------------------:|-------------------:|
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d4096 | 8578.40 ± 30.83 | | 719.60 ± 2.58 | 716.23 ± 2.58 | 719.69 ± 2.58 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d4096 | 56.84 ± 0.05 | 58.68 ± 0.05 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d16000 | 8011.36 ± 3.26 | | 2256.17 ± 0.92 | 2252.80 ± 0.92 | 2256.27 ± 0.92 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d16000 | 56.87 ± 0.10 | 58.71 ± 0.11 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d32000 | 7346.73 ± 8.63 | | 4637.83 ± 5.45 | 4634.45 ± 5.45 | 4637.90 ± 5.47 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d32000 | 55.52 ± 1.58 | 61.30 ± 4.00 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d128000 | 4757.62 ± 103.09 | | 27350.86 ± 590.90 | 27347.49 ± 590.90 | 27350.92 ± 590.91 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d128000 | 8095.44 ± 6004.28 | 45761.69 ± 46158.55 | | | |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | pp2048 @ d256000 | 3056.12 ± 70.90 | | 84485.71 ± 1977.42 | 84482.34 ± 1977.42 | 84485.76 ± 1977.42 |
| chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 | tg32 @ d256000 | 3200.18 ± 3680.35 | 8432.24 ± 10133.48 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 19:22:18 | latency mode: api
I was just setting up the same, but using @eugr's spark-vllm-docker system.
In OpenClaw the initial response was a little slow, but the next few commands with tools were about twice as fast as the Nano version. I did have to dial back --gpu-memory-utilization to 0.80; 124.2 GB of system memory was cutting it a little close.
eugr
March 30, 2026, 6:31pm
OK, looks like it's not crashing with these two PRs, but at least for Nemotron-3-Super the performance is:
- higher for PP, e.g. ~2140 t/s vs. ~1700 t/s with Marlin/VLLM_CUTLASS at 8192-token context
- slightly lower for TG, e.g. 14.5 vs. 15.5 t/s
I think I'll keep the recipes pinned to Marlin/VLLM_CUTLASS for now, at least until the autotuner errors are gone, but I will update the build to include these PRs (actually, just the FlashInfer one for now, as the vLLM one is merged).
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 | 2205.39 ± 13.67 | | 934.82 ± 5.78 | 928.67 ± 5.78 | 934.99 ± 5.79 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 | 14.42 ± 0.07 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 | 2195.56 ± 7.67 | | 2804.56 ± 9.76 | 2798.40 ± 9.76 | 2804.75 ± 9.75 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d4096 | 14.36 ± 0.01 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d8192 | 2159.83 ± 19.23 | | 4747.63 ± 42.48 | 4741.48 ± 42.48 | 4747.77 ± 42.55 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d8192 | 14.47 ± 0.11 | 15.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 | 2122.38 ± 5.93 | | 8690.80 ± 24.31 | 8684.65 ± 24.31 | 8690.93 ± 24.28 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d16384 | 14.50 ± 0.19 | 15.33 ± 0.47 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32078 | 2003.24 ± 7.67 | | 17041.63 ± 65.53 | 17035.48 ± 65.53 | 17041.90 ± 65.76 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d32078 | 14.33 ± 0.04 | 15.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 11:29:12 | latency mode: api | pp basis: ttfr
There are open PRs that improve nvfp4 and nemotron super. No worries!
eugr
March 30, 2026, 9:44pm
Rebuilt from main again and restarted my Spark (because it crashed due to the shutdown issue), and I'm getting better performance now. Not sure which change did it, but I'm definitely including this PR in the next run.
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 | 1868.93 ± 551.73 | | 1240.96 ± 459.45 | 1231.45 ± 459.45 | 1241.11 ± 459.45 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 | 15.27 ± 0.04 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 | 1552.78 ± 814.54 | | 6943.61 ± 5704.38 | 6934.10 ± 5704.38 | 6943.71 ± 5704.39 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d4096 | 15.17 ± 0.02 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d8192 | 2220.81 ± 6.41 | | 4620.48 ± 13.30 | 4610.97 ± 13.30 | 4620.59 ± 13.30 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d8192 | 15.21 ± 0.08 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 | 2179.33 ± 6.15 | | 8467.23 ± 23.89 | 8457.72 ± 23.89 | 8467.32 ± 23.88 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d16384 | 15.27 ± 0.17 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d32078 | 2051.58 ± 6.14 | | 16643.66 ± 49.85 | 16634.15 ± 49.85 | 16643.78 ± 49.88 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg32 @ d32078 | 15.20 ± 0.06 | 16.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 14:42:37 | latency mode: api | pp basis: ttfr
eugr
March 30, 2026, 10:08pm
BTW, the FlashInfer autotuner can be disabled with --kernel_config '{"enable_flashinfer_autotune": false}'. Since it's failing now anyway, disabling it doesn't affect performance in any meaningful way, but it eliminates the annoying error traces.
There are several PRs coming down the pipe to boost the CUTLASS/FlashInfer kernels. That seems to be where all the development attention is going.
They all depend on one another, so my hat goes off to the courageous optimizers of all things CUDA.