Has anyone already deployed this one?
Downloading it now; won't get a chance to test it until later this evening. Very interested to see how it compares with Qwen 122b.
Nemotron-3-Super-120B-NVFP4 is live on Spark 1
model: nvidia/nemotron-3-super
🔧 CHANGE:
- Model loaded: NVFP4 + FP8 KV cache, Mamba SSM active
- Backend: FLASHINFER_CUTLASS for NVFP4 GEMM
- Reasoning parser: super_v3 plugin loaded
UPDATE: not running yet, still trying to get it working :)
Model loaded but CUDA error: misaligned address — crashed on first inference
UPDATE: Working! 16.6 tok/s (Marlin weight-only dequant — cuDNN path). KV cache 33.6 GiB, model 69.5 GiB. Note: the GPU has no native FP4 compute capability with this flashinfer version, so Marlin dequantizes FP4→BF16.
Looks like this one scores higher than gpt-oss-120b but still lower than Qwen 3.5 122B.
source: Comparison of AI Models across Intelligence, Performance, and Price
@eugr What recipe would you recommend? I tried the same one as for Nemotron small and it crashes.
Can you try with Marlin NVFP4 backend? Similar to this (just replace model name and reasoning parser):
./launch-cluster.sh --solo \
--non-privileged -j 8 \
--apply-mod mods/nemotron-nano \
-e VLLM_NVFP4_GEMM_BACKEND=marlin \
-e VLLM_TEST_FORCE_FP8_MARLIN=1 \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--max-num-seqs 8 \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--port 8888 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3 \
--kv-cache-dtype fp8 \
--load-format fastsafetensors \
--gpu-memory-utilization 0.7
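Once the server is up, a quick smoke test against the OpenAI-compatible endpoint confirms it is actually serving and gives a rough tok/s number. This is only a sketch (stdlib-only); the port and model id mirror the launch command above, so adjust them to your deployment:

```python
# Minimal smoke test for a vLLM OpenAI-compatible endpoint (stdlib only).
# Port 8888 and the model id mirror the launch command above -- adjust as needed.
import json
import time
import urllib.request

def build_payload(prompt: str,
                  model: str = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4") -> dict:
    """Assemble a /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def measure_tok_per_s(base_url: str = "http://localhost:8888") -> float:
    """Send one request and derive completion tokens/s from the usage stats."""
    data = json.dumps(build_payload("Briefly explain NVFP4 quantization.")).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - start
    return body["usage"]["completion_tokens"] / elapsed

# With the server running:  print(f"{measure_tok_per_s():.1f} tok/s")
```

Note this counts queueing + prefill + decode in the denominator, so it understates pure decode speed; fine as a sanity check, not a benchmark.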
I will download it later and see if I can get it working and then create a recipe
FYI: in the interim, llama.cpp enjoyers need tip-of-tree (as of a few minutes ago) to load the Q4_K_M model [Commit eaf1d79].
I’m seeing 18.5 t/s with the default context.
With a 1M-token context it’s 17.6 t/s on a long response.
I will try & let you know
Thanks
Advanced deployment guide has instructions for deployment on 1x Spark with trtllm! At the bottom of this page:
First llama benchy small test
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 (c1) | 1622.40 ± 97.37 | 1622.40 ± 97.37 | | | 1154.88 ± 18.89 | 1153.23 ± 18.89 | 1154.99 ± 18.90 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg128 (c1) | 14.94 ± 0.26 | 14.94 ± 0.26 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 (c2) | 1597.26 ± 83.10 | 898.89 ± 144.38 | | | 2076.99 ± 268.95 | 2075.34 ± 268.95 | 2077.08 ± 268.93 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg128 (c2) | 24.03 ± 0.61 | 12.31 ± 0.36 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_pp @ d4096 (c1) | 1622.87 ± 42.97 | 1622.87 ± 42.97 | | | 2353.86 ± 82.28 | 2352.21 ± 82.28 | 2353.97 ± 82.28 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_tg @ d4096 (c1) | 15.32 ± 0.15 | 15.32 ± 0.15 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 (c1) | 587.85 ± 46.78 | 587.85 ± 46.78 | | | 3509.05 ± 295.70 | 3507.40 ± 295.70 | 3509.14 ± 295.70 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg128 @ d4096 (c1) | 14.65 ± 0.31 | 14.65 ± 0.31 | 15.67 ± 0.47 | 15.67 ± 0.47 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_pp @ d4096 (c2) | 1598.80 ± 28.41 | 993.60 ± 197.27 | | | 3821.95 ± 758.53 | 3820.30 ± 758.53 | 3822.02 ± 758.54 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_tg @ d4096 (c2) | 21.73 ± 0.24 | 11.72 ± 0.74 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d4096 (c2) | 594.71 ± 12.90 | 389.07 ± 91.80 | | | 5576.41 ± 1318.35 | 5574.75 ± 1318.35 | 5576.48 ± 1318.36 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg128 @ d4096 (c2) | 19.74 ± 0.22 | 11.22 ± 1.21 | 27.33 ± 0.94 | 13.83 ± 0.37 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_pp @ d8192 (c1) | 1654.28 ± 35.67 | 1654.28 ± 35.67 | | | 4534.22 ± 84.75 | 4532.56 ± 84.75 | 4534.29 ± 84.76 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_tg @ d8192 (c1) | 14.92 ± 0.53 | 14.92 ± 0.53 | 15.67 ± 0.47 | 15.67 ± 0.47 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d8192 (c1) | 365.10 ± 8.92 | 365.10 ± 8.92 | | | 5614.55 ± 139.13 | 5612.90 ± 139.13 | 5614.63 ± 139.12 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg128 @ d8192 (c1) | 15.55 ± 0.10 | 15.55 ± 0.10 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_pp @ d8192 (c2) | 1627.22 ± 52.73 | 1079.73 ± 260.68 | | | 7260.79 ± 1826.35 | 7259.13 ± 1826.35 | 7260.85 ± 1826.35 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_tg @ d8192 (c2) | 18.93 ± 0.37 | 11.30 ± 1.67 | 27.33 ± 0.94 | 13.83 ± 0.37 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d8192 (c2) | 371.58 ± 7.12 | 249.34 ± 63.95 | | | 8790.86 ± 2246.39 | 8789.20 ± 2246.39 | 8790.93 ± 2246.40 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg128 @ d8192 (c2) | 17.74 ± 0.42 | 10.99 ± 1.95 | 27.67 ± 0.47 | 13.83 ± 0.37 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_pp @ d16384 (c1) | 1654.03 ± 41.06 | 1654.03 ± 41.06 | | | 9100.52 ± 222.74 | 9098.87 ± 222.74 | 9100.60 ± 222.74 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_tg @ d16384 (c1) | 14.83 ± 0.22 | 14.83 ± 0.22 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 (c1) | 198.97 ± 4.29 | 198.97 ± 4.29 | | | 10299.70 ± 225.20 | 10298.04 ± 225.20 | 10299.79 ± 225.20 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg128 @ d16384 (c1) | 14.95 ± 0.55 | 14.95 ± 0.55 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_pp @ d16384 (c2) | 1596.13 ± 15.03 | 1077.30 ± 283.11 | | | 14632.06 ± 3782.07 | 14630.40 ± 3782.07 | 14632.12 ± 3782.03 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ctx_tg @ d16384 (c2) | 13.27 ± 1.69 | 9.94 ± 2.45 | 27.33 ± 0.94 | 14.17 ± 0.69 | | | |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | pp2048 @ d16384 (c2) | 196.08 ± 5.42 | 137.03 ± 40.37 | | | 16309.77 ± 4660.60 | 16308.11 ± 4660.60 | 16309.80 ± 4660.59 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | tg128 @ d16384 (c2) | 13.09 ± 0.24 | 9.59 ± 2.84 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
I’m getting 16.6 tok/s (Marlin weight-only dequant — cuDNN path). KV cache 33.6 GiB, model 69.5 GiB.
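For anyone budgeting memory: a back-of-envelope calculation shows why an FP8 KV cache lands in the tens of GiB at the 262144-token max-model-len from the recipe. The hyperparameters below are made-up placeholders, not Nemotron-3-Super's actual config (its Mamba/SSM layers hold state differently and only the attention layers keep a KV cache), so treat this purely as a sizing sketch:

```python
# Back-of-envelope KV-cache sizing for the attention layers only.
# All hyperparameters here are HYPOTHETICAL placeholders, not the real
# Nemotron-3-Super config (Mamba/SSM layers don't use a KV cache at all).
def kv_cache_gib(n_tokens: int, n_attn_layers: int, n_kv_heads: int,
                 head_dim: int, dtype_bytes: int = 1) -> float:
    # 2x for keys and values; fp8 KV cache -> dtype_bytes = 1
    total_bytes = 2 * n_attn_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens
    return total_bytes / 2**30

# Example with made-up layer/head counts at 262144 tokens of context:
print(f"{kv_cache_gib(n_tokens=262_144, n_attn_layers=20, n_kv_heads=8, head_dim=128):.1f} GiB")  # → 10.0 GiB
```

Scaling the placeholder numbers up or down quickly shows how context length and KV-head count dominate the footprint, which is why `--kv-cache-dtype fp8` and `--gpu-memory-utilization` matter so much here.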
I think their doc has some typos and it should be `--reasoning-parser super_v3`, no?
It runs, that’s a plus. But slower than it should be…
Yes, seems so, the deployment guide page I linked also has the old reasoning parser.
Yes, I was just giving an example for a different model. Don’t have the corresponding mod for this one yet.