NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Has anyone deployed this one already?

5 Likes

Downloading it now; won't get a chance to test it until later this evening. Very interested to see how it compares with Qwen 122b.

3 Likes

Nemotron-3-Super-120B-NVFP4 is live on Spark 1
model: nvidia/nemotron-3-super

🔧 CHANGE:

  • Model loaded: NVFP4 + FP8 KV cache, Mamba SSM active
  • Backend: FLASHINFER_CUTLASS for NVFP4 GEMM
  • Reasoning parser: super_v3 plugin loaded

UPDATE: not running yet; still trying to get it working :)

Model loaded, but it crashed on first inference with a CUDA error: misaligned address.

UPDATE: Working! 16.6 tok/s (Marlin weight-only dequant, cuDNN path). KV cache 33.6 GiB, model 69.5 GiB. Note: this flashinfer version exposes no native FP4 compute for the GPU, so Marlin dequantizes FP4 to BF16.
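
A quick back-of-envelope check on that 69.5 GiB weight footprint, assuming the ~120B total parameters implied by the model name:

```shell
# Effective bits per parameter: 69.5 GiB of weights over ~120B params.
# (The 120e9 parameter count is an assumption taken from the model name.)
awk 'BEGIN { printf "%.1f\n", 69.5 * 1073741824 * 8 / 120e9 }'
# prints 5.0
```

Roughly 5 effective bits per parameter is plausible for NVFP4 (4-bit weights plus block scales and some higher-precision layers), so the dequant path isn't inflating the on-disk/in-memory footprint.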

2 Likes

Looks like this one scores higher than gpt-oss-120b but still lower than Qwen 3.5 122B.

source: Comparison of AI Models across Intelligence, Performance, and Price

They have the comparison on the model card.

@eugr What recipe should we use, in your opinion? I tried the same one as for Nemotron small and it crashes.

I added Qwen 27B, 9B & 35B MoE.

Can you try the Marlin NVFP4 backend? Similar to this (just replace the model name and reasoning parser):

./launch-cluster.sh --solo \
  --non-privileged -j 8 \
  --apply-mod mods/nemotron-nano \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --port 8888 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --gpu-memory-utilization 0.7
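
Once the server is up, a minimal smoke test against vLLM's OpenAI-compatible endpoint (port 8888 as in the command above; the model id below matches that example and would need to be swapped for a different model):

```shell
# Assumes the vllm serve command above is already running on localhost:8888.
curl -s http://localhost:8888/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 32
      }'
```

If that returns a completion, the reasoning parser and tool-call plumbing can be checked next.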
2 Likes

I will download it later and see if I can get it working and then create a recipe

2 Likes

FYI: in the interim, llama.cpp enjoyers need tip-of-tree (as of a few minutes ago) to load the Q4_K_M model [Commit eaf1d79].

I’m seeing 18.5 t/s with the default context.

With a 1M-token context it's 17.6 t/s on a long response.
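
For anyone starting from scratch, a hedged sketch of the tip-of-tree build and serve (standard llama.cpp commands; the GGUF path and context size are placeholders, adjust to your setup):

```shell
# Build a CUDA-enabled llama.cpp from current master (needed to load this model).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j
# Serve the Q4_K_M GGUF (-m path is a placeholder, -c sets context length,
# -ngl 99 offloads all layers to the GPU).
./build/bin/llama-server -m /path/to/model.Q4_K_M.gguf -c 32768 -ngl 99
```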

1 Like

I will try & let you know

Thanks

1 Like

The advanced deployment guide has instructions for deploying on 1x Spark with trtllm, at the bottom of this page:

4 Likes

It runs!

3 Likes

First small llama-benchy test:

All rows: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4. Blank cells: metric not reported for that run (pp runs report timing, tg runs report peak throughput).

| test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|
| pp2048 (c1) | 1622.40 ± 97.37 | 1622.40 ± 97.37 | | | 1154.88 ± 18.89 | 1153.23 ± 18.89 | 1154.99 ± 18.90 |
| tg128 (c1) | 14.94 ± 0.26 | 14.94 ± 0.26 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| pp2048 (c2) | 1597.26 ± 83.10 | 898.89 ± 144.38 | | | 2076.99 ± 268.95 | 2075.34 ± 268.95 | 2077.08 ± 268.93 |
| tg128 (c2) | 24.03 ± 0.61 | 12.31 ± 0.36 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
| ctx_pp @ d4096 (c1) | 1622.87 ± 42.97 | 1622.87 ± 42.97 | | | 2353.86 ± 82.28 | 2352.21 ± 82.28 | 2353.97 ± 82.28 |
| ctx_tg @ d4096 (c1) | 15.32 ± 0.15 | 15.32 ± 0.15 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| pp2048 @ d4096 (c1) | 587.85 ± 46.78 | 587.85 ± 46.78 | | | 3509.05 ± 295.70 | 3507.40 ± 295.70 | 3509.14 ± 295.70 |
| tg128 @ d4096 (c1) | 14.65 ± 0.31 | 14.65 ± 0.31 | 15.67 ± 0.47 | 15.67 ± 0.47 | | | |
| ctx_pp @ d4096 (c2) | 1598.80 ± 28.41 | 993.60 ± 197.27 | | | 3821.95 ± 758.53 | 3820.30 ± 758.53 | 3822.02 ± 758.54 |
| ctx_tg @ d4096 (c2) | 21.73 ± 0.24 | 11.72 ± 0.74 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
| pp2048 @ d4096 (c2) | 594.71 ± 12.90 | 389.07 ± 91.80 | | | 5576.41 ± 1318.35 | 5574.75 ± 1318.35 | 5576.48 ± 1318.36 |
| tg128 @ d4096 (c2) | 19.74 ± 0.22 | 11.22 ± 1.21 | 27.33 ± 0.94 | 13.83 ± 0.37 | | | |
| ctx_pp @ d8192 (c1) | 1654.28 ± 35.67 | 1654.28 ± 35.67 | | | 4534.22 ± 84.75 | 4532.56 ± 84.75 | 4534.29 ± 84.76 |
| ctx_tg @ d8192 (c1) | 14.92 ± 0.53 | 14.92 ± 0.53 | 15.67 ± 0.47 | 15.67 ± 0.47 | | | |
| pp2048 @ d8192 (c1) | 365.10 ± 8.92 | 365.10 ± 8.92 | | | 5614.55 ± 139.13 | 5612.90 ± 139.13 | 5614.63 ± 139.12 |
| tg128 @ d8192 (c1) | 15.55 ± 0.10 | 15.55 ± 0.10 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| ctx_pp @ d8192 (c2) | 1627.22 ± 52.73 | 1079.73 ± 260.68 | | | 7260.79 ± 1826.35 | 7259.13 ± 1826.35 | 7260.85 ± 1826.35 |
| ctx_tg @ d8192 (c2) | 18.93 ± 0.37 | 11.30 ± 1.67 | 27.33 ± 0.94 | 13.83 ± 0.37 | | | |
| pp2048 @ d8192 (c2) | 371.58 ± 7.12 | 249.34 ± 63.95 | | | 8790.86 ± 2246.39 | 8789.20 ± 2246.39 | 8790.93 ± 2246.40 |
| tg128 @ d8192 (c2) | 17.74 ± 0.42 | 10.99 ± 1.95 | 27.67 ± 0.47 | 13.83 ± 0.37 | | | |
| ctx_pp @ d16384 (c1) | 1654.03 ± 41.06 | 1654.03 ± 41.06 | | | 9100.52 ± 222.74 | 9098.87 ± 222.74 | 9100.60 ± 222.74 |
| ctx_tg @ d16384 (c1) | 14.83 ± 0.22 | 14.83 ± 0.22 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| pp2048 @ d16384 (c1) | 198.97 ± 4.29 | 198.97 ± 4.29 | | | 10299.70 ± 225.20 | 10298.04 ± 225.20 | 10299.79 ± 225.20 |
| tg128 @ d16384 (c1) | 14.95 ± 0.55 | 14.95 ± 0.55 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| ctx_pp @ d16384 (c2) | 1596.13 ± 15.03 | 1077.30 ± 283.11 | | | 14632.06 ± 3782.07 | 14630.40 ± 3782.07 | 14632.12 ± 3782.03 |
| ctx_tg @ d16384 (c2) | 13.27 ± 1.69 | 9.94 ± 2.45 | 27.33 ± 0.94 | 14.17 ± 0.69 | | | |
| pp2048 @ d16384 (c2) | 196.08 ± 5.42 | 137.03 ± 40.37 | | | 16309.77 ± 4660.60 | 16308.11 ± 4660.60 | 16309.80 ± 4660.59 |
| tg128 @ d16384 (c2) | 13.09 ± 0.24 | 9.59 ± 2.84 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
4 Likes
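
One takeaway from the decode numbers, sketched as a quick ratio (values copied from the tg128 c1/c2 rows above):

```shell
# Aggregate decode throughput at concurrency 2 vs 1 (tg128, no prior context):
# 24.03 t/s total at c2 vs 14.94 t/s at c1.
awk 'BEGIN { c1 = 14.94; c2 = 24.03; printf "%.2f\n", c2 / c1 }'
# prints 1.61
```

So batching two requests yields ~1.6x aggregate throughput, while each request drops from 14.94 to 12.31 t/s, i.e. about an 18% per-request slowdown.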

I’m getting 16.6 tok/s (Marlin weight-only dequant, cuDNN path). KV cache 33.6 GiB, model 69.5 GiB.

1 Like

Almost exactly the same.

I think their doc has some typos and it should be --reasoning-parser super_v3, no?


It runs, that’s a plus. But it’s slower than it should be…
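
For reference, a hedged sketch of the corrected parser flags (the plugin file name is an assumption, guessed by analogy with the Nano recipe; verify against what the model card actually ships):

```shell
  --reasoning-parser-plugin super_v3_reasoning_parser.py \
  --reasoning-parser super_v3
```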

1 Like

Yes, it seems so; the deployment guide page I linked also uses the old reasoning parser.

Yes, I was just giving an example for a different model. I don’t have the corresponding mod for this one yet.

1 Like