NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Has anyone deployed this one already?

5 Likes

Downloading it now; won't get a chance to test it until later this evening. Very interested to see how it compares with Qwen 122b.

3 Likes

Nemotron-3-Super-120B-NVFP4 is live on Spark 1
model: nvidia/nemotron-3-super

🔧 CHANGE:

  • Model loaded: NVFP4 + FP8 KV cache, Mamba SSM active
  • Backend: FLASHINFER_CUTLASS for NVFP4 GEMM
  • Reasoning parser: super_v3 plugin loaded

UPDATE: not running yet; still trying to get it working :)

Model loaded, but it crashed on first inference with a CUDA error: misaligned address.

UPDATE: Working! 16.6 tok/s (Marlin weight-only dequant, cuDNN path). KV cache 33.6 GiB, model 69.5 GiB. Note: this flashinfer version exposes no native FP4 compute for the GPU, so Marlin dequantizes FP4 to BF16.
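
A quick back-of-envelope check on that 69.5 GiB weight footprint, assuming the ~120B total parameters implied by the model name:

```shell
# Effective bits per parameter: 69.5 GiB of weights over ~120B params.
# (The 120e9 parameter count is an assumption taken from the model name.)
awk 'BEGIN { printf "%.1f\n", 69.5 * 1073741824 * 8 / 120e9 }'
# prints 5.0
```

Roughly 5 effective bits per parameter is plausible for NVFP4 (4-bit weights plus block scales and some higher-precision layers), so the dequant path isn't inflating the on-disk/in-memory footprint.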

2 Likes

Looks like this one scores higher than gpt-oss-120b but still lower than Qwen 3.5 122B.

source: Comparison of AI Models across Intelligence, Performance, and Price

They have the comparison on the model card.

@eugr What recipe should we use, in your opinion? I tried the same one as for Nemotron small and it crashes.

I added Qwen 27B, 9B & 35B MoE.

Can you try the Marlin NVFP4 backend? Similar to this (just replace the model name and reasoning parser):

./launch-cluster.sh --solo \
  --non-privileged -j 8 \
  --apply-mod mods/nemotron-nano \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --port 8888 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --gpu-memory-utilization 0.7
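
Once the server is up, a minimal smoke test against vLLM's OpenAI-compatible endpoint (port 8888 as in the command above; the model id below matches that example and would need to be swapped for a different model):

```shell
# Assumes the vllm serve command above is already running on localhost:8888.
curl -s http://localhost:8888/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 32
      }'
```

If that returns a completion, the reasoning parser and tool-call plumbing can be checked next.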
2 Likes

I will download it later and see if I can get it working and then create a recipe

2 Likes

FYI: in the interim, llama.cpp enjoyers need tip-of-tree (as of a few minutes ago) to load the Q4_K_M model [Commit eaf1d79].

I’m seeing 18.5 t/s with the default context.

With a 1M-token context it's 17.6 t/s on a long response.
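
For anyone starting from scratch, a hedged sketch of the tip-of-tree build and serve (standard llama.cpp commands; the GGUF path and context size are placeholders, adjust to your setup):

```shell
# Build a CUDA-enabled llama.cpp from current master (needed to load this model).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j
# Serve the Q4_K_M GGUF (-m path is a placeholder, -c sets context length,
# -ngl 99 offloads all layers to the GPU).
./build/bin/llama-server -m /path/to/model.Q4_K_M.gguf -c 32768 -ngl 99
```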

1 Like

I will try & let you know

Thanks

1 Like

The advanced deployment guide has instructions for deploying on 1x Spark with trtllm, at the bottom of this page:

4 Likes

It runs!

3 Likes

First small llama-benchy test:

All rows: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4. Blank cells: metric not reported for that run (pp runs report timing, tg runs report peak throughput).

| test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|
| pp2048 (c1) | 1622.40 ± 97.37 | 1622.40 ± 97.37 | | | 1154.88 ± 18.89 | 1153.23 ± 18.89 | 1154.99 ± 18.90 |
| tg128 (c1) | 14.94 ± 0.26 | 14.94 ± 0.26 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| pp2048 (c2) | 1597.26 ± 83.10 | 898.89 ± 144.38 | | | 2076.99 ± 268.95 | 2075.34 ± 268.95 | 2077.08 ± 268.93 |
| tg128 (c2) | 24.03 ± 0.61 | 12.31 ± 0.36 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
| ctx_pp @ d4096 (c1) | 1622.87 ± 42.97 | 1622.87 ± 42.97 | | | 2353.86 ± 82.28 | 2352.21 ± 82.28 | 2353.97 ± 82.28 |
| ctx_tg @ d4096 (c1) | 15.32 ± 0.15 | 15.32 ± 0.15 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| pp2048 @ d4096 (c1) | 587.85 ± 46.78 | 587.85 ± 46.78 | | | 3509.05 ± 295.70 | 3507.40 ± 295.70 | 3509.14 ± 295.70 |
| tg128 @ d4096 (c1) | 14.65 ± 0.31 | 14.65 ± 0.31 | 15.67 ± 0.47 | 15.67 ± 0.47 | | | |
| ctx_pp @ d4096 (c2) | 1598.80 ± 28.41 | 993.60 ± 197.27 | | | 3821.95 ± 758.53 | 3820.30 ± 758.53 | 3822.02 ± 758.54 |
| ctx_tg @ d4096 (c2) | 21.73 ± 0.24 | 11.72 ± 0.74 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
| pp2048 @ d4096 (c2) | 594.71 ± 12.90 | 389.07 ± 91.80 | | | 5576.41 ± 1318.35 | 5574.75 ± 1318.35 | 5576.48 ± 1318.36 |
| tg128 @ d4096 (c2) | 19.74 ± 0.22 | 11.22 ± 1.21 | 27.33 ± 0.94 | 13.83 ± 0.37 | | | |
| ctx_pp @ d8192 (c1) | 1654.28 ± 35.67 | 1654.28 ± 35.67 | | | 4534.22 ± 84.75 | 4532.56 ± 84.75 | 4534.29 ± 84.76 |
| ctx_tg @ d8192 (c1) | 14.92 ± 0.53 | 14.92 ± 0.53 | 15.67 ± 0.47 | 15.67 ± 0.47 | | | |
| pp2048 @ d8192 (c1) | 365.10 ± 8.92 | 365.10 ± 8.92 | | | 5614.55 ± 139.13 | 5612.90 ± 139.13 | 5614.63 ± 139.12 |
| tg128 @ d8192 (c1) | 15.55 ± 0.10 | 15.55 ± 0.10 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| ctx_pp @ d8192 (c2) | 1627.22 ± 52.73 | 1079.73 ± 260.68 | | | 7260.79 ± 1826.35 | 7259.13 ± 1826.35 | 7260.85 ± 1826.35 |
| ctx_tg @ d8192 (c2) | 18.93 ± 0.37 | 11.30 ± 1.67 | 27.33 ± 0.94 | 13.83 ± 0.37 | | | |
| pp2048 @ d8192 (c2) | 371.58 ± 7.12 | 249.34 ± 63.95 | | | 8790.86 ± 2246.39 | 8789.20 ± 2246.39 | 8790.93 ± 2246.40 |
| tg128 @ d8192 (c2) | 17.74 ± 0.42 | 10.99 ± 1.95 | 27.67 ± 0.47 | 13.83 ± 0.37 | | | |
| ctx_pp @ d16384 (c1) | 1654.03 ± 41.06 | 1654.03 ± 41.06 | | | 9100.52 ± 222.74 | 9098.87 ± 222.74 | 9100.60 ± 222.74 |
| ctx_tg @ d16384 (c1) | 14.83 ± 0.22 | 14.83 ± 0.22 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| pp2048 @ d16384 (c1) | 198.97 ± 4.29 | 198.97 ± 4.29 | | | 10299.70 ± 225.20 | 10298.04 ± 225.20 | 10299.79 ± 225.20 |
| tg128 @ d16384 (c1) | 14.95 ± 0.55 | 14.95 ± 0.55 | 16.00 ± 0.00 | 16.00 ± 0.00 | | | |
| ctx_pp @ d16384 (c2) | 1596.13 ± 15.03 | 1077.30 ± 283.11 | | | 14632.06 ± 3782.07 | 14630.40 ± 3782.07 | 14632.12 ± 3782.03 |
| ctx_tg @ d16384 (c2) | 13.27 ± 1.69 | 9.94 ± 2.45 | 27.33 ± 0.94 | 14.17 ± 0.69 | | | |
| pp2048 @ d16384 (c2) | 196.08 ± 5.42 | 137.03 ± 40.37 | | | 16309.77 ± 4660.60 | 16308.11 ± 4660.60 | 16309.80 ± 4660.59 |
| tg128 @ d16384 (c2) | 13.09 ± 0.24 | 9.59 ± 2.84 | 27.33 ± 0.94 | 13.67 ± 0.47 | | | |
4 Likes
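
One takeaway from the decode numbers, sketched as a quick ratio (values copied from the tg128 c1/c2 rows above):

```shell
# Aggregate decode throughput at concurrency 2 vs 1 (tg128, no prior context):
# 24.03 t/s total at c2 vs 14.94 t/s at c1.
awk 'BEGIN { c1 = 14.94; c2 = 24.03; printf "%.2f\n", c2 / c1 }'
# prints 1.61
```

So batching two requests yields ~1.6x aggregate throughput, while each request drops from 14.94 to 12.31 t/s, i.e. about an 18% per-request slowdown.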

I’m getting 16.6 tok/s (Marlin weight-only dequant, cuDNN path). KV cache 33.6 GiB, model 69.5 GiB.

1 Like

Almost exactly the same.

I think their doc has some typos and it should be --reasoning-parser super_v3, no?


It runs, that’s a plus. But it’s slower than it should be…
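
For reference, a hedged sketch of the corrected parser flags (the plugin file name is an assumption, guessed by analogy with the Nano recipe; verify against what the model card actually ships):

```shell
  --reasoning-parser-plugin super_v3_reasoning_parser.py \
  --reasoning-parser super_v3
```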

1 Like

Yes, it seems so; the deployment guide page I linked also uses the old reasoning parser.

Yes, I was just giving an example for a different model. I don’t have the corresponding mod for this one yet.

1 Like