Nemotron-3-Nano-30B-A3B-NVFP4: ultra-efficient NVFP4 precision version of Nemotron 3 Nano

Per NVIDIA: “We just launched an ultra-efficient NVFP4 precision version of Nemotron 3 Nano that delivers up to 4x higher throughput on Blackwell B200.

Using our new Quantization Aware Distillation method, the NVFP4 version achieves up to 99.4% accuracy of BF16.”

I’m wondering if anyone has experimented with this in vLLM yet, and what the performance looks like?


Don’t know - it doesn’t work with the NVIDIA container listed in the README (yes…I see the DGX Spark mentioned), and it doesn’t work with our Community Container either. Neither supports the SM 12.1 compute kernels required to use the Blackwell NVFP4 compute units.

Interesting, I’ll try later today. I wonder what is different about this one.

I get this error (tried with all variants of tags/builds):

NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.

It detects flashinfer-cutlass for NVFP4 GEMM but fails at MoE layer init. Seems like the NVFP4 MoE kernels don’t support GB10 + distributed yet.

Wonder if Chris’s build may address this…might try later.

Give it a shot, but there are some differences between NVFP4 and MXFP4. They are different ‘shortcuts’ for representing larger numbers - and I’ve only been playing with gpt-oss-120b and MXFP4.

I would suspect some changes to be necessary for it to work. But who knows, happy accidents can happen!
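To make the difference concrete, here’s a toy Python sketch of the two block-scaling schemes as I understand them: MXFP4 uses 32-element blocks with power-of-two (E8M0) scales, while NVFP4 uses 16-element blocks with FP8 (E4M3) scales; both store values in FP4 (E2M1). The sketch shrinks blocks to 4 elements and models the scale formats only by whether the scale is rounded up to a power of two — this is illustrative, not NVIDIA’s kernel code, and real NVFP4 also carries a second per-tensor FP32 scale, omitted here.

```python
import math

# E2M1 (FP4) representable magnitudes
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x, scale):
    """Round x/scale to the nearest representable E2M1 magnitude, keep the sign."""
    mag = min(FP4_E2M1, key=lambda v: abs(v - abs(x) / scale))
    return math.copysign(mag * scale, x)

def quantize_block(block, power_of_two_scale):
    """Pick one shared scale per block so its largest value maps to FP4 max (6.0)."""
    amax = max(abs(v) for v in block) or 1.0
    scale = amax / 6.0
    if power_of_two_scale:
        # MXFP4-style: the shared scale is E8M0, i.e. restricted to powers of two
        scale = 2.0 ** math.ceil(math.log2(scale))
    return [quantize_fp4(v, scale) for v in block]

data = [0.11, -0.52, 0.98, 2.7]
mx = quantize_block(data, power_of_two_scale=True)   # MXFP4-style block
nv = quantize_block(data, power_of_two_scale=False)  # NVFP4-style block
print(mx)  # the power-of-two scale overshoots the block's largest value
print(nv)  # the finer-grained scale recovers it almost exactly
```

The finer (non-power-of-two) scale is one reason the two formats need different kernels: the scale factors live in different encodings and the block sizes differ, so an MXFP4 fused-MoE path can’t simply be pointed at NVFP4 weights.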

No, it doesn’t work:

 RuntimeError: Check failed: (isWMxfp4AMxfp8Quant() || isWMxfp4AFp8Quant()) is false: FLASHINFER_FUSED_MOE_MXFP4_MINIMAL only supports MXFP4 (FP8xFP4) fused MoE.

Wild to me that even NVIDIA’s container doesn’t fully support it.

Par for the course I suppose.

Try to fix it :). It’s possible to do… we have AIs to help ;).

Yes, I’ve put that in my queue.

Fixing a few other things atm.

My frustration is really their documentation stating it works on the DGX when it in fact does not.

Some days I like to just experiment with models vs. hermit into debug mode.

ya know?


Totally… things are moving so quickly!

What was working weeks ago is now different! Part of the joy for me. It does make it a little hard to keep up though!

Thanks everyone for checking it out. They are supposedly releasing a Super model soon that will be 100B params (10B active), so it might be an interesting model to compare with gpt-oss-120b on the Sparks…

Disappointing is another word for it. When the head of NVIDIA sells a box on the strength of “It just works” and then it kind of just doesn’t. I know we’ll get there, but it’s very frustrating.

docker run --gpus all --ipc=host --ulimit memlock=-1 --name vllm --rm -it --network host --ulimit stack=67108864 \
  -e HF_TOKEN=$HF_TOKEN \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -v $HOME/nano_v3_reasoning_parser.py:/nano_v3_reasoning_parser.py:ro \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  bash -c 'vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --kv-cache-dtype fp8 --trust-remote-code --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser-plugin /nano_v3_reasoning_parser.py --reasoning-parser nano_v3 --port 8000 --host 0.0.0.0 --max-model-len 262144 --max-num-seqs 256 --max-cudagraph-capture-size 256'

This works on mine, getting decent performance:

(198 active requests from GPQA Diamond ~ 1350 tokens/s)

(APIServer pid=1) INFO 01-30 00:01:30 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.5 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-30 00:01:40 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.9 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-30 00:01:50 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.8 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%

Edit:

One thing I will say regarding the Nemotron 3 Nano model is that any speed benefit it has over gpt-oss-120b is offset by the number of tokens it needs to generate in order to think. I’m able to get ~730 tokens/s on the same test using gpt-oss-120b, but it only needs to generate roughly half the reasoning tokens and achieves a higher score. Looking forward to the 100B Nemotron model to see if it gains any ‘token efficiency’.
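Rough arithmetic behind that trade-off: what matters is wall-clock time per answer, i.e. reasoning tokens divided by decode throughput. The token counts below are made-up round numbers chosen only to reflect the ~2x ratio described above; the throughput figures are the ones measured in this thread.

```python
# Back-of-the-envelope: raw decode speed vs. reasoning-token budget.
def seconds_to_answer(tokens_per_sec, reasoning_tokens):
    return reasoning_tokens / tokens_per_sec

nemotron_s = seconds_to_answer(1365.0, 8000)  # ~2x the reasoning tokens (assumed)
gpt_oss_s = seconds_to_answer(730.0, 4000)    # half the tokens at half the speed

# Nemotron's ~1.9x raw-throughput edge is roughly cancelled out by the extra thinking
print(f"nemotron: {nemotron_s:.1f}s, gpt-oss-120b: {gpt_oss_s:.1f}s")
```

With these assumed budgets the two come out within about 10% of each other per answer, which is why raw tokens/s alone is a misleading comparison between reasoning models.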


The breaking changes were introduced in this commit: [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414) · vllm-project/vllm@42135d6 · GitHub

Hello,

I tested this on the RTX 5090, Pro 6000, DGX Spark, and Jetson Thor. It is a very good model for edge devices.

It works on both DGX Spark and Jetson Thor using the following NGC container:

sudo docker run -it --rm \
  --pull always \
  --runtime=nvidia \
  --network host \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  bash -c "wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py && \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8"

You can also join the NVIDIA Jetson AI Lab Discord channel here:

Thanks, I was able to get this working. Here’s the performance test:

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 | 11056.06 ± 120.52 | 240.87 ± 2.02 | 185.26 ± 2.02 | 240.98 ± 2.00 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 | 56.19 ± 0.18 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d4096 | 11707.71 ± 13.91 | 405.47 ± 0.42 | 349.86 ± 0.42 | 405.55 ± 0.40 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d4096 | 56.07 ± 0.05 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 5824.48 ± 18.54 | 407.23 ± 1.12 | 351.62 ± 1.12 | 407.33 ± 1.12 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d4096 | 55.89 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d8192 | 11189.35 ± 906.85 | 792.87 ± 63.38 | 737.26 ± 63.38 | 792.98 ± 63.38 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d8192 | 53.88 ± 2.61 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d8192 | 5732.80 ± 33.88 | 412.87 ± 2.12 | 357.26 ± 2.12 | 412.99 ± 2.11 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d8192 | 55.77 ± 0.08 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d16384 | 10684.28 ± 208.09 | 1589.67 ± 30.28 | 1534.06 ± 30.28 | 1589.81 ± 30.27 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d16384 | 55.28 ± 0.02 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16384 | 5618.24 ± 22.39 | 420.15 ± 1.46 | 364.53 ± 1.46 | 420.25 ± 1.45 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d16384 | 55.25 ± 0.07 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d32768 | 9548.60 ± 9.99 | 3487.32 ± 3.59 | 3431.71 ± 3.59 | 3487.48 ± 3.57 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d32768 | 54.56 ± 0.12 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32768 | 6032.25 ± 23.48 | 395.13 ± 1.32 | 339.51 ± 1.32 | 395.24 ± 1.33 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d32768 | 54.36 ± 0.01 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d65535 | 7696.52 ± 212.30 | 8577.12 ± 239.73 | 8521.50 ± 239.73 | 8577.24 ± 239.73 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d65535 | 52.93 ± 0.04 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d65535 | 4116.75 ± 32.49 | 553.12 ± 3.94 | 497.51 ± 3.94 | 553.25 ± 3.90 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d65535 | 52.77 ± 0.03 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d100000 | 6492.68 ± 42.68 | 15458.19 ± 101.16 | 15402.57 ± 101.16 | 15458.36 ± 101.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d100000 | 51.32 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d100000 | 1958.31 ± 15.52 | 1101.48 ± 8.31 | 1045.87 ± 8.31 | 1101.60 ± 8.31 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d100000 | 51.42 ± 0.06 | | | |

llama-benchy (0.1.1)
date: 2026-01-31 13:46:07 | latency mode: generation

For reference, this is what I’m getting with gpt-oss-20b using https://github.com/christopherowen/spark-vllm-mxfp4-docker

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| openai/gpt-oss-20b | pp2048 | 12445.04 ± 37.63 | 186.14 ± 0.50 | 164.57 ± 0.50 | 219.18 ± 0.73 |
| openai/gpt-oss-20b | tg32 | 91.20 ± 0.05 | | | |
| openai/gpt-oss-20b | ctx_pp @ d4096 | 12519.82 ± 29.40 | 348.74 ± 0.77 | 327.16 ± 0.77 | 382.78 ± 0.51 |
| openai/gpt-oss-20b | ctx_tg @ d4096 | 90.49 ± 0.19 | | | |
| openai/gpt-oss-20b | pp2048 @ d4096 | 10390.47 ± 39.12 | 218.68 ± 0.74 | 197.11 ± 0.74 | 252.93 ± 0.65 |
| openai/gpt-oss-20b | tg32 @ d4096 | 89.43 ± 0.23 | | | |
| openai/gpt-oss-20b | ctx_pp @ d8192 | 11328.31 ± 45.43 | 744.73 ± 2.90 | 723.16 ± 2.90 | 778.89 ± 3.10 |
| openai/gpt-oss-20b | ctx_tg @ d8192 | 88.60 ± 0.15 | | | |
| openai/gpt-oss-20b | pp2048 @ d8192 | 9013.26 ± 23.36 | 248.80 ± 0.59 | 227.22 ± 0.59 | 283.84 ± 0.71 |
| openai/gpt-oss-20b | tg32 @ d8192 | 87.57 ± 0.16 | | | |
| openai/gpt-oss-20b | ctx_pp @ d16384 | 9849.90 ± 39.10 | 1684.97 ± 6.59 | 1663.39 ± 6.59 | 1720.92 ± 6.72 |
| openai/gpt-oss-20b | ctx_tg @ d16384 | 85.17 ± 0.11 | | | |
| openai/gpt-oss-20b | pp2048 @ d16384 | 7250.57 ± 23.56 | 304.04 ± 0.92 | 282.46 ± 0.92 | 340.31 ± 0.50 |
| openai/gpt-oss-20b | tg32 @ d16384 | 84.37 ± 0.20 | | | |
| openai/gpt-oss-20b | ctx_pp @ d32768 | 7983.57 ± 31.72 | 4126.07 ± 16.34 | 4104.49 ± 16.34 | 4163.82 ± 16.95 |
| openai/gpt-oss-20b | ctx_tg @ d32768 | 79.32 ± 0.23 | | | |
| openai/gpt-oss-20b | pp2048 @ d32768 | 5140.71 ± 103.27 | 420.12 ± 7.89 | 398.55 ± 7.89 | 459.16 ± 7.46 |
| openai/gpt-oss-20b | tg32 @ d32768 | 78.90 ± 0.11 | | | |
| openai/gpt-oss-20b | ctx_pp @ d65535 | 5765.56 ± 10.23 | 11388.24 ± 20.15 | 11366.67 ± 20.15 | 11431.25 ± 20.06 |
| openai/gpt-oss-20b | ctx_tg @ d65535 | 70.00 ± 0.18 | | | |
| openai/gpt-oss-20b | pp2048 @ d65535 | 3229.81 ± 56.46 | 655.86 ± 10.95 | 634.29 ± 10.95 | 699.18 ± 10.28 |
| openai/gpt-oss-20b | tg32 @ d65535 | 69.66 ± 0.02 | | | |
| openai/gpt-oss-20b | ctx_pp @ d100000 | 4433.91 ± 10.68 | 22575.16 ± 54.30 | 22553.59 ± 54.30 | 22623.51 ± 53.72 |
| openai/gpt-oss-20b | ctx_tg @ d100000 | 62.46 ± 0.15 | | | |
| openai/gpt-oss-20b | pp2048 @ d100000 | 2406.26 ± 16.17 | 872.73 ± 5.72 | 851.15 ± 5.72 | 921.24 ± 5.95 |
| openai/gpt-oss-20b | tg32 @ d100000 | 61.91 ± 0.32 | | | |

llama-benchy (0.1.1)
date: 2026-01-31 13:56:34 | latency mode: generation


Am I missing something?

Running this script just gives me

docker: Error response from daemon: unknown or invalid runtime name: nvidia

It uses the old NVIDIA runtime flag. Just replace --runtime=nvidia with --gpus=all


@brian322 please follow the instructions here to install the NVIDIA Container Toolkit:

Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit (docs.nvidia.com)

and set the nvidia runtime as the default in the Docker daemon configuration file.
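For reference, a sketch of how that configuration step looks, assuming the toolkit is already installed (commands per my reading of the NVIDIA Container Toolkit docs; verify against the linked page for your setup):

```shell
# nvidia-ctk writes the nvidia runtime entry into /etc/docker/daemon.json;
# --set-as-default also makes it Docker's default runtime, so --runtime=nvidia
# (and plain docker run without --gpus) will work.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

# Verify the daemon now reports nvidia as the default runtime
docker info | grep -i 'default runtime'
```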

Hope it helps!!!