Nemotron-3-Nano-30B-A3B-NVFP4: ultra-efficient NVFP4 precision version of Nemotron 3 Nano

Per NVIDIA: “We just launched an ultra-efficient NVFP4 precision version of Nemotron 3 Nano that delivers up to 4x higher throughput on Blackwell B200.

Using our new Quantization Aware Distillation method, the NVFP4 version achieves up to 99.4% accuracy of BF16.”

I’m wondering if anyone has experimented with this in vLLM yet, and what the performance looks like?


Don’t know - it doesn’t work with the NVIDIA container listed in the README (yes…I see the DGX Spark mentioned), and it doesn’t work with our Community Container either. Neither supports the SM 12.1 compute kernels required to use the Blackwell NVFP4 compute units.

Interesting, I’ll try later today. I wonder what is different about this one.

I get this error (tried with all variants of tags/builds):

NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.

It detects flashinfer-cutlass for NVFP4 GEMM but fails at MoE layer init. Seems like the NVFP4 MoE kernels don’t support GB10 + distributed yet.

Wonder if Chris’s build may address this…might try later.

Give it a shot, but there are some differences between NVFP4 and MXFP4. They are different ‘shortcuts’ for representing larger numbers - and I’ve only been playing with gpt-oss-120b and MXFP4.

I would suspect some changes to be necessary for it to work. But who knows, happy accidents can happen!
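To make the difference concrete, here’s a toy Python sketch of the two block-scaling schemes as I understand them: MXFP4 uses 32-element blocks with power-of-two (E8M0) scales, while NVFP4 uses 16-element blocks with FP8 (E4M3) scales; both store values in FP4 (E2M1). The sketch shrinks blocks to 4 elements and models the scale formats only by whether the scale is rounded up to a power of two — this is illustrative, not NVIDIA’s kernel code, and real NVFP4 also carries a second per-tensor FP32 scale, omitted here.

```python
import math

# E2M1 (FP4) representable magnitudes
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x, scale):
    """Round x/scale to the nearest representable E2M1 magnitude, keep the sign."""
    mag = min(FP4_E2M1, key=lambda v: abs(v - abs(x) / scale))
    return math.copysign(mag * scale, x)

def quantize_block(block, power_of_two_scale):
    """Pick one shared scale per block so its largest value maps to FP4 max (6.0)."""
    amax = max(abs(v) for v in block) or 1.0
    scale = amax / 6.0
    if power_of_two_scale:
        # MXFP4-style: the shared scale is E8M0, i.e. restricted to powers of two
        scale = 2.0 ** math.ceil(math.log2(scale))
    return [quantize_fp4(v, scale) for v in block]

data = [0.11, -0.52, 0.98, 2.7]
mx = quantize_block(data, power_of_two_scale=True)   # MXFP4-style block
nv = quantize_block(data, power_of_two_scale=False)  # NVFP4-style block
print(mx)  # the power-of-two scale overshoots the block's largest value
print(nv)  # the finer-grained scale recovers it almost exactly
```

The finer (non-power-of-two) scale is one reason the two formats need different kernels: the scale factors live in different encodings and the block sizes differ, so an MXFP4 fused-MoE path can’t simply be pointed at NVFP4 weights.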

No, it doesn’t work:

 RuntimeError: Check failed: (isWMxfp4AMxfp8Quant() || isWMxfp4AFp8Quant()) is false: FLASHINFER_FUSED_MOE_MXFP4_MINIMAL only supports MXFP4 (FP8xFP4) fused MoE.

Wild to me that even NVIDIA’s container doesn’t fully support it.

Par for the course I suppose.

Try to fix it :). It’s possible to do… we have AIs to help ;).

Yes, I’ve put that in my queue.

Fixing a few other things atm.

My frustration is really their documentation stating it works on the DGX when it in fact does not.

Some days I like to just experiment with models vs. hermit into debug mode.

ya know?


Totally… things are moving so quickly!

What was working weeks ago is now different! Part of the joy for me. It does make it a little hard to keep up though!

Thanks everyone for checking it out. They are supposedly releasing a Super model soon that will be 100B params (10B active), so it might be an interesting model to compare with gpt-oss-120b on the Sparks…

Disappointing is another word for it. When the head of NVIDIA sells a box on the strength of “It just works” and then it kind of just doesn’t. I know we’ll get there, but it’s very frustrating.

docker run --gpus all --ipc=host --ulimit memlock=-1 --name vllm --rm -it --network host --ulimit stack=67108864 \
  -e HF_TOKEN=$HF_TOKEN \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -v $HOME/nano_v3_reasoning_parser.py:/nano_v3_reasoning_parser.py:ro \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  bash -c 'vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --kv-cache-dtype fp8 --trust-remote-code --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser-plugin /nano_v3_reasoning_parser.py --reasoning-parser nano_v3 --port 8000 --host 0.0.0.0 --max-model-len 262144 --max-num-seqs 256 --max-cudagraph-capture-size 256'

This works on mine, getting decent performance:

(198 active requests from GPQA Diamond ~ 1350 tokens/s)

(APIServer pid=1) INFO 01-30 00:01:30 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.5 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-30 00:01:40 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.9 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-30 00:01:50 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.8 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%

Edit:

One thing I will say regarding the Nemotron 3 Nano model is that any speed benefit it has over gpt-oss-120b is offset by the number of tokens it needs to generate in order to think. I’m able to get ~730 tokens/s on the same test using gpt-oss-120b, but it only needs to generate roughly half the reasoning tokens and achieves a higher score. Looking forward to the 100B Nemotron model to see if it gains any ‘token efficiency’.
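Rough arithmetic behind that trade-off: what matters is wall-clock time per answer, i.e. reasoning tokens divided by decode throughput. The token counts below are made-up round numbers chosen only to reflect the ~2x ratio described above; the throughput figures are the ones measured in this thread.

```python
# Back-of-the-envelope: raw decode speed vs. reasoning-token budget.
def seconds_to_answer(tokens_per_sec, reasoning_tokens):
    return reasoning_tokens / tokens_per_sec

nemotron_s = seconds_to_answer(1365.0, 8000)  # ~2x the reasoning tokens (assumed)
gpt_oss_s = seconds_to_answer(730.0, 4000)    # half the tokens at half the speed

# Nemotron's ~1.9x raw-throughput edge is roughly cancelled out by the extra thinking
print(f"nemotron: {nemotron_s:.1f}s, gpt-oss-120b: {gpt_oss_s:.1f}s")
```

With these assumed budgets the two come out within about 10% of each other per answer, which is why raw tokens/s alone is a misleading comparison between reasoning models.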


The breaking changes were introduced in this commit: [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414) · vllm-project/vllm@42135d6 · GitHub

Hello,

I tested this on the RTX 5090, Pro 6000, DGX Spark, and Jetson Thor. It is a very good model for edge devices.

It works on both DGX Spark and Jetson Thor using the following NGC container:

sudo docker run -it --rm \
  --pull always \
  --runtime=nvidia \
  --network host \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  bash -c "wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py && \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8"

You can also join the NVIDIA Jetson AI Lab Discord channel here:

Thanks, I was able to get this working. Here’s the performance test:

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 | 11056.06 ± 120.52 | 240.87 ± 2.02 | 185.26 ± 2.02 | 240.98 ± 2.00 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 | 56.19 ± 0.18 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d4096 | 11707.71 ± 13.91 | 405.47 ± 0.42 | 349.86 ± 0.42 | 405.55 ± 0.40 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d4096 | 56.07 ± 0.05 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 5824.48 ± 18.54 | 407.23 ± 1.12 | 351.62 ± 1.12 | 407.33 ± 1.12 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d4096 | 55.89 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d8192 | 11189.35 ± 906.85 | 792.87 ± 63.38 | 737.26 ± 63.38 | 792.98 ± 63.38 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d8192 | 53.88 ± 2.61 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d8192 | 5732.80 ± 33.88 | 412.87 ± 2.12 | 357.26 ± 2.12 | 412.99 ± 2.11 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d8192 | 55.77 ± 0.08 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d16384 | 10684.28 ± 208.09 | 1589.67 ± 30.28 | 1534.06 ± 30.28 | 1589.81 ± 30.27 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d16384 | 55.28 ± 0.02 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16384 | 5618.24 ± 22.39 | 420.15 ± 1.46 | 364.53 ± 1.46 | 420.25 ± 1.45 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d16384 | 55.25 ± 0.07 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d32768 | 9548.60 ± 9.99 | 3487.32 ± 3.59 | 3431.71 ± 3.59 | 3487.48 ± 3.57 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d32768 | 54.56 ± 0.12 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32768 | 6032.25 ± 23.48 | 395.13 ± 1.32 | 339.51 ± 1.32 | 395.24 ± 1.33 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d32768 | 54.36 ± 0.01 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d65535 | 7696.52 ± 212.30 | 8577.12 ± 239.73 | 8521.50 ± 239.73 | 8577.24 ± 239.73 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d65535 | 52.93 ± 0.04 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d65535 | 4116.75 ± 32.49 | 553.12 ± 3.94 | 497.51 ± 3.94 | 553.25 ± 3.90 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d65535 | 52.77 ± 0.03 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d100000 | 6492.68 ± 42.68 | 15458.19 ± 101.16 | 15402.57 ± 101.16 | 15458.36 ± 101.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d100000 | 51.32 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d100000 | 1958.31 ± 15.52 | 1101.48 ± 8.31 | 1045.87 ± 8.31 | 1101.60 ± 8.31 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d100000 | 51.42 ± 0.06 | | | |

llama-benchy (0.1.1)
date: 2026-01-31 13:46:07 | latency mode: generation

For reference, this is what I’m getting with gpt-oss-20b using https://github.com/christopherowen/spark-vllm-mxfp4-docker

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| openai/gpt-oss-20b | pp2048 | 12445.04 ± 37.63 | 186.14 ± 0.50 | 164.57 ± 0.50 | 219.18 ± 0.73 |
| openai/gpt-oss-20b | tg32 | 91.20 ± 0.05 | | | |
| openai/gpt-oss-20b | ctx_pp @ d4096 | 12519.82 ± 29.40 | 348.74 ± 0.77 | 327.16 ± 0.77 | 382.78 ± 0.51 |
| openai/gpt-oss-20b | ctx_tg @ d4096 | 90.49 ± 0.19 | | | |
| openai/gpt-oss-20b | pp2048 @ d4096 | 10390.47 ± 39.12 | 218.68 ± 0.74 | 197.11 ± 0.74 | 252.93 ± 0.65 |
| openai/gpt-oss-20b | tg32 @ d4096 | 89.43 ± 0.23 | | | |
| openai/gpt-oss-20b | ctx_pp @ d8192 | 11328.31 ± 45.43 | 744.73 ± 2.90 | 723.16 ± 2.90 | 778.89 ± 3.10 |
| openai/gpt-oss-20b | ctx_tg @ d8192 | 88.60 ± 0.15 | | | |
| openai/gpt-oss-20b | pp2048 @ d8192 | 9013.26 ± 23.36 | 248.80 ± 0.59 | 227.22 ± 0.59 | 283.84 ± 0.71 |
| openai/gpt-oss-20b | tg32 @ d8192 | 87.57 ± 0.16 | | | |
| openai/gpt-oss-20b | ctx_pp @ d16384 | 9849.90 ± 39.10 | 1684.97 ± 6.59 | 1663.39 ± 6.59 | 1720.92 ± 6.72 |
| openai/gpt-oss-20b | ctx_tg @ d16384 | 85.17 ± 0.11 | | | |
| openai/gpt-oss-20b | pp2048 @ d16384 | 7250.57 ± 23.56 | 304.04 ± 0.92 | 282.46 ± 0.92 | 340.31 ± 0.50 |
| openai/gpt-oss-20b | tg32 @ d16384 | 84.37 ± 0.20 | | | |
| openai/gpt-oss-20b | ctx_pp @ d32768 | 7983.57 ± 31.72 | 4126.07 ± 16.34 | 4104.49 ± 16.34 | 4163.82 ± 16.95 |
| openai/gpt-oss-20b | ctx_tg @ d32768 | 79.32 ± 0.23 | | | |
| openai/gpt-oss-20b | pp2048 @ d32768 | 5140.71 ± 103.27 | 420.12 ± 7.89 | 398.55 ± 7.89 | 459.16 ± 7.46 |
| openai/gpt-oss-20b | tg32 @ d32768 | 78.90 ± 0.11 | | | |
| openai/gpt-oss-20b | ctx_pp @ d65535 | 5765.56 ± 10.23 | 11388.24 ± 20.15 | 11366.67 ± 20.15 | 11431.25 ± 20.06 |
| openai/gpt-oss-20b | ctx_tg @ d65535 | 70.00 ± 0.18 | | | |
| openai/gpt-oss-20b | pp2048 @ d65535 | 3229.81 ± 56.46 | 655.86 ± 10.95 | 634.29 ± 10.95 | 699.18 ± 10.28 |
| openai/gpt-oss-20b | tg32 @ d65535 | 69.66 ± 0.02 | | | |
| openai/gpt-oss-20b | ctx_pp @ d100000 | 4433.91 ± 10.68 | 22575.16 ± 54.30 | 22553.59 ± 54.30 | 22623.51 ± 53.72 |
| openai/gpt-oss-20b | ctx_tg @ d100000 | 62.46 ± 0.15 | | | |
| openai/gpt-oss-20b | pp2048 @ d100000 | 2406.26 ± 16.17 | 872.73 ± 5.72 | 851.15 ± 5.72 | 921.24 ± 5.95 |
| openai/gpt-oss-20b | tg32 @ d100000 | 61.91 ± 0.32 | | | |

llama-benchy (0.1.1)
date: 2026-01-31 13:56:34 | latency mode: generation


Am I missing something?

Running this script just gives me

docker: Error response from daemon: unknown or invalid runtime name: nvidia

It uses the old NVIDIA runtime flag. Just replace --runtime=nvidia with --gpus=all


@brian322 please follow the instructions here to install the NVIDIA Container Toolkit:

Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit (docs.nvidia.com)

and set the nvidia runtime as the default in the Docker daemon configuration file.
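For reference, a sketch of how that configuration step looks, assuming the toolkit is already installed (commands per my reading of the NVIDIA Container Toolkit docs; verify against the linked page for your setup):

```shell
# nvidia-ctk writes the nvidia runtime entry into /etc/docker/daemon.json;
# --set-as-default also makes it Docker's default runtime, so --runtime=nvidia
# (and plain docker run without --gpus) will work.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

# Verify the daemon now reports nvidia as the default runtime
docker info | grep -i 'default runtime'
```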

Hope it helps!!!