mmos
January 29, 2026, 3:05pm
Per NVIDIA: “We just launched an ultra-efficient NVFP4 precision version of Nemotron 3 Nano that delivers up to 4x higher throughput on Blackwell B200.
Using our new Quantization Aware Distillation method, the NVFP4 version achieves up to 99.4% accuracy of BF16.”
I’m wondering if anyone has experimented with this w/VLLM and what the performance looks like yet?
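For context on what NVFP4 does under the hood: public descriptions have it storing weights as 4-bit E2M1 values with a shared scale per 16-element block (FP8 E4M3 in the real format, plus a per-tensor scale). Here's a toy Python sketch of the idea — my own illustration, not NVIDIA's kernel code, with the block scale kept in full precision for brevity:

```python
# Toy sketch of NVFP4-style block quantization (illustration only, not
# NVIDIA's kernel code). Real NVFP4 stores the per-16-element block scale
# in FP8 (E4M3) plus a per-tensor FP32 scale; here the scale stays in
# full precision to keep the sketch short.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block):
    """Quantize one 16-element block to signed E2M1 values plus a scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0                      # map the block max onto FP4's max (6.0)
    codes = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

if __name__ == "__main__":
    import random
    random.seed(0)
    block = [random.gauss(0, 1) for _ in range(16)]
    codes, scale = quantize_block(block)
    recon = dequantize_block(codes, scale)
    mse = sum((a - b) ** 2 for a, b in zip(block, recon)) / len(block)
    print(f"mean squared error: {mse:.4f}")
```

The small block size is the point: the scale only has to cover 16 values at a time, so outliers in one block don't crush the precision of the rest of the tensor.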
3 Likes
Don’t know - it doesn’t work with the NVIDIA container listed in the README (yes… I see the DGX Spark mentioned), and it doesn’t work with our Community Container either. Neither supports the SM12.1 compute kernels required to use the Blackwell NVFP4 compute units.
eugr
January 29, 2026, 5:38pm
Interesting, I’ll try later today. I wonder what is different about this one.
I get this error (tried with all variants of tags/builds):
NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.
It detects flashinfer-cutlass for NVFP4 GEMM but fails at MoE layer init. Seems like the NVFP4 MoE kernels don’t support GB10 + distributed yet.
I wonder if Chris’s build might address this… might try later.
Give it a shot, but there are some differences between NVFP4 and MXFP4 - they are different ‘shortcuts’ for representing larger numbers in four bits, and I’ve only been playing with gpt-oss-120b and MXFP4.
I would suspect some changes will be necessary for it to work. But who knows, happy accidents can happen!
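To make the "different shortcuts" concrete: per the OCP Microscaling spec, MXFP4 scales 32-element blocks by a power of two (E8M0), while NVIDIA describes NVFP4 as scaling 16-element blocks by an FP8 (E4M3) value, which tracks each block's maximum more closely. A simplified sketch of the two scale encodings (rounding details and range clamping omitted):

```python
import math

# Both formats store FP4 (E2M1) values but scale them differently
# (simplified from the OCP MX spec and NVIDIA's NVFP4 description):
#   MXFP4: 32-element blocks, power-of-two (E8M0) block scale
#   NVFP4: 16-element blocks, FP8 (E4M3) block scale -> finer-grained

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # FP4 magnitudes

def nearest_fp4(x, scale):
    """Round x to the nearest representable FP4 value under the given scale."""
    mag = min(E2M1, key=lambda g: abs(abs(x) / scale - g))
    return math.copysign(mag, x) * scale

def mxfp4_scale(amax):
    # E8M0: only powers of two are representable (FP4 max exponent is 2)
    return 2.0 ** (math.floor(math.log2(amax)) - 2) if amax else 1.0

def e4m3_round(x):
    # crude FP8 E4M3 rounding: keep 3 mantissa bits (subnormals/limits ignored)
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    return round(x / 2.0 ** e * 8) / 8 * 2.0 ** e

def nvfp4_scale(amax):
    return e4m3_round(amax / 6.0) if amax else 1.0

def block_error(block, scale):
    return sum((x - nearest_fp4(x, scale)) ** 2 for x in block) / len(block)

if __name__ == "__main__":
    import random
    random.seed(1)
    block = [random.gauss(0, 1) for _ in range(16)]
    amax = max(abs(x) for x in block)
    print("mxfp4-style error:", block_error(block, mxfp4_scale(amax)))
    print("nvfp4-style error:", block_error(block, nvfp4_scale(amax)))
```

The practical consequence for kernels is that the two formats need different fused-MoE code paths, which is presumably why an MXFP4-only build raises the error above when fed NVFP4 weights.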
eugr
January 29, 2026, 8:58pm
No, it doesn’t work:
RuntimeError: Check failed: (isWMxfp4AMxfp8Quant() || isWMxfp4AFp8Quant()) is false: FLASHINFER_FUSED_MOE_MXFP4_MINIMAL only supports MXFP4 (FP8xFP4) fused MoE.
Wild to me that even Nvidia’s container doesn’t fully support it.
Par for the course I suppose.
Try to fix it :). It’s possible to do… we have AIs to help ;).
Balaxxe
January 29, 2026, 9:15pm
Yes, I’ve put that in my queue.
Fixing a few other things atm.
My frustration is really their documentation stating it works on the DGX when in fact it does not.
Some days I like to just experiment with models vs. hermiting into debug mode.
ya know?
1 Like
Balaxxe:
ya know?
Totally… things are moving so quickly!
What was working weeks ago is now different! Part of the joy for me. It does make it a little hard to keep up though!
mmos
January 29, 2026, 9:24pm
Thanks everyone for checking it out. They are supposedly releasing a Super model soon that will be 100B params (10B active), so it might be an interesting model to compare with gpt-oss-120b on the Sparks…
Disappointing is another word for it. The head of NVIDIA sells a box on the strength of “It just works,” and then it kind of just doesn’t. I know we’ll get there, but it’s very frustrating.
1 Like
docker run --gpus all --ipc=host --ulimit memlock=-1 --name vllm --rm -it --network host --ulimit stack=67108864 \
-e HF_TOKEN=$HF_TOKEN \
-e VLLM_FLASHINFER_MOE_BACKEND=throughput \
-e VLLM_USE_FLASHINFER_MOE_FP4=1 \
-v $HOME/nano_v3_reasoning_parser.py:/nano_v3_reasoning_parser.py:ro \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
bash -c 'vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --kv-cache-dtype fp8 --trust-remote-code --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser-plugin /nano_v3_reasoning_parser.py --reasoning-parser nano_v3 --port 8000 --host 0.0.0.0 --max-model-len 262144 --max-num-seqs 256 --max-cudagraph-capture-size 256'
This works on mine, getting decent performance:
(198 active requests from GPQA Diamond ~ 1350 tokens/s)
(APIServer pid=1) INFO 01-30 00:01:30 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.5 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-30 00:01:40 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.9 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-30 00:01:50 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1365.8 tokens/s, Running: 198 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 0.0%
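Worth noting that the throughput in those log lines is the aggregate across all running requests; per-stream speed is much lower:

```python
# The vLLM log reports aggregate generation throughput across all concurrent
# requests; per-stream speed is roughly the aggregate divided by batch size.
aggregate_tps = 1365.5   # tokens/s from the log above
running_reqs = 198
per_stream = aggregate_tps / running_reqs
print(f"~{per_stream:.1f} tokens/s per request")   # ≈ 6.9 tokens/s each
```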
Edit:
One thing I will say regarding the Nemotron 3 Nano model is that any speed benefit it has over gpt-oss-120b is offset by the number of tokens it needs to generate in order to think. I’m able to get ~730 tokens/s in the same test using gpt-oss-120b, but it only needs to generate roughly half as many reasoning tokens and achieves a higher score. Looking forward to the 100B Nemotron model to see if it gains any ‘token efficiency’.
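A quick back-of-the-envelope using the numbers above (the ~2x reasoning-token ratio is an estimate from this one run, not a measured constant):

```python
# Rough wall-clock comparison per answer, using the throughputs quoted above
# and assuming Nemotron emits ~2x the reasoning tokens (poster's estimate).
nemotron_tps, gptoss_tps = 1350.0, 730.0
reasoning_ratio = 2.0

# time per answer is proportional to tokens generated / throughput
nemotron_time = reasoning_ratio / nemotron_tps
gptoss_time = 1.0 / gptoss_tps
print(f"time ratio (nemotron / gpt-oss): {nemotron_time / gptoss_time:.2f}")
```

With those assumptions the two models land within roughly 10% of each other in wall-clock time per answer despite the large gap in raw throughput, which is consistent with the "speed benefit is offset" observation.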
2 Likes
eugr
January 30, 2026, 12:54am
Hello,
I tested this on the RTX 5090, Pro 6000, DGX Spark, and Jetson Thor. It is a very good model for edge devices.
It works on both DGX Spark and Jetson Thor using the following NGC container:
sudo docker run -it --rm \
--pull always \
--runtime=nvidia \
--network host \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_USE_FLASHINFER_MOE_FP4=1 \
-e VLLM_FLASHINFER_MOE_BACKEND=throughput \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
bash -c "wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py && \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3 \
--kv-cache-dtype fp8"
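Once the container is up, the server can be exercised through vLLM's OpenAI-compatible API (port 8000 is vLLM's default; the model name matches the repo id being served). A minimal stdlib-only client sketch — the actual POST is commented out so this runs even without the server:

```python
import json
from urllib import request

# Minimal client sketch for the server started above. vLLM exposes an
# OpenAI-compatible API on port 8000 by default; the model field must
# match the repo id passed to `vllm serve`.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4"

def build_chat_request(prompt: str, max_tokens: int = 256) -> request.Request:
    """Build (but do not send) a chat-completions POST request."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return request.Request(API_URL, data=body,
                           headers={"Content-Type": "application/json"})

if __name__ == "__main__":
    req = build_chat_request("Say hello in one word.")
    print(req.full_url)
    # Uncomment once the container is running:
    # with request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
```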
You can also join the NVIDIA Jetson AI Lab Discord channel here:
mmos
January 31, 2026, 2:01pm
Thanks, I was able to get this working. Here’s the performance test:
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 | 11056.06 ± 120.52 | 240.87 ± 2.02 | 185.26 ± 2.02 | 240.98 ± 2.00 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 | 56.19 ± 0.18 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d4096 | 11707.71 ± 13.91 | 405.47 ± 0.42 | 349.86 ± 0.42 | 405.55 ± 0.40 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d4096 | 56.07 ± 0.05 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 5824.48 ± 18.54 | 407.23 ± 1.12 | 351.62 ± 1.12 | 407.33 ± 1.12 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d4096 | 55.89 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d8192 | 11189.35 ± 906.85 | 792.87 ± 63.38 | 737.26 ± 63.38 | 792.98 ± 63.38 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d8192 | 53.88 ± 2.61 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d8192 | 5732.80 ± 33.88 | 412.87 ± 2.12 | 357.26 ± 2.12 | 412.99 ± 2.11 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d8192 | 55.77 ± 0.08 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d16384 | 10684.28 ± 208.09 | 1589.67 ± 30.28 | 1534.06 ± 30.28 | 1589.81 ± 30.27 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d16384 | 55.28 ± 0.02 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16384 | 5618.24 ± 22.39 | 420.15 ± 1.46 | 364.53 ± 1.46 | 420.25 ± 1.45 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d16384 | 55.25 ± 0.07 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d32768 | 9548.60 ± 9.99 | 3487.32 ± 3.59 | 3431.71 ± 3.59 | 3487.48 ± 3.57 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d32768 | 54.56 ± 0.12 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32768 | 6032.25 ± 23.48 | 395.13 ± 1.32 | 339.51 ± 1.32 | 395.24 ± 1.33 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d32768 | 54.36 ± 0.01 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d65535 | 7696.52 ± 212.30 | 8577.12 ± 239.73 | 8521.50 ± 239.73 | 8577.24 ± 239.73 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d65535 | 52.93 ± 0.04 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d65535 | 4116.75 ± 32.49 | 553.12 ± 3.94 | 497.51 ± 3.94 | 553.25 ± 3.90 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d65535 | 52.77 ± 0.03 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d100000 | 6492.68 ± 42.68 | 15458.19 ± 101.16 | 15402.57 ± 101.16 | 15458.36 ± 101.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d100000 | 51.32 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d100000 | 1958.31 ± 15.52 | 1101.48 ± 8.31 | 1045.87 ± 8.31 | 1101.60 ± 8.31 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d100000 | 51.42 ± 0.06 | | | |
llama-benchy (0.1.1)
date: 2026-01-31 13:46:07 | latency mode: generation
For reference, this is what I’m getting with gpt-oss-20b using https://github.com/christopherowen/spark-vllm-mxfp4-docker
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| openai/gpt-oss-20b | pp2048 | 12445.04 ± 37.63 | 186.14 ± 0.50 | 164.57 ± 0.50 | 219.18 ± 0.73 |
| openai/gpt-oss-20b | tg32 | 91.20 ± 0.05 | | | |
| openai/gpt-oss-20b | ctx_pp @ d4096 | 12519.82 ± 29.40 | 348.74 ± 0.77 | 327.16 ± 0.77 | 382.78 ± 0.51 |
| openai/gpt-oss-20b | ctx_tg @ d4096 | 90.49 ± 0.19 | | | |
| openai/gpt-oss-20b | pp2048 @ d4096 | 10390.47 ± 39.12 | 218.68 ± 0.74 | 197.11 ± 0.74 | 252.93 ± 0.65 |
| openai/gpt-oss-20b | tg32 @ d4096 | 89.43 ± 0.23 | | | |
| openai/gpt-oss-20b | ctx_pp @ d8192 | 11328.31 ± 45.43 | 744.73 ± 2.90 | 723.16 ± 2.90 | 778.89 ± 3.10 |
| openai/gpt-oss-20b | ctx_tg @ d8192 | 88.60 ± 0.15 | | | |
| openai/gpt-oss-20b | pp2048 @ d8192 | 9013.26 ± 23.36 | 248.80 ± 0.59 | 227.22 ± 0.59 | 283.84 ± 0.71 |
| openai/gpt-oss-20b | tg32 @ d8192 | 87.57 ± 0.16 | | | |
| openai/gpt-oss-20b | ctx_pp @ d16384 | 9849.90 ± 39.10 | 1684.97 ± 6.59 | 1663.39 ± 6.59 | 1720.92 ± 6.72 |
| openai/gpt-oss-20b | ctx_tg @ d16384 | 85.17 ± 0.11 | | | |
| openai/gpt-oss-20b | pp2048 @ d16384 | 7250.57 ± 23.56 | 304.04 ± 0.92 | 282.46 ± 0.92 | 340.31 ± 0.50 |
| openai/gpt-oss-20b | tg32 @ d16384 | 84.37 ± 0.20 | | | |
| openai/gpt-oss-20b | ctx_pp @ d32768 | 7983.57 ± 31.72 | 4126.07 ± 16.34 | 4104.49 ± 16.34 | 4163.82 ± 16.95 |
| openai/gpt-oss-20b | ctx_tg @ d32768 | 79.32 ± 0.23 | | | |
| openai/gpt-oss-20b | pp2048 @ d32768 | 5140.71 ± 103.27 | 420.12 ± 7.89 | 398.55 ± 7.89 | 459.16 ± 7.46 |
| openai/gpt-oss-20b | tg32 @ d32768 | 78.90 ± 0.11 | | | |
| openai/gpt-oss-20b | ctx_pp @ d65535 | 5765.56 ± 10.23 | 11388.24 ± 20.15 | 11366.67 ± 20.15 | 11431.25 ± 20.06 |
| openai/gpt-oss-20b | ctx_tg @ d65535 | 70.00 ± 0.18 | | | |
| openai/gpt-oss-20b | pp2048 @ d65535 | 3229.81 ± 56.46 | 655.86 ± 10.95 | 634.29 ± 10.95 | 699.18 ± 10.28 |
| openai/gpt-oss-20b | tg32 @ d65535 | 69.66 ± 0.02 | | | |
| openai/gpt-oss-20b | ctx_pp @ d100000 | 4433.91 ± 10.68 | 22575.16 ± 54.30 | 22553.59 ± 54.30 | 22623.51 ± 53.72 |
| openai/gpt-oss-20b | ctx_tg @ d100000 | 62.46 ± 0.15 | | | |
| openai/gpt-oss-20b | pp2048 @ d100000 | 2406.26 ± 16.17 | 872.73 ± 5.72 | 851.15 ± 5.72 | 921.24 ± 5.95 |
| openai/gpt-oss-20b | tg32 @ d100000 | 61.91 ± 0.32 | | | |
llama-benchy (0.1.1)
date: 2026-01-31 13:56:34 | latency mode: generation
2 Likes
Am I missing something?
Running this script just gives me
docker: Error response from daemon: unknown or invalid runtime name: nvidia
eugr
February 1, 2026, 12:21am
It uses the old NVIDIA runtime format. Just replace --runtime=nvidia with --gpus=all
1 Like
@brian322 please follow the instructions here to install the NVIDIA Container Toolkit:
docs.nvidia.com
and set the nvidia runtime as the default in the Docker daemon configuration file.
Hope it helps!