Thanks, so our issues with the official FP8 quant aren't related to the base model itself: it's either a bug in the corresponding FlashInfer kernels, or the FP8 checkpoint that's broken.
This one works just fine with my Docker, and even with pre-built nightly wheels.
It does seem to have some config issues: there are scales in the weights that aren't specified in the model config, so it spits out a few warnings, and FlashInfer throws a few errors. After that, though, it loads just fine and runs at 67 t/s, which is still slower than it should be for this size, but that's because NVFP4 support still isn't working well.
Now, interestingly enough, if you read the model description, they recommend using VLLM_USE_FLASHINFER_MOE_FP4=1 and VLLM_FLASHINFER_MOE_BACKEND=throughput, the latter being the opposite of your findings.
Looks like VLLM_USE_FLASHINFER_MOE_FP4=1 is the key here (NVIDIA recommends a similar flag for their FP8 model); without it, the model doesn't load at all.
Also, Nemotron uses its own reasoning parser - you need it for clients to receive proper thinking blocks instead of raw tags in the content.
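For context, all a reasoning parser really does is split the raw completion into a thinking block and the visible answer, so the client gets them as separate fields. A minimal sketch of the idea, assuming the model wraps its reasoning in `<think>...</think>` tags (this is NOT the actual nano_v3 parser, just an illustration of what it does):

```python
def split_reasoning(text: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> tuple[str, str]:
    """Split a raw completion into (reasoning, content).

    If no complete thinking block is found, treat everything
    as visible content and return an empty reasoning string.
    """
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return "", text
    reasoning = text[start + len(open_tag):end].strip()
    content = text[end + len(close_tag):].strip()
    return reasoning, content


raw = "<think>The user wants a greeting.</think>Hello!"
reasoning, content = split_reasoning(raw)
# reasoning -> "The user wants a greeting."
# content   -> "Hello!"
```

The real parser also has to handle streaming (tags arriving token by token), which is why you want the plugin rather than post-processing the text yourself.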
Here is what works on my builds:
Download the parser first:
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/resolve/main/nano_v3_reasoning_parser.py
Then run:
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve cybermotaz/nemotron3-nano-nvfp4-w4a16 \
--trust-remote-code \
--kv-cache-dtype fp8 \
--load-format fastsafetensors \
--gpu-memory-utilization 0.7 \
--host 0.0.0.0 --port 8888 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3
Bench for 1 request:
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 1.83
Total input tokens: 12
Total generated tokens: 123
Request throughput (req/s): 0.55
Output token throughput (tok/s): 67.16
Peak output token throughput (tok/s): 66.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 73.71
---------------Time to First Token----------------
Mean TTFT (ms): 51.97
Median TTFT (ms): 51.97
P99 TTFT (ms): 51.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.58
Median TPOT (ms): 14.58
P99 TPOT (ms): 14.58
---------------Inter-token Latency----------------
Mean ITL (ms): 14.58
Median ITL (ms): 14.51
P99 ITL (ms): 16.99
==================================================