I’ve been trying to get nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 to work on vllm 0.21.0.
Whatever combination of options I try, it generates tokens at 43 tok/s, but all output is empty.
Requesting logprobs returns NaN in the response handling.
I was hoping that Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 might be a good fit for edge deployment on Thor: vision to action in a low-latency chain.
Any suggestions on how to fix this? What are your experiences?
On Thor, the following works to run vllm with Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4. If I had your values for the variables, I’d try the posted command line. I first tried vllm v0.19.0; it failed.
For the first time ever, I started from a nearly empty venv. I used uv to compile vllm.
# If you don't have uv, install it first.
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv vllm2 -p 3.12.3 --seed
source vllm2/bin/activate
# Assumes the vllm repository is already cloned in ./vllm.
cd vllm
git pull
git checkout v0.21.0
MAX_JOBS=10 uv build . --index https://pypi.org/simple --index https://download.pytorch.org/whl/cu130 -v -o dist
# Ate dinner.
uv pip install dist/vllm-0.21.0-cp312-cp312-linux_aarch64.whl
vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
--trust-remote-code \
--kv-cache-dtype fp8 \
--max-model-len 16384 \
--max-num-seqs 8 \
--max-num-batched-tokens 8192 \
--reasoning-parser nemotron_v3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.65 \
--generation-config auto
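Once the server is up, a quick request shows whether the reply text is actually empty. The sketch below is only an illustration and not part of the original post: it assumes the server listens on vLLM's default port 8000, that python3 is available for JSON parsing, and the helper names ask_server and check_reply are made up here. The prompt is pasted into the JSON as-is, so keep it free of quotes.

```shell
# Assumptions: vLLM on its default port 8000, python3 on PATH.
BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4"

# Send one short chat request and print the raw JSON response.
ask_server() {
  curl -sS "$BASE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"max_tokens\": 512,
         \"messages\": [{\"role\": \"user\", \"content\": \"$1\"}]}"
}

# Read a chat completion JSON from stdin; succeed only if some text came back.
check_reply() {
  python3 -c '
import json, sys
msg = json.load(sys.stdin)["choices"][0]["message"]
# The answer is in "content"; reasoning text, if any, sits in a separate field.
text = (msg.get("content") or "").strip()
print("content:", repr(text))
sys.exit(0 if text else 1)
'
}
```

Usage would be something like `ask_server "Say hello in one sentence" | check_reply`. A non-zero exit means the content came back empty; with a reasoning model that can also happen when max_tokens is used up by the reasoning part, so it may be worth raising the limit before concluding the output is broken.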
@AastaLLL @adsahu @whitesscott the problem was that I had to delete the flashinfer cache.
I realised this after trying vllm/vllm-openai:v0.21.0 as you suggested. Thank you.
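For anyone hitting the same thing, clearing the cache could look roughly like the sketch below. The thread does not say where the cache was; this assumes the FlashInfer JIT cache is in its usual place, ~/.cache/flashinfer, and the function name clear_flashinfer_cache is just made up for illustration. Inside a container the directory may live in the container's home or on a mounted volume instead, so pass the path explicitly in that case.

```shell
# Assumption: FlashInfer's JIT cache is at ~/.cache/flashinfer unless a path is given.
clear_flashinfer_cache() {
  local cache_dir="${1:-$HOME/.cache/flashinfer}"
  if [ -d "$cache_dir" ]; then
    du -sh "$cache_dir"        # show what is about to be removed
    rm -rf -- "$cache_dir"
    echo "removed $cache_dir"
  else
    echo "nothing to remove at $cache_dir"
  fi
}
```

Stop vllm first, run `clear_flashinfer_cache`, then start `vllm serve` again. Expect the first start afterwards to take longer while the kernels are rebuilt.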
I have a vllm docker inspired by this community build.
So I thought that might be the issue, but no.
Happy to know that vllm/vllm-openai:v0.21.0 now supports Thor. The monthly release cadence of nvcr.io/nvidia/vllm just can’t keep up with the pace of new models and of vllm itself. The same goes for the vllm package on GitHub.