Nemotron Nano Omni on Thor?

I’ve been trying to get nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 to work on vllm 0.21.0
Whatever combination of options I try it generates tokens at 43 tok/s but all output is empty.
Requesting logprobs returns NaN in the response handling.

I was hoping that Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 might be a good fit for edge deployment on Thor. Vision to action in a low latency chain.

Any suggestions how to fix this? Your experiences?

 vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4  \
     --max-model-len {max_model_len} \
     --port {port} --host {host} \
     --trust-remote-code \
     --kv-cache-dtype fp8 \
     --dtype auto \
     --max_num_seqs 8 \
     --load-format fastsafetensors  \
     --max-num-batched-tokens 32768 \
     --enable-prefix-caching \
     --limit-mm-per-prompt '{{"video":1,"image":1,"audio":1}}' \
     --media-io-kwargs '{{"video":{{"num_frames":256,"fps":2}}}}' \
     --video-pruning-rate 0.5 \
     --reasoning-parser nemotron_v3  \
     --enable-auto-tool-choice \
     --tool-call-parser qwen3_coder \
     --gpu-memory-utilization {gpu_memory_utilization} \
     --generation-config vllm \
     --default-chat-template-kwargs '{{"enable_thinking": false}}'

References I found:

On Thor, following works to run vllm with Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4. If I had your values for the variables I’d try the posted command line. I first tried vllm v0.19.0; it failed.

For the first time ever started from a nearly empty venv. I used uv to compile vllm.

#If you don't have uv, you could install it.
curl -LsSf https://astral.sh/uv/install.sh | sh

uv venv vllm2 -p 3.12.3 --seed
source vllm2/bin/activate

cd vllm
git pull
git checkout v0.21.0

MAX_JOBS=10 uv build . --index https://pypi.org/simple --index https://download.pytorch.org/whl/cu130 -v -o dist

# Ate dinner.

uv pip install dist/vllm-0.21.0-cp312-cp312-linux_aarch64.whl

vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 8192 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.65 \
  --generation-config auto

Hi,

Have you tried the command shared in our tutorial below?
It deploy the model with llama.cpp instead of vllm:

# Serve command
sudo docker run -it --rm --pull always \
--runtime=nvidia --network host \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
llama-server \
--hf-repo ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
--hf-file nemotron-3-nano-omni-ga_v1.0-Q4_K_M.gguf \
--ctx-size 8192 \
--port 8080 \
--alias my_model \
--n-gpu-layers 999

Thanks.

This works.

vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --max-model-len 8192 \
  --port 8000 \
  --host 127.0.0.1 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 4096 \
  --limit-mm-per-prompt '{"video":1,"image":1,"audio":1}' \
  --media-io-kwargs '{"video":{"num_frames":256,"fps":2}}' \
  --video-pruning-rate 0.5 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.65 \
  --generation-config auto \
  --default-chat-template-kwargs '{"enable_thinking": false}'

@whitesscott Thank you. I’ll try it out.

By the look of it the biggest difference in your command is --generation-config auto
My configuration did start up but didn’t respond to requests.

@AastaLLL any reason you’re pointing me towards llama.cpp instead of vllm?
vllm is mentioned in the model card. So I thought that should work.

Here’s a draft gradio.app as a front end; still needs work
app.py.txt (10.9 KB)
.
to use: pip install gradio requests

Hi,

We verify the model with llama.cpp so it should work correctly.
Thanks.

Hi @ joost-de-v

Please try this for vllm 0.21 - Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 on Jetson AGX Thor:

sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash \
  vllm/vllm-openai:v0.21.0 \
  -c "pip install -q 'vllm[audio]' && vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
    --max-model-len 8192 --port 8000 --host 127.0.0.1 \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --max-num-seqs 8 --max-num-batched-tokens 4096 \
    --limit-mm-per-prompt '{\"video\":1,\"image\":1,\"audio\":1}' \
    --media-io-kwargs '{\"video\":{\"num_frames\":256,\"fps\":2}}' \
    --video-pruning-rate 0.5 \
    --reasoning-parser nemotron_v3 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder \
    --gpu-memory-utilization 0.45 \
    --generation-config auto \
    --default-chat-template-kwargs '{\"enable_thinking\": false}'"

@whitesscott @adsahu thank you. I tried your command. Same problem.
So I guess the issue is in my sm 11.0a vllm docker build.

Hi,

Do you use a local vLLM tool?
If yes, could you try the @adsha command with a container?

Thanks.

@AastaLLL @adsahu @whitesscott the problem was that I had to delete the flashinfer cache.
Which I realised after trying vllm/vllm-openai:v0.21.0 as you said. Thank you.

I have a vllm docker inspired by this community build.
So I thought that might be the issue. But no.

Happy to know that vllm/vllm-openai:v0.21.0 now supports Thor. The monthly release cadence of nvcr.io/nvidia/vllm just can’t keep up with the pace of models and of vllm. Same for Package vllm · GitHub