Nemotron-3-Nano-30B-A3B-NVFP4: an ultra-efficient NVFP4 precision version of Nemotron 3 Nano

Done.

To use it, add --apply-mod mods/nemotron-nano to the ./launch-cluster.sh arguments.

For example, to run nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 on a single node:

./launch-cluster.sh --solo --apply-mod mods/nemotron-nano \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --max-num-seqs 8 \
    --tensor-parallel-size 1 \
    --max-model-len 262144 \
    --port 8888 --host 0.0.0.0 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --load-format fastsafetensors 
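Once the server is up, the --reasoning-parser nano_v3 flag makes vLLM split the model's reasoning from its final answer in each chat-completions response. A minimal sketch of reading both fields, assuming vLLM's usual `reasoning_content` field name (the response dict below is mocked for illustration, not taken from a live server):

```python
# Mocked chat-completions response in the shape vLLM's OpenAI-compatible
# server returns when a reasoning parser is enabled (field names assumed).
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "The user asked for a greeting, so reply politely.",
            "content": "Hello! How can I help you today?",
        }
    }]
}

message = response["choices"][0]["message"]
reasoning = message.get("reasoning_content")  # chain-of-thought, split out by the parser
answer = message["content"]                   # final answer shown to the user

print(answer)
```

Clients that don't know about `reasoning_content` simply ignore it, so the same endpoint works for both plain chat and reasoning-aware UIs.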

Thank you!

Model runs fine and is extremely fast; the only thing is it keeps repeating itself in OpenCode and fails at tool calling.

BTW, I’ve just added a new mod that lets the cluster launch script work with the NVIDIA NGC vLLM container, or any other vLLM container that includes InfiniBand libraries and Ray support.

To use it, add --apply-mod mods/use-ngc-vllm to the ./launch-cluster.sh arguments. It can be combined with other mods.
For example, to launch Nemotron Nano on the cluster using the NGC container, you can use the following command:

./launch-cluster.sh \
   -t nvcr.io/nvidia/vllm:26.01-py3 \
   --apply-mod mods/use-ngc-vllm \
   --apply-mod mods/nemotron-nano \
   -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
   -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
   exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
       --max-model-len 262144 \
       --port 8888 --host 0.0.0.0 \
       --trust-remote-code \
       --enable-auto-tool-choice \
       --tool-call-parser qwen3_coder \
       --reasoning-parser-plugin nano_v3_reasoning_parser.py \
       --reasoning-parser nano_v3 \
       --kv-cache-dtype fp8 \
       --gpu-memory-utilization 0.7 \
       --tensor-parallel-size 2 \
       --distributed-executor-backend ray

Make sure you have the container pulled on both nodes and pull the latest changes from my repo!

At this point it doesn’t seem like the NGC container performs any better for this model than a custom build, but it can be useful in some cases.


Forgot to mention that if you run in --solo mode, you don’t need to apply mods/use-ngc-vllm; it will work without it. The NGC mod is for cluster use only. You can still apply other mods, like the Nemotron one, though.

Could you try the same with the NVFP4 or FP8 version?

I would like to know if it’s due to the AWQ quantization. Maybe llm-compressor needs a better recipe.

I tested the tool calling section of NVIDIA’s example notebook:

Works as expected.

Okay, the user wants to calculate a 15% tip on a $50 bill. Let me check the tools available. There's a function called calculate_tip that takes bill_total and tip_percentage. The parameters are integers, so I need to pass 50 and 15. I should make sure the function is called correctly with those values. The user didn't mention any other details, so I don't need to ask for more info. Just plug those numbers into the tool and return the result.

[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-86c8013efccb79be', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function')]
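For reference, the tool side of that exchange can be sketched like this: declare a calculate_tip schema, then parse the JSON arguments string the server returns for the tool call. The schema and helper below are my own illustration of the pattern, not NVIDIA’s exact notebook code:

```python
import json

# Hypothetical tool schema matching the calculate_tip call shown above.
tools = [{
    "type": "function",
    "function": {
        "name": "calculate_tip",
        "description": "Calculate the tip for a bill.",
        "parameters": {
            "type": "object",
            "properties": {
                "bill_total": {"type": "integer"},
                "tip_percentage": {"type": "integer"},
            },
            "required": ["bill_total", "tip_percentage"],
        },
    },
}]

def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Local implementation the client runs when the model requests the tool."""
    return bill_total * tip_percentage / 100

# The model returns arguments as a JSON string, exactly as in the tool call above.
args = json.loads('{"bill_total": 50, "tip_percentage": 15}')
tip = calculate_tip(**args)
print(tip)  # 7.5
```

The client then sends the result back in a "tool" role message so the model can compose its final answer.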

Maybe you have to adjust temperature and top_p as recommended by the model card:

temperature=1.0 and top_p=1.0 are recommended for reasoning tasks, while temperature=0.6 and top_p=0.95 are recommended for tool calling.


Wait, I didn’t know that was a thing. Learned something again!
Until now I was just copying the commands to get it running; I’ll try with those parameters!

Also is it possible to change those params midway?
Say I want it to reason some implementation plan, and then after that let it go and do its thing.

Do I have to rerun the model each time?

Temp, top_p and top_k can be set by the client per-request.

Look into this for your client. Assuming I’m understanding your question: for different categories of requests (e.g., tool calling vs. reasoning) these can be set appropriately without changing the served model.
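For example, a client can carry the model card’s recommended settings as per-request sampling parameters; a sketch that just assembles the chat-completions payloads (an actual call would POST these to the server’s /v1/chat/completions endpoint):

```python
# Per-request sampling settings, following the model card recommendations
# quoted above; they ride along with each request, no server restart needed.
REASONING = {"temperature": 1.0, "top_p": 1.0}
TOOL_CALLING = {"temperature": 0.6, "top_p": 0.95}

def build_request(messages: list, mode: str) -> dict:
    """Assemble a chat-completions payload with mode-specific sampling."""
    params = REASONING if mode == "reasoning" else TOOL_CALLING
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
        "messages": messages,
        **params,
    }

plan = build_request([{"role": "user", "content": "Draft a plan."}], "reasoning")
act = build_request([{"role": "user", "content": "Run the plan."}], "tool_calling")
print(plan["temperature"], act["temperature"])  # 1.0 0.6
```

So a planning turn and a tool-calling turn can hit the same served model with different sampling, which covers the "reason first, then act" flow asked about above.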

That’s exactly what I was looking for. Thank you.

I had a look at the OpenCode docs; apparently you can create/edit agents and configure the params.

Using your latest repo with the latest wheels and having no luck; it maybe lasts a prompt or two, then crashes.

Going to assume this is just the nature of NVFP4 and not attempt any further. My flags:

./launch-cluster.sh \
  -t nvcr.io/nvidia/vllm:26.01-py3 \
  --apply-mod mods/use-ngc-vllm \
  --apply-mod mods/nemotron-nano \
  --nodes xxx.xxx.xx.xxx,xxx.xxx.xx.xx \
  --ib-if rocep1s0f0,roceP2p1s0f0 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --max-model-len 262144 \
    --port 8000 --host 0.0.0.0 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray

(tried custom too)

You’re the fourth or fifth report I’ve seen of this model crashing under load. Honestly, I don’t know if it’s a vLLM issue or an NVIDIA issue, but with TensorRT-LLM I’m running 10x slower and not crashing.

I suspect it won’t crash if you use --enforce-eager, but it will be slow.

Most likely it will, but it’s not worth it at this point.

Thanks!

Well, good news: it seems to be more stable with my latest build (I’m using my pytorch-base branch; I think it’s good enough to merge into main now). I successfully ran the bench up to 200K context:

| model | test | t/s | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 | 4231.32 ± 2175.79 | | | 832.16 ± 667.62 | 827.17 ± 667.62 | 832.25 ± 667.63 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 | 59.11 ± 0.85 | 61.03 ± 0.87 | 61.03 ± 0.87 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d4096 | 5195.40 ± 3324.75 | | | 3083.73 ± 3585.09 | 3078.75 ± 3585.09 | 3083.85 ± 3585.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d4096 | 58.42 ± 0.40 | 60.32 ± 0.41 | 60.32 ± 0.41 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 2707.10 ± 44.78 | | | 761.72 ± 12.58 | 756.74 ± 12.58 | 761.83 ± 12.58 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d4096 | 58.29 ± 0.12 | 60.20 ± 0.13 | 60.20 ± 0.13 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d8192 | 7076.94 ± 1182.46 | | | 1200.30 ± 226.22 | 1195.31 ± 226.22 | 1200.42 ± 226.18 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d8192 | 56.70 ± 1.86 | 58.55 ± 1.92 | 58.55 ± 1.92 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d8192 | 1606.96 ± 31.06 | | | 1279.91 ± 24.38 | 1274.92 ± 24.38 | 1280.06 ± 24.40 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d8192 | 56.39 ± 2.34 | 58.23 ± 2.41 | 58.23 ± 2.41 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d16384 | 7893.42 ± 48.36 | | | 2080.72 ± 12.70 | 2075.73 ± 12.70 | 2080.89 ± 12.63 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d16384 | 58.62 ± 1.16 | 60.53 ± 1.20 | 60.53 ± 1.20 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16384 | 869.74 ± 10.27 | | | 2360.04 ± 27.97 | 2355.06 ± 27.97 | 2360.12 ± 27.97 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d16384 | 60.48 ± 3.21 | 62.46 ± 3.31 | 62.46 ± 3.31 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d32768 | 7304.12 ± 31.62 | | | 4491.30 ± 19.48 | 4486.32 ± 19.48 | 4491.38 ± 19.49 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d32768 | 56.70 ± 2.90 | 58.54 ± 3.00 | 58.54 ± 3.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32768 | 425.43 ± 3.00 | | | 4819.15 ± 34.10 | 4814.17 ± 34.10 | 4819.25 ± 34.08 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d32768 | 56.72 ± 0.10 | 58.57 ± 0.10 | 58.57 ± 0.10 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d65535 | 6300.77 ± 40.69 | | | 10405.72 ± 67.16 | 10401.54 ± 67.16 | 10405.97 ± 67.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d65535 | 54.95 ± 0.38 | 56.76 ± 0.37 | 56.76 ± 0.37 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d65535 | 188.72 ± 0.95 | | | 10856.45 ± 55.10 | 10852.27 ± 55.10 | 10856.53 ± 55.09 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d65535 | 55.77 ± 3.61 | 57.58 ± 3.73 | 57.58 ± 3.73 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d100000 | 5353.06 ± 27.04 | | | 18685.58 ± 94.70 | 18681.39 ± 94.70 | 18685.70 ± 94.71 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d100000 | 53.91 ± 0.05 | 55.66 ± 0.05 | 55.66 ± 0.05 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d100000 | 105.69 ± 1.08 | | | 19383.78 ± 198.74 | 19379.60 ± 198.74 | 19383.97 ± 198.81 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d100000 | 53.82 ± 0.13 | 55.57 ± 0.14 | 55.57 ± 0.14 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d200000 | 3707.86 ± 51.20 | | | 53953.87 ± 740.02 | 53949.69 ± 740.02 | 53954.14 ± 740.02 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d200000 | 50.50 ± 0.06 | 52.17 ± 0.06 | 52.17 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d200000 | 37.44 ± 0.38 | | | 54715.42 ± 550.76 | 54711.23 ± 550.76 | 54715.84 ± 550.91 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d200000 | 50.41 ± 0.16 | 52.06 ± 0.17 | 52.06 ± 0.17 | | | |

llama-benchy (0.3.0)
date: 2026-02-09 07:51:19 | latency mode: api


Awesome! What flags did you use?

./launch-cluster.sh --solo \
  --apply-mod mods/nemotron-nano \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4  \
     --max-num-seqs 8  \
     --tensor-parallel-size 1 \
     --max-model-len 262144 \
     --port 8888 \
     --trust-remote-code \
     --enable-auto-tool-choice  \
     --tool-call-parser qwen3_coder \
     --reasoning-parser-plugin nano_v3_reasoning_parser.py \
     --reasoning-parser nano_v3  \
     --kv-cache-dtype fp8  \
     --load-format fastsafetensors  \
     --gpu-memory-utilization 0.7

I’ll create a recipe for this one.

You may need to rebuild your container with the latest changes in the main branch though, since I just merged my pytorch branch into main and reverted my fastsafetensors patch, as the proper fix is now included in vLLM.

Awesome.

Yep! Rebuilt a few moments ago.

Oh, just noticed this is for solo. Just drop --solo for cluster?

Yes, though I haven’t tested on cluster yet; I’m doing that now. For cluster you will need to add -tp 2 --distributed-executor-backend ray.

Ah, duh, yes. Thanks