Nemotron-3-Nano-30B-A3B-NVFP4: an ultra-efficient NVFP4 precision version of Nemotron 3 Nano

Done.

To use it, add --apply-mod mods/nemotron-nano to the ./launch-cluster.sh arguments.

For example, to run nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 on a single node:

./launch-cluster.sh --solo --apply-mod mods/nemotron-nano \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --max-num-seqs 8 \
    --tensor-parallel-size 1 \
    --max-model-len 262144 \
    --port 8888 --host 0.0.0.0 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --load-format fastsafetensors 
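Once the server is up, the --reasoning-parser nano_v3 flag makes vLLM split the model's reasoning from its final answer in each chat-completions response. A minimal sketch of reading both fields, assuming vLLM's usual `reasoning_content` field name (the response dict below is mocked for illustration, not taken from a live server):

```python
# Mocked chat-completions response in the shape vLLM's OpenAI-compatible
# server returns when a reasoning parser is enabled (field names assumed).
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "The user asked for a greeting, so reply politely.",
            "content": "Hello! How can I help you today?",
        }
    }]
}

message = response["choices"][0]["message"]
reasoning = message.get("reasoning_content")  # chain-of-thought, split out by the parser
answer = message["content"]                   # final answer shown to the user

print(answer)
```

Clients that don't know about `reasoning_content` simply ignore it, so the same endpoint works for both plain chat and reasoning-aware UIs.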

Thank you!

Model runs fine and is extremely fast; the only thing is it keeps repeating itself in OpenCode and fails at tool calling.

BTW, I’ve just added a new mod that lets the cluster launch script work with the NVIDIA NGC vLLM container, or any other vLLM container that includes InfiniBand libraries and Ray support.

To use it, add --apply-mod mods/use-ngc-vllm to the ./launch-cluster.sh arguments. It can be combined with other mods.
For example, to launch Nemotron Nano on the cluster using the NGC container, you can use the following command:

./launch-cluster.sh \
   -t nvcr.io/nvidia/vllm:26.01-py3 \
   --apply-mod mods/use-ngc-vllm \
   --apply-mod mods/nemotron-nano \
   -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
   -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
   exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
       --max-model-len 262144 \
       --port 8888 --host 0.0.0.0 \
       --trust-remote-code \
       --enable-auto-tool-choice \
       --tool-call-parser qwen3_coder \
       --reasoning-parser-plugin nano_v3_reasoning_parser.py \
       --reasoning-parser nano_v3 \
       --kv-cache-dtype fp8 \
       --gpu-memory-utilization 0.7 \
       --tensor-parallel-size 2 \
       --distributed-executor-backend ray

Make sure you have the container pulled on both nodes and pull the latest changes from my repo!

At this point it doesn’t seem like the NGC container performs any better for this model than a custom build, but it can be useful in some cases.


Forgot to mention that if you run in --solo mode, you don’t need to apply mods/use-ngc-vllm; it will work without it. The NGC mod is for cluster use only. You can still apply other mods, like the Nemotron one, though.

Could you try the same with the NVFP4 or FP8 version?

I would like to know if it’s due to the AWQ quantization. Maybe llm-compressor needs a better recipe.

I tested the tool calling section of NVIDIA’s example notebook:

Works as expected.

Okay, the user wants to calculate a 15% tip on a $50 bill. Let me check the tools available. There's a function called calculate_tip that takes bill_total and tip_percentage. The parameters are integers, so I need to pass 50 and 15. I should make sure the function is called correctly with those values. The user didn't mention any other details, so I don't need to ask for more info. Just plug those numbers into the tool and return the result.

[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-86c8013efccb79be', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function')]
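For reference, the tool side of that exchange can be sketched like this: declare a calculate_tip schema, then parse the JSON arguments string the server returns for the tool call. The schema and helper below are my own illustration of the pattern, not NVIDIA’s exact notebook code:

```python
import json

# Hypothetical tool schema matching the calculate_tip call shown above.
tools = [{
    "type": "function",
    "function": {
        "name": "calculate_tip",
        "description": "Calculate the tip for a bill.",
        "parameters": {
            "type": "object",
            "properties": {
                "bill_total": {"type": "integer"},
                "tip_percentage": {"type": "integer"},
            },
            "required": ["bill_total", "tip_percentage"],
        },
    },
}]

def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Local implementation the client runs when the model requests the tool."""
    return bill_total * tip_percentage / 100

# The model returns arguments as a JSON string, exactly as in the tool call above.
args = json.loads('{"bill_total": 50, "tip_percentage": 15}')
tip = calculate_tip(**args)
print(tip)  # 7.5
```

The client then sends the result back in a "tool" role message so the model can compose its final answer.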

Maybe you have to adjust temperature and top_p as recommended by the model card:

temperature=1.0 and top_p=1.0 are recommended for reasoning tasks, while temperature=0.6 and top_p=0.95 are recommended for tool calling.


Wait, I didn’t know that was a thing. Learned something again!
Until now I was just copying the commands to get it running; I’ll try with those parameters!

Also is it possible to change those params midway?
Say I want it to reason some implementation plan, and then after that let it go and do its thing.

Do I have to rerun the model each time?

Temp, top_p and top_k can be set by the client per-request.

Look into this for your client. Assuming I’m understanding your question: for different categories of requests (e.g., tool calling vs. reasoning) these can be set appropriately without changing the served model.
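For example, a client can carry the model card’s recommended settings as per-request sampling parameters; a sketch that just assembles the chat-completions payloads (an actual call would POST these to the server’s /v1/chat/completions endpoint):

```python
# Per-request sampling settings, following the model card recommendations
# quoted above; they ride along with each request, no server restart needed.
REASONING = {"temperature": 1.0, "top_p": 1.0}
TOOL_CALLING = {"temperature": 0.6, "top_p": 0.95}

def build_request(messages: list, mode: str) -> dict:
    """Assemble a chat-completions payload with mode-specific sampling."""
    params = REASONING if mode == "reasoning" else TOOL_CALLING
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
        "messages": messages,
        **params,
    }

plan = build_request([{"role": "user", "content": "Draft a plan."}], "reasoning")
act = build_request([{"role": "user", "content": "Run the plan."}], "tool_calling")
print(plan["temperature"], act["temperature"])  # 1.0 0.6
```

So a planning turn and a tool-calling turn can hit the same served model with different sampling, which covers the "reason first, then act" flow asked about above.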

That’s exactly what I was looking for. Thank you.

I had a look at the OpenCode docs; apparently you can create/edit agents and configure the params.

Using your latest repo with the latest wheels and having no luck; it maybe lasts a prompt or two, then crashes.

Going to assume this is just the nature of NVFP4 and not attempt any further. My flags:

./launch-cluster.sh \
  -t nvcr.io/nvidia/vllm:26.01-py3 \
  --apply-mod mods/use-ngc-vllm \
  --apply-mod mods/nemotron-nano \
  --nodes xxx.xxx.xx.xxx,xxx.xxx.xx.xx \
  --ib-if rocep1s0f0,roceP2p1s0f0 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --max-model-len 262144 \
    --port 8000 --host 0.0.0.0 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray

(tried custom too)

You’re the fourth or fifth report I’ve seen of this model crashing under load. Honestly, I don’t know if it’s a vLLM issue or an NVIDIA issue, but with TensorRT-LLM I’m running 10x slower and not crashing.

I suspect it won’t crash if you use --enforce-eager, but it will be slow.

Most likely it will, but it’s not worth it at this point.

Thanks!

Well, good news: it seems to be more stable with my latest build (I’m using my pytorch-base branch; I think it’s good enough to merge into main now). I successfully ran the bench up to 200K context:

| model | test | t/s | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 | 4231.32 ± 2175.79 | | | 832.16 ± 667.62 | 827.17 ± 667.62 | 832.25 ± 667.63 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 | 59.11 ± 0.85 | 61.03 ± 0.87 | 61.03 ± 0.87 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d4096 | 5195.40 ± 3324.75 | | | 3083.73 ± 3585.09 | 3078.75 ± 3585.09 | 3083.85 ± 3585.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d4096 | 58.42 ± 0.40 | 60.32 ± 0.41 | 60.32 ± 0.41 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 2707.10 ± 44.78 | | | 761.72 ± 12.58 | 756.74 ± 12.58 | 761.83 ± 12.58 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d4096 | 58.29 ± 0.12 | 60.20 ± 0.13 | 60.20 ± 0.13 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d8192 | 7076.94 ± 1182.46 | | | 1200.30 ± 226.22 | 1195.31 ± 226.22 | 1200.42 ± 226.18 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d8192 | 56.70 ± 1.86 | 58.55 ± 1.92 | 58.55 ± 1.92 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d8192 | 1606.96 ± 31.06 | | | 1279.91 ± 24.38 | 1274.92 ± 24.38 | 1280.06 ± 24.40 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d8192 | 56.39 ± 2.34 | 58.23 ± 2.41 | 58.23 ± 2.41 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d16384 | 7893.42 ± 48.36 | | | 2080.72 ± 12.70 | 2075.73 ± 12.70 | 2080.89 ± 12.63 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d16384 | 58.62 ± 1.16 | 60.53 ± 1.20 | 60.53 ± 1.20 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16384 | 869.74 ± 10.27 | | | 2360.04 ± 27.97 | 2355.06 ± 27.97 | 2360.12 ± 27.97 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d16384 | 60.48 ± 3.21 | 62.46 ± 3.31 | 62.46 ± 3.31 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d32768 | 7304.12 ± 31.62 | | | 4491.30 ± 19.48 | 4486.32 ± 19.48 | 4491.38 ± 19.49 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d32768 | 56.70 ± 2.90 | 58.54 ± 3.00 | 58.54 ± 3.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32768 | 425.43 ± 3.00 | | | 4819.15 ± 34.10 | 4814.17 ± 34.10 | 4819.25 ± 34.08 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d32768 | 56.72 ± 0.10 | 58.57 ± 0.10 | 58.57 ± 0.10 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d65535 | 6300.77 ± 40.69 | | | 10405.72 ± 67.16 | 10401.54 ± 67.16 | 10405.97 ± 67.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d65535 | 54.95 ± 0.38 | 56.76 ± 0.37 | 56.76 ± 0.37 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d65535 | 188.72 ± 0.95 | | | 10856.45 ± 55.10 | 10852.27 ± 55.10 | 10856.53 ± 55.09 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d65535 | 55.77 ± 3.61 | 57.58 ± 3.73 | 57.58 ± 3.73 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d100000 | 5353.06 ± 27.04 | | | 18685.58 ± 94.70 | 18681.39 ± 94.70 | 18685.70 ± 94.71 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d100000 | 53.91 ± 0.05 | 55.66 ± 0.05 | 55.66 ± 0.05 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d100000 | 105.69 ± 1.08 | | | 19383.78 ± 198.74 | 19379.60 ± 198.74 | 19383.97 ± 198.81 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d100000 | 53.82 ± 0.13 | 55.57 ± 0.14 | 55.57 ± 0.14 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d200000 | 3707.86 ± 51.20 | | | 53953.87 ± 740.02 | 53949.69 ± 740.02 | 53954.14 ± 740.02 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d200000 | 50.50 ± 0.06 | 52.17 ± 0.06 | 52.17 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d200000 | 37.44 ± 0.38 | | | 54715.42 ± 550.76 | 54711.23 ± 550.76 | 54715.84 ± 550.91 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d200000 | 50.41 ± 0.16 | 52.06 ± 0.17 | 52.06 ± 0.17 | | | |

llama-benchy (0.3.0)
date: 2026-02-09 07:51:19 | latency mode: api


Awesome! What flags did you use?

./launch-cluster.sh --solo \
  --apply-mod mods/nemotron-nano \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  exec vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4  \
     --max-num-seqs 8  \
     --tensor-parallel-size 1 \
     --max-model-len 262144 \
     --port 8888 \
     --trust-remote-code \
     --enable-auto-tool-choice  \
     --tool-call-parser qwen3_coder \
     --reasoning-parser-plugin nano_v3_reasoning_parser.py \
     --reasoning-parser nano_v3  \
     --kv-cache-dtype fp8  \
     --load-format fastsafetensors  \
     --gpu-memory-utilization 0.7

I’ll create a recipe for this one.

You may need to rebuild your container with the latest changes in the main branch though, since I just merged my pytorch branch into main and reverted my fastsafetensors patch, as the proper fix is now included in vLLM.

Awesome.

Yep! Rebuilt a few moments ago.

Oh, just noticed this is for solo. Just drop --solo for cluster?

Yes, though I haven’t tested on cluster yet; I’m doing that now. For cluster you will need to add -tp 2 --distributed-executor-backend ray.

Ah, duh, yes. Thanks