BTW, I’ve just added a new mod that enables using the cluster launch script with the NVIDIA NGC vLLM container, or any other vLLM container that includes InfiniBand libraries and Ray support.
To use it, add `--apply-mod mods/use-ngc-vllm` to the `./launch-cluster.sh` arguments. It can be combined with other mods.
For example, to launch Nemotron Nano in the cluster using NGC container, you can use the following command:
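(A sketch of the invocation: `mods/use-ngc-vllm` is from the text above, but I’m assuming the Nemotron mod lives at `mods/nemotron` — adjust that path to the actual mod directory name.)

```shell
# Launch the cluster with the NGC vLLM container mod stacked with a
# model-specific mod (mods/nemotron is a hypothetical path here).
./launch-cluster.sh --apply-mod mods/use-ngc-vllm --apply-mod mods/nemotron
```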
Forgot to mention: if you run in `--solo` mode, you don’t need to apply `mods/use-ngc-vllm`; it will work without it. The NGC mod is for cluster use only. You can still apply other mods, like the Nemotron one, though.
I tested the tool calling section of NVIDIA’s example notebook:
Works as expected.
Okay, the user wants to calculate a 15% tip on a $50 bill. Let me check the tools available. There's a function called calculate_tip that takes bill_total and tip_percentage. The parameters are integers, so I need to pass 50 and 15. I should make sure the function is called correctly with those values. The user didn't mention any other details, so I don't need to ask for more info. Just plug those numbers into the tool and return the result.
[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-86c8013efccb79be', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function')]
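For anyone wanting to reproduce this outside the notebook, here is a minimal sketch of the client-side plumbing: the tool schema sent with the request, and the local dispatch of the returned tool call. The schema field names follow the OpenAI chat-completions API; the wiring to an actual server is omitted, and the function body is just an illustration.

```python
import json

# Local implementation of the tool the model is asked to call.
def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Return the tip amount for a bill."""
    return bill_total * tip_percentage / 100

# JSON-schema description passed to the server in the `tools` field
# of the chat-completion request (field layout per the OpenAI API).
tools = [{
    "type": "function",
    "function": {
        "name": "calculate_tip",
        "description": "Calculate the tip for a bill",
        "parameters": {
            "type": "object",
            "properties": {
                "bill_total": {"type": "integer"},
                "tip_percentage": {"type": "integer"},
            },
            "required": ["bill_total", "tip_percentage"],
        },
    },
}]

# The server returns the arguments as a JSON string, as in the tool
# call shown above; decode it and dispatch to the local function.
raw_arguments = '{"bill_total": 50, "tip_percentage": 15}'
args = json.loads(raw_arguments)
print(calculate_tip(**args))  # -> 7.5
```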
You may have to adjust temperature and top_p as recommended by the model card:

> temperature=1.0 and top_p=1.0 are recommended for reasoning tasks, while temperature=0.6 and top_p=0.95 are recommended for tool calling.
Temperature, top_p, and top_k can be set by the client per request. Look into this for your client. If I’m understanding your question correctly, these can be set appropriately per category of request (e.g., tool calling) without changing the model being served.
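To illustrate the per-request approach, a small sketch of building requests per category. The preset values come from the model card quoted above; the helper name and structure are my own, not from any particular client library.

```python
# Sampling presets per request category, following the model card
# recommendations quoted above.
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 1.0, "top_p": 1.0},
    "tool_calling": {"temperature": 0.6, "top_p": 0.95},
}

def build_chat_request(messages: list, category: str) -> dict:
    """Build an OpenAI-compatible chat-completion payload with the
    sampling parameters appropriate for the request category."""
    payload = {
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
        "messages": messages,
    }
    payload.update(SAMPLING_PRESETS[category])
    return payload

req = build_chat_request([{"role": "user", "content": "Hi"}], "tool_calling")
print(req["temperature"], req["top_p"])  # -> 0.6 0.95
```

The model being served stays the same; only the sampling parameters attached to each request change.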
You’re the fourth or fifth report I’ve seen of this model crashing under load. I honestly don’t know whether this is a vLLM issue or an NVIDIA issue, but with TensorRT-LLM I’m running 10x slower — though not crashing.
Well, good news: it seems to be more stable with my latest build (I’m using my pytorch-base branch; I think it’s good enough to merge into main now). I successfully ran the bench up to 200K context:
| model | test | t/s | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 | 4231.32 ± 2175.79 | | | 832.16 ± 667.62 | 827.17 ± 667.62 | 832.25 ± 667.63 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 | 59.11 ± 0.85 | 61.03 ± 0.87 | 61.03 ± 0.87 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d4096 | 5195.40 ± 3324.75 | | | 3083.73 ± 3585.09 | 3078.75 ± 3585.09 | 3083.85 ± 3585.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d4096 | 58.42 ± 0.40 | 60.32 ± 0.41 | 60.32 ± 0.41 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d4096 | 2707.10 ± 44.78 | | | 761.72 ± 12.58 | 756.74 ± 12.58 | 761.83 ± 12.58 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d4096 | 58.29 ± 0.12 | 60.20 ± 0.13 | 60.20 ± 0.13 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d8192 | 7076.94 ± 1182.46 | | | 1200.30 ± 226.22 | 1195.31 ± 226.22 | 1200.42 ± 226.18 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d8192 | 56.70 ± 1.86 | 58.55 ± 1.92 | 58.55 ± 1.92 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d8192 | 1606.96 ± 31.06 | | | 1279.91 ± 24.38 | 1274.92 ± 24.38 | 1280.06 ± 24.40 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d8192 | 56.39 ± 2.34 | 58.23 ± 2.41 | 58.23 ± 2.41 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d16384 | 7893.42 ± 48.36 | | | 2080.72 ± 12.70 | 2075.73 ± 12.70 | 2080.89 ± 12.63 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d16384 | 58.62 ± 1.16 | 60.53 ± 1.20 | 60.53 ± 1.20 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d16384 | 869.74 ± 10.27 | | | 2360.04 ± 27.97 | 2355.06 ± 27.97 | 2360.12 ± 27.97 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d16384 | 60.48 ± 3.21 | 62.46 ± 3.31 | 62.46 ± 3.31 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d32768 | 7304.12 ± 31.62 | | | 4491.30 ± 19.48 | 4486.32 ± 19.48 | 4491.38 ± 19.49 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d32768 | 56.70 ± 2.90 | 58.54 ± 3.00 | 58.54 ± 3.00 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d32768 | 425.43 ± 3.00 | | | 4819.15 ± 34.10 | 4814.17 ± 34.10 | 4819.25 ± 34.08 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d32768 | 56.72 ± 0.10 | 58.57 ± 0.10 | 58.57 ± 0.10 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d65535 | 6300.77 ± 40.69 | | | 10405.72 ± 67.16 | 10401.54 ± 67.16 | 10405.97 ± 67.13 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d65535 | 54.95 ± 0.38 | 56.76 ± 0.37 | 56.76 ± 0.37 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d65535 | 188.72 ± 0.95 | | | 10856.45 ± 55.10 | 10852.27 ± 55.10 | 10856.53 ± 55.09 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d65535 | 55.77 ± 3.61 | 57.58 ± 3.73 | 57.58 ± 3.73 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d100000 | 5353.06 ± 27.04 | | | 18685.58 ± 94.70 | 18681.39 ± 94.70 | 18685.70 ± 94.71 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d100000 | 53.91 ± 0.05 | 55.66 ± 0.05 | 55.66 ± 0.05 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d100000 | 105.69 ± 1.08 | | | 19383.78 ± 198.74 | 19379.60 ± 198.74 | 19383.97 ± 198.81 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d100000 | 53.82 ± 0.13 | 55.57 ± 0.14 | 55.57 ± 0.14 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_pp @ d200000 | 3707.86 ± 51.20 | | | 53953.87 ± 740.02 | 53949.69 ± 740.02 | 53954.14 ± 740.02 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ctx_tg @ d200000 | 50.50 ± 0.06 | 52.17 ± 0.06 | 52.17 ± 0.06 | | | |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | pp2048 @ d200000 | 37.44 ± 0.38 | | | 54715.42 ± 550.76 | 54711.23 ± 550.76 | 54715.84 ± 550.91 |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | tg32 @ d200000 | 50.41 ± 0.16 | 52.06 ± 0.17 | 52.06 ± 0.17 | | | |
llama-benchy (0.3.0)
date: 2026-02-09 07:51:19 | latency mode: api
You may need to rebuild your container from the latest changes in the main branch, though: I just merged my pytorch branch into main and reverted my fastsafetensors patch, since the proper fix is now included in vLLM.