nvidia/Nemotron-Cascade-2-30B-A3B yet another model to test

Well. Just as I got Mistral Small 4 running on my Spark, ready to take it for a test ride over the weekend, the next model pops up. šŸ˜… NVIDIA seems pretty serious about shipping more useful models and improving them, because… the more you buy, the more you save! šŸ˜‚

The comparison chart makes me curious: it claims to be an even better coder than Qwen3.5-397B-A17B and Kimi-K2.5-1T.

Would love to hear feedback on real-life use cases from you.

(EDIT) And I will also test it on my own, of course.


I managed to run it following the configuration of Nemotron and other Nemotron relatives, but it choked at high context lengths and only gave me 31 t/s, so I haven’t been able to gauge its usefulness in the real world.

Interesting! Thanks!
Did you use nvidia/Nemotron-Cascade-2-30B-A3B or a quantized version? (There are some early AWQs on Hugging Face.)

BF16, from nvidia/Nemotron-Cascade-2-30B-A3B Ā· Hugging Face. I haven’t tried AWQ; I hope NVFP4 performs as well as it did for Nemotron 30B.

Yeah, just ran a quick test, and it does give 31 t/s:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 | 3282.36 ± 1446.22 | | 874.98 ± 554.72 | 868.22 ± 554.72 | 875.18 ± 554.82 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 | 31.00 ± 0.16 | 31.41 ± 0.59 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_pp @ d8192 | 3076.27 ± 1745.63 | | 5768.06 ± 5460.11 | 5761.30 ± 5460.11 | 5768.13 ± 5460.13 |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_tg @ d8192 | 31.14 ± 0.52 | 31.64 ± 0.90 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 @ d8192 | 2767.06 ± 31.61 | | 746.98 ± 8.49 | 740.23 ± 8.49 | 747.16 ± 8.48 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 @ d8192 | 30.80 ± 0.04 | 31.00 ± 0.00 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_pp @ d16384 | 4193.91 ± 32.15 | | 3913.60 ± 30.11 | 3906.85 ± 30.11 | 3913.82 ± 30.10 |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_tg @ d16384 | 31.21 ± 0.99 | 31.89 ± 1.26 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 @ d16384 | 2304.87 ± 241.30 | | 905.93 ± 101.51 | 899.18 ± 101.51 | 906.03 ± 101.46 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 @ d16384 | 30.53 ± 0.05 | 31.00 ± 0.00 | | | |

llama-benchy (0.3.5)
date: 2026-03-21 21:46:56 | latency mode: api | pp basis: ttfr

I’d wait for an FP8 version, though.


On two nodes (just out of interest; I don’t think it’s very practical):

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 | 4657.63 ± 2361.64 | | 699.82 ± 510.46 | 692.66 ± 510.46 | 699.99 ± 510.49 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 | 55.66 ± 4.02 | 57.47 ± 4.15 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_pp @ d8192 | 4820.68 ± 2962.00 | | 5105.94 ± 5534.76 | 5098.78 ± 5534.76 | 5106.09 ± 5534.70 |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_tg @ d8192 | 53.97 ± 1.44 | 55.79 ± 1.57 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 @ d8192 | 4240.50 ± 167.67 | | 490.90 ± 19.67 | 483.74 ± 19.67 | 491.14 ± 19.79 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 @ d8192 | 54.53 ± 2.10 | 56.30 ± 2.16 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_pp @ d16384 | 6842.95 ± 94.65 | | 2401.91 ± 33.33 | 2394.75 ± 33.33 | 2402.03 ± 33.38 |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_tg @ d16384 | 54.54 ± 2.95 | 56.31 ± 3.04 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 @ d16384 | 3737.33 ± 246.17 | | 557.65 ± 38.00 | 550.49 ± 38.00 | 557.81 ± 37.94 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 @ d16384 | 55.27 ± 3.53 | 57.07 ± 3.64 | | | |

llama-benchy (0.3.5)
date: 2026-03-21 21:56:39 | latency mode: api | pp basis: ttfr


I’ll make a recipe later, but here is a launch command for now:

./launch-cluster.sh --solo \
    exec vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
    --port 8888 \
    --host 0.0.0.0 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --load-format fastsafetensors \
    --gpu-memory-utilization 0.7
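Once the server is up, a quick way to sanity-check the endpoint outside any agent is a plain OpenAI-style chat request. A minimal stdlib-only sketch; the port and model name are taken from the launch command above, and `build_chat_request` is my own hypothetical helper:

```python
import json
from urllib import request

def build_chat_request(base_url: str, prompt: str) -> request.Request:
    """Build a POST request for vLLM's OpenAI-compatible chat endpoint."""
    payload = {
        "model": "nvidia/Nemotron-Cascade-2-30B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Against a running server:
# with request.urlopen(build_chat_request("http://localhost:8888", "Hi")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```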

I tried it in OpenCode, and I’m not very impressed so far.

Damn, I forgot to set the reasoning parser on my first attempts. šŸ¤¦ā€ā™‚ļø Thanks for posting your recipe.

Maybe NVIDIA needs to pimp OpenCode and make a ā€œNemoCodeā€ first… šŸ˜‚

I will do some testing today after being disappointed by Mistral yesterday. I also tried Mistral’s Vibe CLI… it didn’t convince me either. I used that after seeing template issues (see other post).

The AWQ in the stelterlab repo was made by me, using the same recipe I used for the previous Nemotron. AWQ delivers around 70 t/s on a single Spark. I will do more tests today with opencode and make a full bench with llama-benchy.

I spun up this model in an OpenCode project and asked it to list all the JavaScript files in a folder.

My main model at the moment, MiniMax M2.5, happily reported all the files it found, including those in subfolders. This model, however, stumbled right at the starting point: it reported that it couldn’t find any files and didn’t bother to look in subfolders.

Not an auspicious start as a demonstration of its reasoning abilities.
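For what it’s worth, the task itself is essentially a one-liner with a recursive glob; anything that only checks the top level will miss nested files. A quick stdlib sketch (the function name is my own):

```python
from pathlib import Path

def list_js_files(root: str) -> list[str]:
    """Recursively list all .js files under root, subfolders included."""
    return sorted(str(p) for p in Path(root).rglob("*.js"))

# Note: Path(root).glob("*.js") would only see the top-level folder,
# which is the trap this model apparently fell into.
```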

It must be good for something, but certainly not for coding.

Or I just found a new challenge for coding models. I stumbled upon FastMCP a while ago, and I had actually planned to take a deeper look into it this weekend… so I thought combining both would be a good idea.

I used opencode and VS Code Insiders, just to see whether opencode was simply a bad combo with that model. Both agents had access to context7. My test mission:

Step 1

Create a new python application. Use uv for package management and use Python 3.12. Create a .venv with uv and install dependencies with uv.

The python application shall use the FastMCP library and serve a tool that returns the current date and time via MCP.

Step 2

Create a simple client to test the server.

End of mission.
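For context, the server half of the mission is only a few lines when written by hand. A minimal sketch, assuming the fastmcp package (FastMCP 2.x API; the file layout, server label, and import guard are my own additions):

```python
from datetime import datetime, timezone

def current_datetime() -> str:
    """Return the current UTC date and time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()

# The FastMCP wiring is guarded so the helper stays importable even
# where fastmcp is not installed:
try:
    from fastmcp import FastMCP  # package name: fastmcp

    mcp = FastMCP("datetime-server")  # server label is arbitrary
    mcp.tool(current_datetime)        # register the function as an MCP tool
    # To serve (blocks, speaking MCP over stdio): mcp.run()
    # A test client can then do roughly:
    #   async with fastmcp.Client("server.py") as c:
    #       await c.call_tool("current_datetime", {})
except ImportError:
    pass
```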

What can I say? It failed badly. It used context7 to get info on how to use FastMCP, but I’m not sure Nemo understood what it got.

I first tried my AWQ quant and then the full-blown model, which was even worse at using the tools…

I also tried good old gpt-oss-120b, because it was already running on another system, and it succeeded in one shot. Nemo even struggled with the regular Python version on my Mac, so I gave more hints in the first prompt… but it failed again. Maybe it’s better with more precise instructions… but then it’s not my model.

ā€œThere’s no replacement for ~~displacement~~ model parameter count.ā€

Benchmaxxing is getting old.


Might also be worth trying the nemotron_v3 parser, since there are some subtle differences between it and the qwen3 version baked into vLLM.

Well, I highly doubt that the ā€œextraā€ world knowledge included is necessary for coding. I still think a bunch of smaller, more specialized models working together would be much better.

I would also like to see some smaller (<= 35B) fine-tuned models trained on the CoderForge dataset, which was open-sourced by together.ai.

BTW, out of curiosity I just tried Qwen 3.5 35B on that problem, and it also did it in one shot. That model has been named a good replacement for gpt-oss-120b.

That’s worth a try, even if it is not included in the repo.

You can use the nemotron-3-nano or nemotron-3-super mode.
But I’m observing some strange behavior from Nemotron 3 Super as well; I wonder if something is broken in vLLM again. It passes tests because it can give coherent responses, but the responses are still dumb across the entire Nemotron line.

The suitable reasoning parser should be deepseek_r1 or nemotron_v3 instead of qwen3.

I will re-run my tests tomorrow. 🤪


Per this discussion: nvidia/Nemotron-Cascade-2-30B-A3B Ā· Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25

It seems this model (and probably also Nemotron-3-Nano-4B) requires this parameter in the vLLM startup:

--mamba_ssm_cache_dtype float32

Looks like it made a night-and-day difference on critical benchmarks, and NVIDIA themselves confirm it is crucial.

Maybe take another look…


Interesting, I’ll have a look.

It doesn’t get better with the newly recommended arguments. I tried the full-blown model and my AWQ quant.

Now it even fails when creating the .venv (last test in VS Code).

It tries to install uv (which is already installed) and then proposes running ā€œpython3.12 -m venv .venvā€ instead of using uv. Odd.

vLLM version used: 0.17.2rc1.dev73+g5dd8df070.d20260318