nvidia/Nemotron-Cascade-2-30B-A3B yet another model to test

Well. Just as I got Mistral Small 4 running on my Spark, ready to take it for a test ride over the weekend, the next model pops up. šŸ˜… NVIDIA seems pretty serious about shipping more useful models and improving them, because… the more you buy, the more you save! šŸ˜‚

The comparison chart makes me curious: it claims to be an even better coder than Qwen3.5-397B-A17B and Kimi-K2.5-1T.

Would love to hear feedback on real-life use cases from you.

(EDIT) And I will also test it on my own, of course.


I managed to run it following the configuration of Nemotron and other Nemotron relatives, but it choked at high context lengths and only gave me 31 t/s, so I haven’t been able to gauge its usefulness in the real world.

Interesting! Thanks!
Did you use nvidia/Nemotron-Cascade-2-30B-A3B or a quantized version? (There are some early AWQs on Hugging Face.)

BF16, from nvidia/Nemotron-Cascade-2-30B-A3B Ā· Hugging Face. I haven’t tried AWQ; I hope NVFP4 performs as well as it did for Nemotron 30B.

Yeah, just ran a quick test, and it does give 31 t/s:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 | 3282.36 ± 1446.22 | | 874.98 ± 554.72 | 868.22 ± 554.72 | 875.18 ± 554.82 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 | 31.00 ± 0.16 | 31.41 ± 0.59 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_pp @ d8192 | 3076.27 ± 1745.63 | | 5768.06 ± 5460.11 | 5761.30 ± 5460.11 | 5768.13 ± 5460.13 |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_tg @ d8192 | 31.14 ± 0.52 | 31.64 ± 0.90 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 @ d8192 | 2767.06 ± 31.61 | | 746.98 ± 8.49 | 740.23 ± 8.49 | 747.16 ± 8.48 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 @ d8192 | 30.80 ± 0.04 | 31.00 ± 0.00 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_pp @ d16384 | 4193.91 ± 32.15 | | 3913.60 ± 30.11 | 3906.85 ± 30.11 | 3913.82 ± 30.10 |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_tg @ d16384 | 31.21 ± 0.99 | 31.89 ± 1.26 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 @ d16384 | 2304.87 ± 241.30 | | 905.93 ± 101.51 | 899.18 ± 101.51 | 906.03 ± 101.46 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 @ d16384 | 30.53 ± 0.05 | 31.00 ± 0.00 | | | |

llama-benchy (0.3.5)
date: 2026-03-21 21:46:56 | latency mode: api | pp basis: ttfr

I’d wait for an FP8 version, though.


On two nodes (just out of interest; I don’t think it’s very practical):

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 | 4657.63 ± 2361.64 | | 699.82 ± 510.46 | 692.66 ± 510.46 | 699.99 ± 510.49 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 | 55.66 ± 4.02 | 57.47 ± 4.15 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_pp @ d8192 | 4820.68 ± 2962.00 | | 5105.94 ± 5534.76 | 5098.78 ± 5534.76 | 5106.09 ± 5534.70 |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_tg @ d8192 | 53.97 ± 1.44 | 55.79 ± 1.57 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 @ d8192 | 4240.50 ± 167.67 | | 490.90 ± 19.67 | 483.74 ± 19.67 | 491.14 ± 19.79 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 @ d8192 | 54.53 ± 2.10 | 56.30 ± 2.16 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_pp @ d16384 | 6842.95 ± 94.65 | | 2401.91 ± 33.33 | 2394.75 ± 33.33 | 2402.03 ± 33.38 |
| nvidia/Nemotron-Cascade-2-30B-A3B | ctx_tg @ d16384 | 54.54 ± 2.95 | 56.31 ± 3.04 | | | |
| nvidia/Nemotron-Cascade-2-30B-A3B | pp2048 @ d16384 | 3737.33 ± 246.17 | | 557.65 ± 38.00 | 550.49 ± 38.00 | 557.81 ± 37.94 |
| nvidia/Nemotron-Cascade-2-30B-A3B | tg32 @ d16384 | 55.27 ± 3.53 | 57.07 ± 3.64 | | | |

llama-benchy (0.3.5)
date: 2026-03-21 21:56:39 | latency mode: api | pp basis: ttfr


I’ll make a recipe later, but here is a launch command for now:

./launch-cluster.sh --solo \
    exec vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
    --port 8888 \
    --host 0.0.0.0 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --load-format fastsafetensors \
    --gpu-memory-utilization 0.7
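Once the server is up, a quick way to sanity-check the endpoint outside any agent is a plain OpenAI-style chat request. A minimal stdlib-only sketch; the port and model name are taken from the launch command above, and `build_chat_request` is my own hypothetical helper:

```python
import json
from urllib import request

def build_chat_request(base_url: str, prompt: str) -> request.Request:
    """Build a POST request for vLLM's OpenAI-compatible chat endpoint."""
    payload = {
        "model": "nvidia/Nemotron-Cascade-2-30B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Against a running server:
# with request.urlopen(build_chat_request("http://localhost:8888", "Hi")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```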

I tried it in OpenCode, and I’m not very impressed so far.

Damn, I forgot to set the reasoning parser on my first attempts. šŸ¤¦ā€ā™‚ļø Thanks for posting your recipe.

Maybe NVIDIA needs to pimp OpenCode and make a ā€œNemoCodeā€ first… šŸ˜‚

I will do some testing today after being disappointed by Mistral yesterday. I also tried Mistral’s Vibe CLI… it didn’t convince me either. I used that after seeing template issues (see other post).

The AWQ in the stelterlab repo was made by me, using the same recipe I used for the previous Nemotron. AWQ delivers around 70 t/s on a single Spark. I will do more tests today with opencode and make a full bench with llama-benchy.

I spun up this model in an OpenCode project and asked it to list all the JavaScript files in a folder.

My main model at the moment, MiniMax M2.5, happily reported all the files it found, including those in subfolders. This model, however, stumbled right at the starting point: it reported that it couldn’t find any files and didn’t bother to look in subfolders.

Not an auspicious start as a demonstration of its reasoning abilities.
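For what it’s worth, the task itself is essentially a one-liner with a recursive glob; anything that only checks the top level will miss nested files. A quick stdlib sketch (the function name is my own):

```python
from pathlib import Path

def list_js_files(root: str) -> list[str]:
    """Recursively list all .js files under root, subfolders included."""
    return sorted(str(p) for p in Path(root).rglob("*.js"))

# Note: Path(root).glob("*.js") would only see the top-level folder,
# which is the trap this model apparently fell into.
```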

It must be good for something, but certainly not for coding.

Or I just found a new challenge for coding models. I stumbled upon FastMCP a while ago, and I had actually planned to take a deeper look into it this weekend… so I thought combining both would be a good idea.

I used opencode and VS Code Insiders, just to see whether opencode was simply a bad combo with that model. Both agents had access to context7. My test mission:

Step 1

Create a new python application. Use uv for package management and use Python 3.12. Create a .venv with uv and install dependencies with uv.

The python application shall use the FastMCP library and serve a tool that returns the current date and time via MCP.

Step 2

Create a simple client to test the server.

End of mission.
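For context, the server half of the mission is only a few lines when written by hand. A minimal sketch, assuming the fastmcp package (FastMCP 2.x API; the file layout, server label, and import guard are my own additions):

```python
from datetime import datetime, timezone

def current_datetime() -> str:
    """Return the current UTC date and time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()

# The FastMCP wiring is guarded so the helper stays importable even
# where fastmcp is not installed:
try:
    from fastmcp import FastMCP  # package name: fastmcp

    mcp = FastMCP("datetime-server")  # server label is arbitrary
    mcp.tool(current_datetime)        # register the function as an MCP tool
    # To serve (blocks, speaking MCP over stdio): mcp.run()
    # A test client can then do roughly:
    #   async with fastmcp.Client("server.py") as c:
    #       await c.call_tool("current_datetime", {})
except ImportError:
    pass
```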

What can I say? It failed badly. It used context7 to get info on how to use FastMCP, but I’m not sure Nemo understood what it got.

I first tried my AWQ quant and then the full-blown model, which was even worse at using the tools…

I also tried good old gpt-oss-120b, because it was already running on another system, and it succeeded in one shot. Nemo even struggled with the regular Python version on my Mac, so I gave more hints in the first prompt… but it failed again. Maybe it’s better with more precise instructions… but then it’s not my model.

ā€œThere’s no replacement for ~~displacement~~ model parameter count.ā€

Benchmaxxing is getting old.


Might also be worth trying the nemotron_v3 parser, since there are some subtle differences between it and the qwen3 version baked into vLLM.

Well, I highly doubt that the ā€œextraā€ world knowledge included is necessary for coding. I still think a bunch of smaller, more specialized models working together would be much better.

I would also like to see some smaller (<= 35B) fine-tuned models trained on the CoderForge dataset, which was open-sourced by together.ai.

BTW, out of curiosity I just tried Qwen 3.5 35B on that problem, and it also did it in one shot. That model has been named a good replacement for gpt-oss-120b.

That’s worth a try, even if it is not included in the repo.

You can use the nemotron-3-nano or nemotron-3-super mode.
But I’m observing some strange behavior from Nemotron 3 Super as well; I wonder if something is broken in vLLM again. It passes tests because it can give coherent responses, but the responses are still dumb across the entire Nemotron line.

The suitable reasoning parser should be deepseek_r1 or nemotron_v3 instead of qwen3.

I will re-run my tests tomorrow. 🤪


Per this discussion: nvidia/Nemotron-Cascade-2-30B-A3B Ā· Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25

It seems this model (and probably also Nemotron-3-Nano-4B) requires this parameter in the vLLM startup:

--mamba_ssm_cache_dtype float32

Looks like it made a night-and-day difference on critical benchmarks, and NVIDIA themselves confirm it is crucial.

Maybe take another look…


Interesting, I’ll have a look.

It doesn’t get better with the newly recommended arguments. I tried the full-blown model and my AWQ quant.

Now it even fails when creating the .venv (last test in VS Code).

It tries to install uv (which is already installed) and then proposes running ā€œpython3.12 -m venv .venvā€ instead of using uv. Odd.

vLLM version used: 0.17.2rc1.dev73+g5dd8df070.d20260318