Tomorrow (or during GTC 2026), I do expect to see NemoClaw announced officially, and the DGX Spark may be the perfect desktop device for this configuration. The promise of enterprise security and safety modifications to the original OpenClaw should lead to a large spike in Spark adopters like myself, especially among the anti-Apple crowd.
With its high prefill throughput compared to other devices in its price range, it should be ideal for agentic tasks, only dropping down to decode speed when output needs to be saved or shown to a human.
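Roughly, the prefill/decode trade-off above can be sketched with a toy latency model. The throughput numbers below are illustrative placeholders I made up for the sketch, not measured DGX Spark figures:

```python
# Toy latency model for an agentic loop: most tokens are ingested
# (prefill) and only short results are generated (decode), so a device
# with fast prefill but modest decode can still feel quick for agents.
# All throughput numbers are hypothetical placeholders.

def turn_latency_s(prompt_tokens: int, output_tokens: int,
                   prefill_tps: float, decode_tps: float) -> float:
    """Seconds for one model turn: prefill the context, then decode the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# An agent re-reading a large context but emitting only a short tool call:
agentic = turn_latency_s(30_000, 200, prefill_tps=2_000, decode_tps=20)
# A chat-style turn: short prompt, long generated answer:
chatty = turn_latency_s(500, 2_000, prefill_tps=2_000, decode_tps=20)

print(f"agentic turn: {agentic:.0f}s, chatty turn: {chatty:.0f}s")
```

With these made-up numbers, the agentic turn is dominated by (fast) prefill while the chatty turn is dominated by (slow) decode, which is the point of the comment above.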
If the inference isn't stable, given the playbooks needed every time there's a new model, I don't believe in a magical release. If it's going to use Ubuntu and pull from the Nvidia API, well, any Ubuntu server can do that. I'm eagerly awaiting any release, but with low expectations.
Interesting that it claims to be "enterprise ready" but is supposedly still OpenClaw under the hood.
Network guardrails, enterprise policy, and privacy routing are claimed.
I kind of expected something a bit more like NanoClaw: simple, smaller, contained, built from the ground up on best practices. But that wouldn't carry the claim of being the most important software of all time.
They also announced a Nemotron coalition; Mistral AI and Black Forest Labs are part of it. I find that even more exciting than NemoClaw... :-D
I sense a certain leather-jacket hubris spreading: with well-tinted sunglasses, you can no longer see even the sun on the horizon, because anything that doesn't start with an "N" and threatens to stand taller simply gets filtered out. Or is this more of a "we don't love Chinese" thing?
Either way, N ends up undermining its own genuinely remarkable achievement by overselling it so aggressively, as usual?
Especially with LLMs, it's all just water in the same pot; everyone boils at the same temperature. Oh boy.
For math, code, and science, we start from curated problem sets and use open source permissive models such as GPT-OSS-120B to produce step-by-step reasoning traces, candidate solutions, best-of-n selection traces, and verified CUDA kernels.
Benchmarks

| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| **General Knowledge** | | | |
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| **Reasoning** | | | |
| AIME25 (no tools) | 90.21 | 90.36 | 92.50 |
| HMMT Feb25 (no tools) | 93.67 | 91.40 | 90.00 |
| HMMT Feb25 (with tools) | 94.73 | 89.55 | — |
| GPQA (no tools) | 79.23 | 86.60 | 80.10 |
| GPQA (with tools) | 82.70 | — | 80.09 |
| LiveCodeBench (v5, 2024-08 to 2025-05) | 81.19 | 78.93 | 88.00 |
| SciCode (subtask) | 42.05 | 42.00 | 39.00 |
| HLE (no tools) | 18.26 | 25.30 | 14.90 |
| HLE (with tools) | 22.82 | — | 19.00 |
| **Agentic** | | | |
| Terminal Bench (hard subset) | 25.78 | 26.80 | 24.00 |
| Terminal Bench Core 2.0 | 31.00 | 37.50 | 18.70 |
| SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.90 |
| SWE-Bench (OpenCode) | 59.20 | 67.40 | — |
| SWE-Bench (Codex) | 53.73 | 61.20 | — |
| SWE-Bench Multilingual (OpenHands) | 45.78 | — | 30.80 |
| TauBench V2: Airline | 56.25 | 66.00 | 49.20 |
| TauBench V2: Retail | 62.83 | 62.60 | 67.80 |
| TauBench V2: Telecom | 64.36 | 95.00 | 66.00 |
| TauBench V2: Average | 61.15 | 74.53 | 61.00 |
| BrowseComp with Search | 31.28 | — | 33.89 |
| BIRD Bench | 41.80 | — | 38.25 |
| **Chat & Instruction Following** | | | |
| IFBench (prompt) | 72.56 | 73.77 | 68.32 |
| Scale AI Multi-Challenge | 55.23 | 61.50 | 58.29 |
| Arena-Hard-V2 | 73.88 | 75.15 | 90.26 |
| **Long Context** | | | |
| AA-LCR | 58.31 | 66.90 | 51.00 |
| RULER-100 @ 256k | 96.30 | 96.74 | 52.30 |
| RULER-100 @ 512k | 95.67 | 95.95 | 46.70 |
| RULER-100 @ 1M | 91.75 | 91.33 | 22.30 |
| **Multilingual** | | | |
| MMLU-ProX (avg over langs) | 79.36 | 85.06 | 76.59 |
| WMT24++ (en→xx) | 86.67 | 87.84 | 88.89 |
Can someone explain how this model is "better"?
For the moment, my experience is that it is not performing well on sm121, and the benchmark data shows Qwen3.5 122B has better overall results.
I can only confirm that; so far, only Nvidia's marketing is better :)
These links don't seem to work any more, and it's not clear how to configure a local model to use with NemoClaw. Has anyone done it yet and can share the details?