Wondering if anybody has some good build recipes for optimizing Nemotron3 for OpenClaw with NVIDIA-specific tools and formats on Spark to maximize performance. (NVFP4, trtllm-serve, full model feature support, etc.)
Right now ~16 tok/sec is about as good as it gets, but at least it seems stable, which is something… Have a look here for details.
It's a bit painful when using it for OpenClaw experiments, but it does work. People have figured out how to get Mistral's 119B NVFP4 model running using some clever workarounds. I'm going to have a look at that over the next few days, as it supposedly gets 30+ tok/sec, but I'm not sure how capable that model is for tool tasks etc. TBD.
So where's the 1 PFLOP & 100 tokens/sec NVFP4 performance then?
Preaching to the choir my friend. There are some discussions elsewhere on this, but the details are not encouraging for dense model inference.
Right now I'm using Nemotron 3 Nano 30B A3B NVFP4 with OpenClaw. I run it at only 0.75 of the KV cache and it seems OK, but sometimes the LLM times out, and other times it empties the VRAM out of nowhere. It quits vLLM as well. OpenClaw won't crash that easily, but vLLM on Nemotron quits often; I'm currently trying to understand why this happens. It says something like **EngineCore encountered an issue. See stack trace (above) for the root cause.**
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] AsyncLLM output_handler failed.
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] Traceback (most recent call last):
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 498, in output_handler
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] outputs = await engine_core.get_output_async()
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 885, in get_output_async
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] raise self._format_exception(outputs) from None
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] Error in chat completion stream generator.
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] Traceback (most recent call last):
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 619, in chat_completion_stream_generator
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] async for res in result_generator:
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 444, in generate
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] out = q.get_nowait() or await q.get()
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] ^^^^^^^^^^^^^
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 70, in get
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] raise output
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 498, in output_handler
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] outputs = await engine_core.get_output_async()
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 885, in get_output_async
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] raise self._format_exception(outputs) from None
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W319 00:04:13.733902718 ProcessGroupNCCL.cpp:1564] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Distributed communication package - torch.distributed - PyTorch 2.10 documentation (function operator())
(APIServer pid=167) INFO: Shutting down
(APIServer pid=167) INFO: Waiting for application shutdown.
(APIServer pid=167) INFO: Application shutdown complete.
(APIServer pid=167) INFO: Finished server process [167]
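Until I find the root cause, a crude restart loop keeps the endpoint up whenever the engine dies. This is a sketch, not a fix: the model id is a placeholder, and `--gpu-memory-utilization 0.75` is my assumption for what "run at 0.75 of the KV cache" maps to in vLLM.

```shell
#!/usr/bin/env bash
# Relaunch vLLM whenever EngineCore dies and the server process exits.
# Model id is a placeholder; adjust flags for your setup.
while true; do
  vllm serve nvidia/Nemotron-3-Nano-30B-A3B-NVFP4 \
    --gpu-memory-utilization 0.75 \
    --port 8000
  echo "vLLM exited (EngineDeadError?), restarting in 10s" >&2
  sleep 10
done
```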
Ollama has an option to deploy OpenClaw now. I just tried that with nemotron-3-super (it's far easier than the NemoClaw instructions, which are missing a bunch of stuff for Spark). It takes about 100 GB of RAM to run. I used their regular nemo3s install option, so it's obviously not optimized for TensorRT, and I don't know the quant. It runs. PP is slow; tokens/second is usable. I had to set up the Ollama key for web search and web fetch, and it runs now.

I asked it to do a few things after the initial setup questions for the heart and soul .md's, enabling thinking and reasoning (not sure what the difference is; reasoning is set to on, thinking set to adaptive). After a few queries like "can you make your own skills?", the token context went over budget, to more than 600K against the 128K limit, so that's likely going to be a problem. I can't see this being usable for any kind of long-term business research on nemo3S unless NVFP4 quants on TensorRT are as straightforward as just "ollama launch openclaw".
When I tried the basic NemoClaw script, it just quit after step 2-3, and then when you run "nemoclaw onboard" it repeats step 3, duplicating the Docker image. There's also a hard-to-spot "setup-spark" command for Sparks (because of Docker and cgroups v2 issues with DGX OS), and yet they don't correct for that, leaving duplicated Docker images installed every time it errors out. It's kind of a mess. And the dedicated NemoClaw instructions for Spark still don't use TRTLLM.
I too am awaiting the nemotron-3-super NVFP4 quants on TensorRT to use with OpenClaw. Maybe with the CUDA 3.2 release?
I tried NVIDIA's own Nemo3S NIM. I ran their check on it, and there is no GB10-specific engine build in it. They are all amd64 runtimes for B and H server GPUs. Also, there aren't any TRTLLM runners included; they are all vLLM.
I'm going to give trtllm-serve one more manual try, this time with Nemo3Nano NVFP4.
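For what it's worth, the invocation I plan to try looks roughly like this. The model id is a placeholder and the flag values are my assumptions, not a verified recipe, so check `trtllm-serve --help` for your build:

```shell
# Serve the NVFP4 Nano model via TensorRT-LLM's OpenAI-compatible server.
# Model id and flag values are assumptions; verify against your TRT-LLM version.
trtllm-serve nvidia/Nemotron-3-Nano-30B-A3B-NVFP4 \
  --backend pytorch \
  --host 0.0.0.0 --port 8000 \
  --max_batch_size 1 \
  --kv_cache_free_gpu_memory_fraction 0.75
```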
I don't recall seeing a GB10-specific NIM for it either, but I'll check the engines in the container if I get some time.
I tried to get nemotron-3-super to work today but could not get a connection to it to function. It stated "could not find nvidia/nem…" when I tried to chat with an agent. I'm sure it's because of the settings in openclaw.json. Would you mind sharing the config you use to connect the two?
The easiest way I could get Nemo3S to work with OpenClaw is to use Ollama and launch a clean OpenClaw from it. In simple terms:
- Install Ollama from the script on their page.
- Run:
  ollama launch openclaw --model nemotron-3-super:latest
This is NOT going to be an optimized NVFP4 variant with tensor-core optimizations. Running on a GB10, you're going to see ~100 GB of RAM usage. You'll want to set up a memory storage option to keep the context window under control. You can do this by chatting with the bot during and after the original setup: just tell it you're concerned about the context window and want to keep it under control, and it should come up with a management system to trim and compact previous conversations into a log it can read later. You'll need an Ollama API key to turn on web_search and web_fetch; give it those skills right away. Reasoning can be turned on, and thinking set to Adaptive seems to work, but expect about a 3-minute wait for the start of a response to any query you give it.
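For anyone who would rather point an existing OpenClaw install at Ollama's local endpoint instead of using `ollama launch`: the provider entry in openclaw.json is shaped roughly like this. The field names here are from memory and may not match your version, so treat the whole snippet as illustrative, not exact (Ollama's OpenAI-compatible endpoint does live at `http://localhost:11434/v1`):

```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "models": ["nemotron-3-super:latest"]
      }
    }
  }
}
```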
Not a big fan of Ollama. 3 minutes is a massive problem when vLLM with gpt-oss-120b is ~3 seconds.
Let's keep this chat thread at the top of the forum, and hopefully NVIDIA will fix the TRTLLM/NVFP4 issues ASAP for the Sparks. I foresee this as the ultimate solution for OpenClaw/NemoClaw on Sparks, which is a near-perfect platform for deployment under $8K…
I haven't seen any claims of Super or any 100-200B-param model doing 3 seconds for PP. This isn't datacenter hardware. A 1-million-token context window is also unrealistic on this hardware with a model that size.
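A quick back-of-envelope check supports that. Prefill costs roughly 2 FLOPs per parameter per prompt token, so even with generous assumed numbers (a ~100B-param dense model and ~100 TFLOPS sustained, both placeholders, not measured Spark figures), a 32K-token prompt is nowhere near 3 seconds:

```python
# Rough prefill cost: ~2 FLOPs per parameter per prompt token.
def prefill_seconds(params: float, prompt_tokens: int, sustained_flops: float) -> float:
    """Estimate prompt-processing (prefill) time for a dense model."""
    return 2 * params * prompt_tokens / sustained_flops

# ~100B dense params, 32K-token prompt, assuming ~100 TFLOPS sustained.
# (The sustained figure is a placeholder; real throughput depends on quant/kernels.)
t = prefill_seconds(100e9, 32_000, 100e12)
print(round(t, 1))  # -> 64.0 seconds, i.e. tens of seconds, not 3
```

MoE models with few active parameters (like Nano's A3B) fare much better on this estimate, which is consistent with them being the only usable option so far.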
Looks like we're going to have to wait for an official NIM, or at least real TRTLLM support, to see any improvement. Reading through the forum, I haven't found a single person who's gotten a tensor-optimized kernel to build successfully for trtllm-serve. This looks like another example of hardware being pushed out with big claims but without the software support to prove them.
Did you find a solution to this one? I am having the same issue, "AsyncLLM output_handler failed", randomly.
OpenClaw / Nemotron-3-Super is why I purchased the DGX Spark, but the 3-minute initial load time, 1-3 minute response times, 15 t/s, and the crashes every few hours are infuriating at this point.
I do hope the team working on this sees this as unacceptable, as it is a far cry from the advertised 1 PFLOP & 100 tokens/sec NVFP4 performance… @aniculescu
Here, let me fix that for you, NVIDIA: "We think it can do 1 PFLOP & 100 tokens/sec NVFP4 performance, but you have to figure that out for yourself, and if you do, let us know; we'll happily take the credit for it."
I think the folks who have stuck around, like @eugr, and earned the Spark Expert badge should be given full access and a stack of Sparks, as they are essentially doing NVIDIA's job for them :-(
ok, now back to my coffee.
Alright, I just updated Ollama with OpenClaw and the Nemotron-Cascade-2 model, and it's way better than Super. There are some new updates for Super and Nano via NIMs. I'm testing more stuff with them.
