Wondering if anybody has some good build recipes for optimizing Nemotron3 for OpenClaw with NVIDIA-specific tools and formats on Spark to maximize performance. (NVFP4, trtllm-serve, full model feature support, etc.)
Right now ~16 tok/sec is about as good as it gets, but at least it seems stable, which is something… Have a look here for details.
It's a bit painful when using it for OpenClaw experiments, but it does work. People have figured out how to get Mistral's 119B NVFP4 model running using some clever workarounds. I'm going to have a look at that over the next few days, as it supposedly gets 30+ tok/sec, but I'm not sure how capable that model is for tool tasks etc. TBD.
So where's the 1 PFLOP & 100 tokens/sec NVFP4 performance then?
Preaching to the choir my friend. There are some discussions elsewhere on this, but the details are not encouraging for dense model inference.
Right now I'm using Nemotron 3 Nano 30B A3B NVFP4 with OpenClaw. I run it at only 0.75 of the KV cache and it seems OK, but sometimes the LLM times out, and other times it empties the VRAM out of nowhere. It quits vLLM as well. OpenClaw won't crash that easily, but vLLM on Nemotron quits often; I'm currently trying to understand why this happens. It says something like **EngineCore encountered an issue. See stack trace (above) for the root cause.**
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] AsyncLLM output_handler failed.
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] Traceback (most recent call last):
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 498, in output_handler
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] outputs = await engine_core.get_output_async()
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 885, in get_output_async
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] raise self._format_exception(outputs) from None
(APIServer pid=167) ERROR 03-19 00:04:13 [async_llm.py:546] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] Error in chat completion stream generator.
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] Traceback (most recent call last):
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 619, in chat_completion_stream_generator
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] async for res in result_generator:
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 444, in generate
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] out = q.get_nowait() or await q.get()
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] ^^^^^^^^^^^^^
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 70, in get
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] raise output
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 498, in output_handler
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] outputs = await engine_core.get_output_async()
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 885, in get_output_async
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] raise self._format_exception(outputs) from None
(APIServer pid=167) ERROR 03-19 00:04:13 [serving_chat.py:1287] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W319 00:04:13.733902718 ProcessGroupNCCL.cpp:1564] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Distributed communication package - torch.distributed - PyTorch 2.10 documentation (function operator())
(APIServer pid=167) INFO: Shutting down
(APIServer pid=167) INFO: Waiting for application shutdown.
(APIServer pid=167) INFO: Application shutdown complete.
(APIServer pid=167) INFO: Finished server process [167]
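Until I find the root cause, a crude restart loop keeps the endpoint up whenever the engine dies. This is a sketch, not a fix: the model id is a placeholder, and `--gpu-memory-utilization 0.75` is my assumption for what "run at 0.75 of the KV cache" maps to in vLLM.

```shell
#!/usr/bin/env bash
# Relaunch vLLM whenever EngineCore dies and the server process exits.
# Model id is a placeholder; adjust flags for your setup.
while true; do
  vllm serve nvidia/Nemotron-3-Nano-30B-A3B-NVFP4 \
    --gpu-memory-utilization 0.75 \
    --port 8000
  echo "vLLM exited (EngineDeadError?), restarting in 10s" >&2
  sleep 10
done
```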
Ollama has an option to deploy OpenClaw now. I just tried that with nemotron-3-super (it's far easier than the NemoClaw instructions, which are missing a bunch of stuff for Spark). It takes about 100 GB of RAM to run. I used their regular nemo3s install option, so it's obviously not optimized for TensorRT, and I don't know the quant. It runs. PP is slow; tokens/second is usable. I had to set up the Ollama key for web search and web fetch, and it runs now.

I asked it to do a few things after the initial setup questions for the heart and soul .md's, enabling thinking and reasoning (not sure what the difference is; reasoning is set to on, thinking set to adaptive). After a few queries like "can you make your own skills?", the token context went over budget, to more than 600K against the 128K limit, so that's likely going to be a problem. I can't see this being usable for any kind of long-term business research on nemo3S unless NVFP4 quants on TensorRT are as straightforward as just "ollama launch openclaw".
When I tried the basic NemoClaw script, it just quit after step 2-3, and then when you run "nemoclaw onboard" it repeats step 3, duplicating the Docker image. There's also a hard-to-spot "setup-spark" command for Sparks (because of Docker and cgroups v2 issues with DGX OS), and yet they don't correct for that, leaving duplicated Docker images installed every time it errors out. It's kind of a mess. And the dedicated NemoClaw instructions for Spark still don't use TRTLLM.
I too am awaiting the nemotron-3-super NVFP4 quants on TensorRT to use with OpenClaw. Maybe with the CUDA 3.2 release?
I tried NVIDIA's own Nemo3S NIM. I ran their check on it, and there is no GB10-specific engine build in it. They are all amd64 runtimes for B and H server GPUs. Also, there aren't any TRTLLM runners included; they are all vLLM.
I'm going to give trtllm-serve one more manual try, this time with Nemo3Nano NVFP4.
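For what it's worth, the invocation I plan to try looks roughly like this. The model id is a placeholder and the flag values are my assumptions, not a verified recipe, so check `trtllm-serve --help` for your build:

```shell
# Serve the NVFP4 Nano model via TensorRT-LLM's OpenAI-compatible server.
# Model id and flag values are assumptions; verify against your TRT-LLM version.
trtllm-serve nvidia/Nemotron-3-Nano-30B-A3B-NVFP4 \
  --backend pytorch \
  --host 0.0.0.0 --port 8000 \
  --max_batch_size 1 \
  --kv_cache_free_gpu_memory_fraction 0.75
```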
I don't recall seeing a GB10-specific NIM for it either, but I'll check the engines in the container if I get some time.
I tried to get nemotron-3-super to work today but could not get a connection to it to function. It stated "could not find nvidia/nem…" when I tried to chat with an agent. I'm sure it's because of the settings in openclaw.json. Would you mind sharing the config you use to connect the two?
The easiest way I could get Nemo3S to work with OpenClaw is to use Ollama and launch a clean OpenClaw from it. In simple terms:
- Install Ollama from the script on their page.
- Run:
  ollama launch openclaw --model nemotron-3-super:latest
This is NOT going to be an optimized NVFP4 variant with tensor-core optimizations. Running on a GB10, you're going to see ~100 GB of RAM usage. You'll want to set up a memory storage option to keep the context window under control. You can do this by chatting with the bot during and after the original setup: just tell it you're concerned about the context window and want to keep it under control, and it should come up with a management system to trim and compact previous conversations into a log it can read later. You'll need an Ollama API key to turn on web_search and web_fetch; give it those skills right away. Reasoning can be turned on, and thinking set to Adaptive seems to work, but expect about a 3-minute wait for the start of a response to any query you give it.
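For anyone who would rather point an existing OpenClaw install at Ollama's local endpoint instead of using `ollama launch`: the provider entry in openclaw.json is shaped roughly like this. The field names here are from memory and may not match your version, so treat the whole snippet as illustrative, not exact (Ollama's OpenAI-compatible endpoint does live at `http://localhost:11434/v1`):

```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "models": ["nemotron-3-super:latest"]
      }
    }
  }
}
```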
Not a big fan of Ollama. 3 minutes is a massive problem when vLLM with gpt-oss-120b is ~3 seconds.
Let's keep this chat thread at the top of the forum, and hopefully NVIDIA will fix the TRTLLM/NVFP4 issues ASAP for the Sparks. I foresee this as the ultimate solution for OpenClaw/NemoClaw on Sparks, which is a near-perfect platform for deployment under $8K…
I haven't seen any claims of Super or any 100-200B-param model doing 3 seconds for PP. This isn't datacenter hardware. A 1-million-token context window is also unrealistic on this hardware with a model that size.
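A quick back-of-envelope check supports that. Prefill costs roughly 2 FLOPs per parameter per prompt token, so even with generous assumed numbers (a ~100B-param dense model and ~100 TFLOPS sustained, both placeholders, not measured Spark figures), a 32K-token prompt is nowhere near 3 seconds:

```python
# Rough prefill cost: ~2 FLOPs per parameter per prompt token.
def prefill_seconds(params: float, prompt_tokens: int, sustained_flops: float) -> float:
    """Estimate prompt-processing (prefill) time for a dense model."""
    return 2 * params * prompt_tokens / sustained_flops

# ~100B dense params, 32K-token prompt, assuming ~100 TFLOPS sustained.
# (The sustained figure is a placeholder; real throughput depends on quant/kernels.)
t = prefill_seconds(100e9, 32_000, 100e12)
print(round(t, 1))  # -> 64.0 seconds, i.e. tens of seconds, not 3
```

MoE models with few active parameters (like Nano's A3B) fare much better on this estimate, which is consistent with them being the only usable option so far.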
Looks like we're going to have to wait for an official NIM, or at least real TRTLLM support, to see any improvement. Reading through the forum, I haven't found a single person who's gotten a tensor-optimized kernel to build successfully for trtllm-serve. This looks like another example of hardware being pushed out with big claims but without the software support to prove them.
Did you find a solution to this one? I am having the same issue, "AsyncLLM output_handler failed", randomly.
OpenClaw / Nemotron-3-Super is why I purchased the DGX Spark, but the 3-minute initial load time, 1-3 minute response times, 15 t/s, and the crashes every few hours are infuriating at this point.
I do hope the team working on this sees this as unacceptable, as it is a far cry from the advertised 1 PFLOP & 100 tokens/sec NVFP4 performance… @aniculescu
Here, let me fix that for you, NVIDIA: "We think it can do 1 PFLOP & 100 tokens/sec NVFP4 performance, but you have to figure that out for yourself, and if you do, let us know; we'll happily take the credit for it."
I think the folks who have stuck around, like @eugr, and earned the Spark Expert badge should be given full access and a stack of Sparks, as they are essentially doing NVIDIA's job for them :-(
ok, now back to my coffee.
Alright, I just updated Ollama with OpenClaw and the Nemotron-Cascade-2 model, and it's way better than Super. There are some new updates for Super and Nano via NIMs. I'm testing more stuff with them.
