If you are experiencing memory issues, the cause may be your cache. Please see our FAQ on how to clear the cache.
Related topics
| Topic | Replies | Views | Activity |
|---|---|---|---|
| Qwen3-235B-A22B-NVFP4 Playbook Example Hangs | 5 | 502 | April 2, 2026 |
| Qwen3.5-397B-A17B + DGX Spark (duo) | 56 | 5119 | April 13, 2026 |
| DGX Spark crashes when running tensorrt-llm | 3 | 217 | March 7, 2026 |
| Question on Inference Performance Results of Qwen3 235B A22B on 2× DGX Spark | 5 | 737 | December 19, 2025 |
| [Bug] TensorRT-LLM 1.2.0rc8: "TRTLLMGenFusedMoE does not support SM120" error on DGX Spark with gpt-oss-120b + Eagle3 | 9 | 529 | February 17, 2026 |
| [Issue] Qwen3-Next-80B NVFP4 and FP8 Cannot Be Served via trtllm-serve on DGX Spark GB10 (TRT-LLM 1.3.0rc7) | 2 | 212 | May 1, 2026 |
| Bf16 LoRA Fine-Tuning of Qwen3.5-35B-A3B on DGX Spark — No Quantization Required | 5 | 768 | April 6, 2026 |
| DGX Spark Multi-Node LLM Inference Report for Qwen3-235B model | 35 | 2211 | May 1, 2026 |
| DGX Spark performance | 50 | 4821 | February 27, 2026 |
| Qwen3.5-397B-A17B-int4-AutoRound - 4 x db10 node - updated results 37 - 94 tok/s | 26 | 1762 | April 28, 2026 |