DGX Spark GB10 – Asus GX10 – GPU becomes inoperable

I’ve been having problems with my Asus GX10 since the day I bought it, 3 months ago.

I’ve applied every update as it was released, lowered the clock to 2138 MHz, confirmed that the BIOS firmware is correct on both PD FW, set the GPU to persistence mode to try to keep it alive, and tried a full power cycle, all without success.

The GPU drops out at unpredictable intervals, without any load. Sometimes it crashes one hour after a reboot, other times after 3 or 4 days, but always without load (low power consumption and low temperatures).

I’m running it as a server for vLLM (with gemma 4 26b), Open WebUI, and Docling, all in Docker with CUDA support.

This equipment is very unstable and has been a disappointment.

Can anyone help with this issue? I’ve attached log files.

sudo_dmesg.txt (115.5 KB)

journal-current-boot.log (11.2 MB)

gpu-system-info.txt (9.5 KB)

If you have an AI setting things up for you, it loves to pin services to LAN IPs when they should be on local loopback, and other weird slop like that. The first thing I would check is whether it’s pinning LAN IPs and causing loopbacks on your network. Most modern routers will disable the port for 5 minutes or so after that’s detected. Checking your drivers is the other place I would start.


Reviewed both logs. The journal confirms a Class 4 failure — DOE mailbox stuck on 000f:01:00.0 with PCIe link collapsed to x0 from the first second of boot. The GPU was non-functional before any workload started. One hour later the NVIDIA driver’s work queue thread (nv_queue) locked up trying to access registers over the dead link — that’s what you’re seeing as the “crash without load.”
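To confirm the link state yourself, here is a minimal sketch that reads the negotiated PCIe width from lspci. It assumes the GPU sits at 000f:01:00.0 as reported in the journal, and the helper names link_width and check_gpu_link are made up for illustration.

```shell
# Sketch: report the negotiated PCIe link width for the GPU.
# The default address comes from the journal analysis; override GPU_BDF if yours differs.
GPU_BDF="${GPU_BDF:-000f:01:00.0}"

# Extract the "Width xN" value of the LnkSta line from `lspci -vv` text on stdin.
# Prints nothing if no LnkSta line is present.
link_width() {
    sed -n 's/.*LnkSta:.*Width \(x[0-9]*\).*/\1/p' | head -n 1
}

# Query the device (root is needed to read the PCIe capability registers).
check_gpu_link() {
    local width
    width="$(sudo lspci -s "$GPU_BDF" -vv | link_width)"
    echo "GPU $GPU_BDF link width: ${width:-unknown}"
    if [ "$width" = "x0" ]; then
        echo "Link is down; the driver cannot reach the GPU."
    fi
}
```

Run check_gpu_link right after boot and again after a crash; a width of x0 in both cases would match what the journal shows.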

The current logs are incomplete, so I need more data to continue the diagnosis. The sudo_dmesg.txt file was filtered with grep and captured nothing because the GPU never initialized on this boot.

Please provide:

sudo journalctl -b -1 > journalctl-previous-boot.log
sudo ras-mc-ctl --errors > rasdaemon-errors.txt
sudo dmesg > dmesg-full.txt
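
If it is easier, the three captures can be bundled into one directory. This is only a sketch: the function name collect_gpu_logs and the SUDO override are illustrative, and it assumes ras-mc-ctl (the query tool that ships with rasdaemon) is installed, skipping that step if it is not.

```shell
# Sketch: gather the requested logs into one timestamped directory.
# Set SUDO to an empty string if you are already root.
collect_gpu_logs() {
    local out_dir="${1:-gpu-logs-$(date +%Y%m%d-%H%M%S)}"
    local sudo_cmd="${SUDO-sudo}"
    mkdir -p "$out_dir" || return 1

    # Previous boot, i.e. the one in which the GPU dropped off.
    $sudo_cmd journalctl -b -1 --no-pager > "$out_dir/journalctl-previous-boot.log"

    # Full, unfiltered kernel ring buffer for the current boot.
    $sudo_cmd dmesg > "$out_dir/dmesg-full.txt"

    # Hardware errors recorded by rasdaemon, if its query tool is present.
    if command -v ras-mc-ctl >/dev/null 2>&1; then
        $sudo_cmd ras-mc-ctl --errors > "$out_dir/rasdaemon-errors.txt"
    fi

    # Print the directory so it can be archived and attached.
    echo "$out_dir"
}
```

Calling collect_gpu_logs with no argument creates a gpu-logs-<timestamp> directory in the current working directory.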


I’m interested in this too. I’ve been having the exact same issue where the GX10s will randomly crash for no apparent reason.

I found this post that may help:

I had:
docker info | grep -i cgroup
Cgroup Driver: systemd
Cgroup Version: 2
cgroupns

It has now been stable for 24 hours with:
docker info | grep -i cgroup
Cgroup Driver: cgroupfs
Cgroup Version: 2
cgroupns

I’ll keep it running continuously and test it under load with a few OpenWebUI users over the week. If everything remains stable, I’ll confirm this as the solution.
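
For anyone who wants to try the same change, here is a minimal sketch. The post does not say how the driver was switched, so the daemon.json edit in the comments is an assumption based on Docker's documented native.cgroupdriver exec option, and the helper name cgroup_driver is made up.

```shell
# Print the cgroup driver from `docker info` text on stdin.
cgroup_driver() {
    sed -n 's/^ *Cgroup Driver: *//p' | head -n 1
}

# Usage on a live system:
#   docker info 2>/dev/null | cgroup_driver
#
# Assumed change: add this key to /etc/docker/daemon.json, merging it with
# whatever is already there (for example a "runtimes" entry for the NVIDIA
# runtime) rather than overwriting the file:
#
#   "exec-opts": ["native.cgroupdriver=cgroupfs"]
#
# Then restart the daemon and check again:
#   sudo systemctl restart docker
#   docker info 2>/dev/null | cgroup_driver
```

Restarting the Docker daemon stops running containers unless live-restore is enabled, so plan for a short outage of vLLM and Open WebUI.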

Thank you for your feedback.
The LAN IPs are all fine, with no loopbacks; the GX10 has never had connection problems.
The only issue is the GPU, which crashes frequently.
I hope switching to cgroupfs solves it.

Thank you for your feedback.
I’m trying the configuration with Cgroup Driver: cgroupfs. I’ll see this week whether it solves the GPU problem.