I purchased 2 NVIDIA DGX Spark units on October 25, 2025. One unit (Spark-B) works perfectly. The second unit (Spark-A) repeatedly crashes under sustained GPU load.
Issue:
Unit powers off abruptly during GPU inference workload
No kernel errors or warnings in logs; the unit just stops mid-operation
Occurs within 30-60 minutes of sustained GPU use
4 crashes in 3 days under normal workload
Comparison:
Identical Spark-B unit running the same workload for the same duration: zero crashes
Same software, same power source, same environment
Evidence attached:
Purchase receipt
System logs showing crash pattern (boots end mid-inference with no errors)
For the "Spark B" unit, please install and run Field Diagnostics (NVIDIA DGX Spark Field Diagnostics | NVIDIA), then share the resulting logs with me by DM. Thank you.
Based on the log bundles you shared, Field Diagnostics did not run successfully. The logs indicate that the GPU driver could not be unloaded.
This is likely because a process or service is currently using the GPU. The driver must be fully unloaded before Field Diagnostics can run.
Please power cycle the unit, then run Field Diagnostics again.
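If the driver still refuses to unload after a power cycle, it may help to see which processes have GPU device files open. The following is a minimal sketch, not an NVIDIA tool: it walks `/proc` and reports processes holding descriptors under `/dev/nvidia` or `/dev/dri/`. The function names and the device path prefixes are assumptions for illustration, and it needs root to see other users' processes. An empty result does not prove the module is free, since a kernel module can also be held without any user-space file descriptor.

```python
import os


def find_gpu_holders(proc_root="/proc", prefixes=("/dev/nvidia", "/dev/dri/")):
    """Return {pid: set of device paths} for processes with GPU device files open.

    The prefixes are an assumption about where the GPU device nodes live.
    """
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            if target.startswith(prefixes):
                holders.setdefault(int(entry), set()).add(target)
    return holders


def process_name(pid, proc_root="/proc"):
    """Best-effort short command name for a pid, read from /proc/<pid>/comm."""
    try:
        with open(os.path.join(proc_root, str(pid), "comm")) as f:
            return f.read().strip()
    except OSError:
        return "?"


if __name__ == "__main__":
    for pid, devices in sorted(find_gpu_holders().items()):
        print(pid, process_name(pid), ", ".join(sorted(devices)))
```

Run it with `sudo python3` shortly before launching Field Diagnostics. Any process it lists is a candidate for what is keeping the driver loaded.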
Launching field diagnostics ...
Command Line: ./partnerdiag --field
Removing Nvidia drivers and services...
Stopping 'docker.service', but its triggering units are still active:
docker.socket
Stopping 'systemd-udevd.service', but its triggering units are still active:
systemd-udevd-control.socket, systemd-udevd-kernel.socket
rmmod: ERROR: Module nvidia_drm is in use
rmmod: ERROR: Module nvidia_drm is in use
rmmod: ERROR: Module nvidia_drm is in use
rmmod: ERROR: Module nvidia_drm is in use
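For anyone comparing several Field Diagnostics runs, the failure signature in the output above can be picked out automatically. This is a rough sketch that only matches the message shapes visible in this thread (`rmmod: ERROR: Module ... is in use` and the `Stopping '...', but its triggering units are still active:` lines). The function name `summarize_diag_output` is hypothetical, and other versions of the tool may print different text.

```python
import re

# Patterns copied from the message shapes seen in the console output above.
MODULE_IN_USE = re.compile(r"rmmod: ERROR: Module (\S+) is in use")
UNIT_STILL_TRIGGERED = re.compile(
    r"Stopping '([^']+)', but its triggering units are still active:"
)


def summarize_diag_output(text):
    """Count modules that failed to unload and list units with active triggers."""
    modules_in_use = {}
    active_triggers = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        module = MODULE_IN_USE.search(line)
        if module:
            name = module.group(1)
            modules_in_use[name] = modules_in_use.get(name, 0) + 1
            continue
        unit = UNIT_STILL_TRIGGERED.search(line)
        if unit and i + 1 < len(lines):
            # The triggering units are printed on the following line, comma-separated.
            names = [s.strip() for s in lines[i + 1].split(",") if s.strip()]
            active_triggers[unit.group(1)] = names
    return {"modules_in_use": modules_in_use, "active_triggers": active_triggers}


# Excerpt of the output posted above, used as sample input.
SAMPLE = """Removing Nvidia drivers and services...
Stopping 'docker.service', but its triggering units are still active:
docker.socket
Stopping 'systemd-udevd.service', but its triggering units are still active:
systemd-udevd-control.socket, systemd-udevd-kernel.socket
rmmod: ERROR: Module nvidia_drm is in use
rmmod: ERROR: Module nvidia_drm is in use
rmmod: ERROR: Module nvidia_drm is in use
rmmod: ERROR: Module nvidia_drm is in use
"""

if __name__ == "__main__":
    print(summarize_diag_output(SAMPLE))
```

A non-empty `modules_in_use` entry means the driver was never unloaded, so the run did not get as far as testing the hardware.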