I'm trying to pre-train nanochat on a DGX Spark, but the machine keeps losing power abruptly, as if it were cut at the hardware level. Has anyone had a similar experience? How did you resolve it?
nvidia-bug-report.log, which includes the kernel messages, contains this line:
ACPI: thermal: [Firmware Bug]: No valid trip points!
I can't find any other obvious problem in the log, so please point me in the right direction if you have recommendations. The machine (DGX Spark) simply shuts down without warning during training. I have seen the CPU temperature go above 95 °C from time to time, and I assume that is the root cause.
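
To check the overheating theory, a small logger could run alongside training and record the last temperatures before the power cut. The following is only a minimal sketch: it assumes the standard Linux sysfs interface (/sys/class/thermal/thermal_zone*/temp, in millidegrees Celsius) is populated on the DGX Spark, which I have not verified. The function names, the log file name, and the 2-second interval are placeholders of mine.

```python
import glob
import os
import time


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"thermal_zoneN:type": degrees_C} for each readable zone under base."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                kind = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                # The kernel reports millidegrees Celsius.
                temp_c = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # zone missing or unreadable: skip it
        readings[f"{os.path.basename(zone)}:{kind}"] = temp_c
    return readings


def log_once(log_file, base="/sys/class/thermal"):
    """Append one timestamped line of readings and force it to disk."""
    readings = read_thermal_zones(base)
    fields = " ".join(f"{name}={temp:.1f}C" for name, temp in readings.items())
    log_file.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}\n")
    log_file.flush()
    # fsync so the most recent line has the best chance of surviving a sudden power loss.
    os.fsync(log_file.fileno())
    return readings


def run(path="thermal_log.txt", interval_s=2.0):
    """Log forever (hypothetical entry point; stop it with Ctrl+C)."""
    with open(path, "a") as log_file:
        while True:
            log_once(log_file)
            time.sleep(interval_s)
```

Calling run() in a second terminal before starting training would leave thermal_log.txt with the readings from the final seconds before a shutdown. If the last lines show temperatures climbing toward the 95 °C range, that supports the thermal explanation; if they look normal, the cause is more likely elsewhere (for example the power supply).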