Good day!
We have a DGX A100 system that hangs relatively frequently. By this I mean the machine shuts down and all 6 power LEDs start to blink orange. The only way to restart it is to unplug all power supplies completely and plug them back in.
On the BMC site we are getting JFFS2 related errors like these two:
jffs2: Error garbage collecting node at 00083978! -
jffs2: Argh. No free space left for GC. nr_erasing_blocks is 0. nr_free_blocks is 0. (erasableempty: yes, erasingempty: yes, erasependingempty: yes) -
Is it possible to flush the JFFS2 file system or is this related to a hardware error?
Thank you for any ideas!
Hi @bertok.csanad1 ,
I’d start with erasing the System Event Log (ipmitool sel clear, for example), and then flashing the latest DGX A100 firmware (specifically re-flashing the BMC active and backup images). If that doesn’t clean up the filesystem, then contact NVIDIA Enterprise Support and they can help you out (see the pinned post at the top of this forum).
ScottE
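Since clearing the System Event Log discards entries that support may later want to see, it can be worth saving a copy first. Below is a minimal sketch of that idea, not an NVIDIA-provided procedure: it assumes ipmitool is installed and run locally on the DGX host with sufficient privileges. The function name backup_and_clear_sel and the injectable run parameter are hypothetical helpers introduced only for illustration.

```python
import subprocess
from pathlib import Path


def backup_and_clear_sel(backup_path, run=subprocess.run):
    """Save the System Event Log to a file, then clear it.

    `run` is injectable (hypothetical convenience) so the function can be
    exercised without ipmitool being present.
    Returns the number of SEL lines that were saved.
    """
    # `ipmitool sel list` prints the current SEL entries, one per line.
    listing = run(
        ["ipmitool", "sel", "list"],
        capture_output=True, text=True, check=True,
    )
    Path(backup_path).write_text(listing.stdout)

    # Clear only after the backup has been written successfully.
    run(
        ["ipmitool", "sel", "clear"],
        capture_output=True, text=True, check=True,
    )
    return len(listing.stdout.splitlines())
```

Calling backup_and_clear_sel("sel-backup.txt") would write the current entries to that file before clearing. Because check=True is set, a failing ipmitool call raises an exception and the clear step is never reached.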
Dear @ScottEllis!
Thank you very much for this idea! I will try to erase the SEL when I can physically restart the machine (it is hanging right now :( ). The firmware and OS are the latest versions (we updated everything in the hope that it would solve the problem, but unfortunately it did not). If this does not work, I’ll open a ticket with support and update the post when a solution is found.
Thank you!
UPDATE: the problem was not related to the System Event Log. The errors mentioned in my original post were the consequence of the main problem and not the cause.
After a discussion with the extremely helpful support team, we figured out that during a hang the physical 7-segment display was showing the code Pd-4b (alternating between Pd and 4b), which indicates a CPU tray power delivery problem. After replacing the CPU tray, everything works.
Maybe this helps others in the future :)