I know this is going to sound like an “enterprise support” question per the pinned post, but our support has expired and we are trying hard to get it back in place with our current FY order of A100s. So my hope is that this won’t get deleted.
Our DGX-1 was reimaged locally with RHEL (7.8, I think). We’ve been using it successfully for months with no issues except this one: after a reboot, the fans spin up high and nvidia-smi never returns.
Basically, I have exactly the problem described in the thread “Opening nvidia devices leads to unrecoverable hangs + zombie processes”. I reproduced the same behavior with strace: “strace nvidia-smi” hangs when it tries to open /dev/nvidiactl, and the process is unkillable.
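In case it helps anyone reproduce this without tying up a terminal, here is a minimal sketch of probing the open from a child process with a timeout. The helper name probe_open, the 10-second timeout, and the O_RDWR flag are my own assumptions for illustration, not anything taken from the driver or from nvidia-smi. The parent deliberately does not try to kill the child on timeout, since waiting on an unkillable child would hang the parent too.

```python
import subprocess
import sys


def probe_open(path, timeout_s=10.0):
    """Try to open `path` in a child process.

    Returns 'opened', 'error', or 'hung'. Hypothetical helper for illustration.
    """
    # The child does nothing but open and close the path (O_RDWR is an assumption).
    child_code = (
        "import os, sys\n"
        "try:\n"
        "    fd = os.open(sys.argv[1], os.O_RDWR)\n"
        "except OSError:\n"
        "    sys.exit(1)\n"
        "os.close(fd)\n"
    )
    child = subprocess.Popen([sys.executable, "-c", child_code, path])
    try:
        # Popen.wait(timeout=...) only polls; it does not kill the child.
        rc = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Do not kill-and-wait here: a child stuck in the kernel cannot be reaped.
        return "hung"
    return "opened" if rc == 0 else "error"


if __name__ == "__main__":
    print(probe_open("/dev/nvidiactl"))
```

If this prints "hung", the child is left behind, presumably in the same unkillable state as the strace run.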
I cannot find an existing process to kill, and I see no clues in either dmesg or /var/log/messages.
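For anyone who wants to check the same thing, a rough sketch of listing processes stuck in uninterruptible sleep (state D) by reading /proc is below. It assumes a Linux /proc layout with per-process stat and wchan files; the function names are hypothetical, and wchan may just show 0 or be empty on some kernels.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen].strip())
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm, wchan) for every process currently in state D."""
    blocked = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat"), errors="replace") as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan"), errors="replace") as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"
        blocked.append((pid, comm, wchan))
    return blocked


if __name__ == "__main__":
    for pid, comm, wchan in find_uninterruptible():
        print(f"{pid}\t{comm}\t{wchan}")
```

The wchan column names the kernel function the process is sleeping in, when the kernel exposes it, which might give a hint even when dmesg is silent.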
I asked our administrators about this. They said the system did the same thing the last time it was rebooted, that it “eventually” cleared up, and that they did nothing to fix it.
Do you have any ideas as to why opening /dev/nvidiactl would cause a hard hang?