I recently installed nvidia driver 510.47.03, and for the second time, after leaving the machine unattended for a few hours, I found it unresponsive. The screen is turned on, but moving the mouse and using the keyboard does not prompt any response. The kernel log strongly suggests that the nvidia kernel module is to blame.
Hi edmcman,
Can you please downgrade the nvidia driver and confirm whether you still see the same issue, or whether the problem occurs only with nvidia driver 510.47.03?
I tried to reproduce the issue locally on one of my test systems, but had no luck.
I already did that because I didn’t want my machine to lock up each night :-) I downgraded to 455.45.01 because I think that is what I was using previously. Is that too old, or would you like me to try a newer version? If so, what do you recommend? I don’t really understand the driver numbering convention.
My machine did not lock up last night on 455.45.01. I should say that I’ve only experienced the lock up on 2 nights. They were the first two nights I was using the 510.47.03 driver, but that is not a ton of data points.
Thanks for providing the data point. We are trying to analyze the issue with the provided logs, as I do not have a local repro.
There is an Xid 61 in the dmesg log at the time of the crash, but the full bug report was taken after a reboot, so I cannot see anything wrong in it. Can you please take the bug report again as soon as Xid 61 is seen, without rebooting (possibly over ssh if the system becomes inaccessible)?
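One way to capture the report at the moment the Xid appears is to follow the kernel log and trigger the collector automatically. The following is only a minimal sketch: it assumes a systemd machine where journalctl is available, that Xid lines look roughly like "NVRM: Xid (PCI:0000:01:00): 61, ...", and that nvidia-bug-report.sh is on the PATH and the script runs as root. The function names are hypothetical, and it can only help if userspace is still running when the Xid is logged.

```python
import re
import subprocess

# Assumed shape of an Xid line in the kernel log, e.g.
#   NVRM: Xid (PCI:0000:01:00): 61, pid=1234, ...
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]*)\): (?P<xid>\d+)")


def parse_xid(line):
    """Return (device, xid_number) if the line reports an Xid, else None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return m.group("dev"), int(m.group("xid"))


def watch_kernel_log(target_xid=61):
    """Follow new kernel messages and run the bug report on target_xid."""
    # -k: kernel messages only, -f: follow, -n 0: skip old lines,
    # -o cat: print the bare message text.
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-n", "0", "-o", "cat"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        for line in proc.stdout:
            hit = parse_xid(line)
            if hit is not None and hit[1] == target_xid:
                # Collect the report immediately, before any reboot.
                subprocess.run(["nvidia-bug-report.sh"], check=False)
                break
    finally:
        proc.terminate()
```

Calling watch_kernel_log() from a root shell (or an ssh session) before leaving the machine unattended would start the watcher.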
Ok, I am back on 510, and should be set up to automatically run the nvidia debug script tomorrow morning. So if the problem occurs tonight and the machine is still functioning enough to do stuff, I should have a debug log. 🤞
Bad news. There was a hang last night, but I was unable to collect a log. I scheduled the log to be collected at 7:00am using the at command. But the log was collected shortly after I restarted the machine, which suggests that the machine was not functional enough to run the at job. In retrospect, this is not surprising, as the kernel log is also completely empty for several hours after the hang, and this never happens during normal operation.
What can I do to help debug? I don’t have much experience debugging kernel crashes. I did not try to use magic sysrq, but in retrospect I probably should have.
One more update. I had another lock up. I was unable to do anything with magic sysrq. It did seem to be responding, but after I did alt + magic sysrq + v to restore the framebuffer console, the machine stopped responding to further magic sysrq commands.
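It may be worth checking which magic sysrq functions the kernel actually allows, since many distributions restrict them by default. Below is a minimal sketch that decodes the bitmask in /proc/sys/kernel/sysrq, using the bit meanings listed in the kernel's sysrq documentation. The function names are hypothetical, and this is not a claim that the bitmask explains the behavior described above.

```python
# Bit meanings as listed in the Linux kernel sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the list of sysrq function groups enabled by `value`."""
    if value == 0:
        return []  # sysrq disabled completely
    if value == 1:
        return list(SYSRQ_BITS.values())  # all functions enabled
    return [desc for bit, desc in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current sysrq setting from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

For example, decode_sysrq(read_sysrq()) lists what is currently permitted; a value of 176 would allow only sync, remount read-only, and reboot/poweroff.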
For some reason I went a long time without any crashes, but finally had one. I was able to collect a kdump, but I don’t have the debug symbols for that kernel so it’s mostly useless. I’m rebuilding the kernel so I have the debug symbols.
More good news. I was able to get a kdump during one of the soft lockups. In fact, it only took a few hours. I think that perhaps soft lockups may have been occurring before the machine becomes unusable. This means that perhaps I can actually run the nvidia debugging tool. Fingers crossed for that.
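If soft lockups do show up in the kernel log before the machine becomes unusable, a small scanner could flag them early. This is a minimal sketch, assuming the usual watchdog message format "BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]"; the function names are hypothetical.

```python
import re

# Assumed watchdog message shape, e.g.
#   watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [Xorg:1234]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return a dict describing the soft lockup, or None if no match."""
    m = LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "seconds": int(m.group("secs")),
        "comm": m.group("comm"),  # task name; may itself contain ':'
        "pid": int(m.group("pid")),
    }


def find_soft_lockups(lines):
    """Collect every soft lockup report found in an iterable of log lines."""
    return [hit for hit in map(parse_soft_lockup, lines) if hit is not None]
```

Feeding it the lines of a saved kernel log, for example open("kern.log") with a path of your choosing, returns one entry per reported lockup.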
The kdump is a memory dump rather than a log file. I can’t share it directly because this is my work machine and it might have sensitive information in it. If you are familiar with kdumps, I can extract any information you are interested in via the crash tool.
I’m still optimistic that I’ll be able to run the nvidia log collector, but no luck on that yet.
Indeed, the attached files don’t contain useful information, and unfortunately I have not been able to reproduce the issue so far.
Please try the latest released driver and share a bug report if the issue persists.