I have a GTX 1650M / Max-Q in a Chinese mini PC (Chatreey G1) running Linux 6.10.5 with driver version 555.58.02.
It had been running fine for a while with very lightweight games (mostly emulators). Recently, though, the dmesg output started to show this:
[ 41.390463] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 41.871813] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 42.357223] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 42.845124] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 43.337150] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 43.831988] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 44.320729] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 44.808054] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 45.294736] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 45.784977] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 46.273888] NVRM: Going over RM unhandled interrupt threshold for irq 147
[ 46.762339] NVRM: Going over RM unhandled interrupt threshold for irq 147
…and a while later the video hung. Rebooting via SSH also hung the machine.
Now the video board does not work anymore, and dmesg shows:
[ 15.137479] NVRM: GPU at PCI:0000:01:00: GPU-06f3f9aa-1395-f73a-6b8b-bb2a46a39134
[ 15.137482] NVRM: Xid (PCI:0000:01:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0
[ 15.138567] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xb:2477)
[ 15.138871] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 15.284012] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x40:2477)
[ 15.284345] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 16.363257] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x40:2477)
This happens on a number of driver versions I’ve tried (except that some of them do not print the Xid line). The DRAM address mentioned is consistently the same.
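For anyone who wants to check the same thing across boots, here is a minimal sketch of how the Xid number and the DRAM field can be pulled out of saved dmesg lines. The regular expression is an assumption based only on the single Xid line quoted above (other driver versions may format it differently), and the names `parse_xid` and `dram_values` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re

# Assumed pattern, derived only from the Xid line quoted in this post.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<bus>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r".*?DRAM:(?P<dram>-?\d+)"
)


def parse_xid(line):
    """Return (bus, xid, dram) for an Xid line with a DRAM field, else None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return m.group("bus"), int(m.group("xid")), int(m.group("dram"))


def dram_values(lines):
    """Collect the distinct DRAM values seen across all matching lines."""
    parsed = (parse_xid(line) for line in lines)
    return {p[2] for p in parsed if p is not None}


sample = (
    "[ 15.137482] NVRM: Xid (PCI:0000:01:00): 140, pid='<unknown>', "
    "name=<unknown>, An uncorrectable ECC error detected (possible "
    "firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0"
)
print(parse_xid(sample))  # ('PCI:0000:01:00', 140, -1840691974)
```

Feeding the Xid lines from several boots into `dram_values` should give a set with a single element if the reported value really is identical each time.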
nvidia-smi says “No devices were found”, whereas previously the board did show up.
Nothing has changed at all between the board working and this, and I’m pretty sure no overheating has occurred – chassis is well-ventilated, no heavy load, fans working.
Is this a sign of a dead board? What could be the cause?