I have a System76 Mira R3 machine with an RTX 4090 GPU. It worked great for about 3 weeks after I got it, but a couple of days ago it suddenly started having problems. The GPU fan spins up to what sounds like max speed and the GPU stops responding. nvidia-smi just gives the error “Unable to determine the device handle for GPU0000:01:00.0: Unknown Error”, and I see this in the system log:
[Fri Jul 7 14:06:55 2023] NVRM: GPU at PCI:0000:01:00: GPU-a8a22861-ba33-c1b6-e2f7-ec993989ad48
[Fri Jul 7 14:06:55 2023] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Fri Jul 7 14:06:55 2023] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Everything else keeps working while this is going on, but when I try to shut down, the system hangs.
I’m attaching the log output from nvidia-bug-report.sh.
This is a Pop!_OS 22.04 Linux system that I bought about a month ago. I am using the 4090 for machine learning model training. There are no monitors connected to the 4090. I also have a GT 1030 in the system driving a second monitor, and it’s having no problems. I haven’t been able to track down anything that triggers the failure. It has happened while I’m building a model, while I’m just reading email, and even when the system is sitting idle. Is there anything I can do to fix this, or is the GPU itself bad and in need of replacement?
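Since I can't tie the failures to any particular workload, one thing I could do is pull the Xid events out of the kernel log and compare their timestamps against what the machine was doing. Below is a minimal sketch of that, assuming the log lines look like the excerpt above (the human-readable timestamp format that `dmesg -T` prints). The names `XID_RE`, `parse_xid_events`, and `SAMPLE` are my own placeholders, not part of any NVIDIA tool.

```python
import re

# Matches NVRM Xid lines in the format shown in the log excerpt, e.g.
# [Fri Jul 7 14:06:55 2023] NVRM: Xid (PCI:0000:01:00): 79, pid=..., GPU has fallen off the bus.
# Assumption: timestamps are the bracketed human-readable form printed by `dmesg -T`.
XID_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<detail>.*)$"
)


def parse_xid_events(lines):
    """Return one dict per Xid line found in an iterable of log lines."""
    events = []
    for line in lines:
        match = XID_RE.match(line.strip())
        if match:
            events.append(
                {
                    "time": match["time"],
                    "pci": match["pci"],
                    "xid": int(match["xid"]),
                    "detail": match["detail"],
                }
            )
    return events


# The three lines from my system log, used here as sample input.
SAMPLE = [
    "[Fri Jul 7 14:06:55 2023] NVRM: GPU at PCI:0000:01:00: GPU-a8a22861-ba33-c1b6-e2f7-ec993989ad48",
    "[Fri Jul 7 14:06:55 2023] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "[Fri Jul 7 14:06:55 2023] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.",
]

for event in parse_xid_events(SAMPLE):
    print(event["time"], "Xid", event["xid"], "on", event["pci"], "-", event["detail"])
```

To use it on a real log, the lines of `dmesg -T` output (or a saved copy of them) would be passed to `parse_xid_events` in place of `SAMPLE`. It only extracts the events; it does not diagnose why the GPU dropped off the bus.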
nvidia-bug-report.log.gz (254.7 KB)