Description
First and foremost, I appreciate any help or guidance that anyone can provide. I am pretty much at my wits' end.
I have a dual-GPU desktop (System76) with two RTX 4090s. One of the two GPUs, the one in the top slot, randomly falls off the bus. I haven't been able to correlate it with any load: it can happen when all I am doing is reading email, or when I am running a machine learning program. Sometimes the system goes a few days without any problems; at the moment the GPU has fallen off 3 times in the past 24 hours. The only way to recover is to power the machine off with the power button, because rebooting from the GUI or the command line hangs.
The system is fairly new (about 6 months old). It has been shipped back to the manufacturer multiple times, and each time it was returned with the comment that they could not replicate the problem and hence could not repair it.
I am linking the output of nvidia-bug-report.sh to this topic in case it is helpful: Output of nvidia-bug-report.sh
The relevant section of dmesg is as follows:
[154565.721840] NVRM: GPU at PCI:0000:41:00: GPU-9acb9e54-2e15-8cdf-829f-c07a633c9f96
[154565.721845] NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[154565.721847] NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
[154568.708290] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[154568.708298] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[154568.708301] {1}[Hardware Error]: event severity: corrected
[154568.708304] {1}[Hardware Error]: Error 0, type: corrected
[154568.708306] {1}[Hardware Error]: fru_text: PcieError
[154568.708309] {1}[Hardware Error]: section_type: PCIe error
[154568.708312] {1}[Hardware Error]: port_type: 4, root port
[154568.708314] {1}[Hardware Error]: version: 0.2
[154568.708317] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[154568.708320] {1}[Hardware Error]: device_id: 0000:40:01.1
[154568.708324] {1}[Hardware Error]: slot: 0
[154568.708326] {1}[Hardware Error]: secondary_bus: 0x41
[154568.708328] {1}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1483
[154568.708331] {1}[Hardware Error]: class_code: 060400
[154568.708334] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0012
[154568.709979] pcieport 0000:40:01.1: AER: aer_status: 0x00000040, aer_mask: 0x00000000
[154568.709987] pcieport 0000:40:01.1: [ 6] BadTLP
[154568.709993] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
[155251.000557] snd_hda_codec_hdmi hdaudioC2D0: HDMI: invalid ELD buf size -1
(base) aganatra@system76-pc:~/Documents/system-logs$ nvidia-smi -i 0
Unable to determine the device handle for GPU0000:41:00.0: Unknown Error
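For anyone who wants to tally these events across a longer log, here is a minimal Python sketch that pulls the Xid code and the AER status out of dmesg-style lines like the ones above. The function names (`decode_aer_status`, `summarize`) are hypothetical helpers made up for illustration. The bit names follow the PCIe AER Correctable Error Status register as printed by the Linux AER driver, and the sketch assumes the `aer_status` value belongs to a corrected error, as the "event severity: corrected" line suggests here.

```python
import re

# Bit positions in the PCIe AER Correctable Error Status register, using the
# short names the Linux AER driver prints (e.g. "[ 6] BadTLP").
# Assumption: the status value comes from a *corrected* error report.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches e.g. "NVRM: Xid (PCI:0000:41:00): 79, pid=..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")
# Matches e.g. "pcieport 0000:40:01.1: AER: aer_status: 0x00000040, ..."
AER_RE = re.compile(
    r"pcieport ([0-9a-fA-F:.]+): AER: aer_status: (0x[0-9a-fA-F]+)"
)


def decode_aer_status(status):
    """Return the names of the correctable-error bits set in `status`."""
    return [
        name
        for bit, name in sorted(AER_CORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


def summarize(lines):
    """Collect (kind, pci_address, detail) tuples from dmesg-style lines."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(("xid", match.group(1), int(match.group(2))))
            continue
        match = AER_RE.search(line)
        if match:
            status = int(match.group(2), 16)
            events.append(("aer", match.group(1), decode_aer_status(status)))
    return events


if __name__ == "__main__":
    # Two lines copied from the dmesg excerpt in this post.
    sample = [
        "[154565.721845] NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', "
        "name=<unknown>, GPU has fallen off the bus.",
        "[154568.709979] pcieport 0000:40:01.1: AER: aer_status: 0x00000040, "
        "aer_mask: 0x00000000",
    ]
    for event in summarize(sample):
        print(event)
```

Saving the output of `dmesg` to a file and passing its lines to `summarize` should show whether every Xid 79 is accompanied by a corrected error on the same root port (0000:40:01.1 here), or whether the two occur independently.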
Environment
TensorRT Version: NA
GPU Type: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 545.29.06
CUDA Version:
CUDNN Version:
Operating System + Version: Linux system76-pc 6.6.10-76060610-generic #202401051437~1704728131~22.04~24d69e2~dev-Ubuntu SMP PREEMPT_DY x86_64 x86_64 x86_64 GNU/Linux