GPU has fallen off the bus

Hello!

I’m looking for help with some problems with a Gigabyte 4090.
I use it to run some models, but the GPU seems to fall off the bus: after that happens, nvidia-smi returns “Unable to determine the device handle for GPU0000:01:00.0: Unknown Error”.
I’ve already read many posts suggesting limiting power and lowering clock frequencies. I tried both and got the same result in the end.
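For anyone retracing the power and clock limiting step, here is a minimal sketch of how it could be scripted. It only assumes the standard nvidia-smi options -pm (persistence mode), -pl (power limit in watts) and -lgc (lock GPU clocks to a min,max range in MHz). The helper names and the numbers (300 W, 210 to 1800 MHz) are placeholders for illustration, not the values used in this post and not a recommendation.

```python
import subprocess


def build_limit_commands(gpu_index, power_watts, min_mhz, max_mhz):
    """Return nvidia-smi argument lists that cap power and lock the GPU clock range.

    All numeric values are supplied by the caller; nothing here is tuned
    for a specific card.
    """
    gpu = str(gpu_index)
    return [
        # Persistence mode keeps the driver loaded so the limits stay applied.
        ["nvidia-smi", "-i", gpu, "-pm", "1"],
        # Power limit in watts.
        ["nvidia-smi", "-i", gpu, "-pl", str(power_watts)],
        # Lock GPU clocks to the range min,max (MHz).
        ["nvidia-smi", "-i", gpu, "-lgc", f"{min_mhz},{max_mhz}"],
    ]


def apply_limits(commands):
    """Run each command in order; raises CalledProcessError if one fails."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Print the commands instead of running them (placeholder values).
for argv in build_limit_commands(0, 300, 210, 1800):
    print(" ".join(argv))
```

Calling apply_limits() on the returned list would actually change the settings and normally needs root. The limits do not survive a reboot, so they have to be reapplied after each restart.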
nvidia-smi dmon works correctly until the device drops off the bus, at which point it stops displaying the temperature.
I don’t know if this information will help, but the system is also unstable at power-up. I get the following errors during boot:
[ 4.602996] pcieport 0000:00:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 4.602997] pcieport 0000:00:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 4.602998] pcieport 0000:00:01.0: device [8086:a70d] error status/mask=00001000/00002000
[ 4.602998] pcieport 0000:00:00:01.0: [12] Timeout
During boot, the operating system sometimes hangs at this point and sometimes continues and boots successfully.
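As a reading aid for the AER lines above, here is a small sketch that decodes the status/mask pair into named bits. The bit positions are those of the PCIe Correctable Error Status register, and the short names are the ones the Linux AER driver prints. Bit 12 ("Timeout") is the Replay Timer Timeout, a data link layer retry event. The function names are made up for this example.

```python
import re

# Bit positions in the PCIe Correctable Error Status register, with the
# short names the Linux AER driver uses when printing them.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of all known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_status_line(line):
    """Decode one 'error status/mask=...' kernel log line, or return None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status, mask = (int(group, 16) for group in match.groups())
    return {"status": decode_bits(status), "mask": decode_bits(mask)}


# The status line quoted in this post.
line = ("[ 4.602998] pcieport 0000:00:01.0: device [8086:a70d] "
        "error status/mask=00001000/00002000")
print(decode_status_line(line))
# → {'status': ['Timeout'], 'mask': ['NonFatalErr']}
```

So the only status bit set is the replay timer timeout, which matches the "[12] Timeout" line. The mask value only shows that advisory non-fatal errors are masked. Bits outside the table are ignored by this sketch.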

I’ve also tried different driver versions, from 525 to 545, and it doesn’t help.

Hardware specs: 1000 W PSU, 2 SSDs, 13900F with liquid cooling, 32 GB RAM, Gigabyte 4090.
OS: Ubuntu server 22.04 LTS
nvidia-bug-report.log (1.7 MB)


Please update the mainboard’s BIOS. The AER messages from :01.0 point to general PCIe issues. That bridge isn’t even connected to the NVIDIA GPU, which is on bridge :03.1.
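One way to check which bridge a device sits behind is to resolve its sysfs link and read the parent path component. The sketch below assumes a Linux host with /sys mounted. The helper names are hypothetical, and the default address is the GPU address from the nvidia-smi error in the first post.

```python
import os
import re

# PCI address in domain:bus:device.function form, e.g. 0000:00:01.0
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def parent_bridge(resolved_path):
    """Return the PCI address directly above the device in a resolved sysfs path.

    Returns None when the parent component is not a PCI address, for
    example when the device sits directly on the root bus.
    """
    parts = resolved_path.rstrip("/").split("/")
    if len(parts) < 2:
        return None
    parent = parts[-2]
    return parent if BDF_RE.match(parent) else None


def find_parent_bridge(bdf="0000:01:00.0"):
    """Look up the upstream bridge of a PCI device through sysfs."""
    link = f"/sys/bus/pci/devices/{bdf}"
    if not os.path.exists(link):
        return None  # device not present, or not a Linux host
    return parent_bridge(os.path.realpath(link))


if __name__ == "__main__":
    print(find_parent_bridge())
```

Running lspci -t on the machine shows the same parent/child tree in text form.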

Thank you for the answer!

Successfully updated the BIOS of this PRIME B760M-A WIFI from version 1604 to 1630. So far, nothing seems to have changed: the same errors appear at system startup and when I run ML workloads.
I’ll try another slot for the GPU…
nvidia-bug-report.log (639.5 KB)