Hello!
I’m looking for help with a problem on a Gigabyte 4090.
I use it to run some models, but the GPU seems to fall off the bus: once that happens, running nvidia-smi returns “Unable to determine the device handle for GPU0000:01:00.0: Unknown Error”.
I’ve already read many posts suggesting limiting power and lowering clock frequencies. I tried both and ended up with the same result.
nvidia-smi dmon works correctly until the device drops off the bus, at which point it stops displaying the temperature.
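In case it helps anyone reproduce this, here is a minimal sketch of how the last good reading before the drop could be captured. It polls nvidia-smi with the standard --query-gpu fields and stops at the first failed call. The helper names (sample, watch), the one-second interval, and the assumption that a dropped GPU shows up as a non-zero exit code or unparsable output are illustrative choices, not something I have verified on this machine.

```python
import subprocess
import time

# Standard nvidia-smi query options; the field selection is an illustrative choice.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm",
    "--format=csv,noheader,nounits",
]


def sample(run=subprocess.run):
    """Return [timestamp, temp, power, sm_clock] as strings, or None on failure."""
    proc = run(QUERY, capture_output=True, text=True)
    if proc.returncode != 0:
        return None  # assumption: a dropped GPU makes nvidia-smi exit non-zero
    lines = proc.stdout.strip().splitlines()
    if not lines:
        return None
    fields = [f.strip() for f in lines[0].split(",")]  # first GPU only
    return fields if len(fields) == 4 else None


def watch(interval_s=1.0, run=subprocess.run, sleep=time.sleep, max_samples=None):
    """Poll until a sample fails; return the last good sample (or None)."""
    last_good = None
    taken = 0
    while max_samples is None or taken < max_samples:
        current = sample(run)
        if current is None:
            return last_good
        last_good = current
        print(", ".join(current))
        taken += 1
        sleep(interval_s)
    return last_good


if __name__ == "__main__":
    print("last good sample:", watch())
```

The run and sleep parameters exist only so the loop can be exercised without a GPU. On the real machine, watch() would be left running while the model job is active.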
I don’t know if this will help, but the system is also unstable at power-up. I get the following errors:
[ 4.602996] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 4.602997] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 4.602998] pcieport 0000:00:01.0: device [8086:a70d] error status/mask=00001000/00002000
[ 4.602998] pcieport 0000:00:01.0: [12] Timeout
At boot, the operating system either hangs at this point or continues and boots successfully.
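For reference, the status/mask values in the AER line can be decoded bit by bit. The sketch below is a minimal example that assumes the standard bit layout of the PCIe AER Correctable Error Status register. The function names (decode_bits, decode_aer_line) are made up for illustration.

```python
import re

# Bit positions of the PCIe AER Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" part of a kernel AER line.
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of all bits set in a 32-bit register value."""
    return [
        CORRECTABLE_BITS.get(bit, f"bit {bit} (reserved/unknown)")
        for bit in range(32)
        if (value >> bit) & 1
    ]


def decode_aer_line(line):
    """Decode status and mask from one AER log line, or return None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    return {"reported": decode_bits(status), "masked": decode_bits(mask)}


if __name__ == "__main__":
    line = ("[ 4.602998] pcieport 0000:00:01.0: device [8086:a70d] "
            "error status/mask=00001000/00002000")
    print(decode_aer_line(line))
```

For the line above, status 00001000 has only bit 12 set (Replay Timer Timeout), which matches the “[12] Timeout” line in the log, and mask 00002000 has only bit 13 set (Advisory Non-Fatal Error).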
I’ve also tried different driver versions (525 through 545), and none of them helped.
Hardware specs: 1000 W PSU, 2 SSDs, 13900F with liquid cooling, 32 GB RAM, Gigabyte 4090.
OS: Ubuntu Server 22.04 LTS
nvidia-bug-report.log (1.7 MB)