GPU fallen off bus

This has been happening almost every day recently and is seriously affecting my work. After each reboot the system only works normally for a short while before the GPU drops out again.
Kernel log:

May 25 23:40:57 xmu-NMR kernel: [ 1161.724285] NVRM: GPU at PCI:0000:89:00: GPU-4723a2af-e63f-10f1-256e-7b1bc2aa6692
May 25 23:40:57 xmu-NMR kernel: [ 1161.724288] NVRM: GPU Board Serial Number: 
May 25 23:40:57 xmu-NMR kernel: [ 1161.724291] NVRM: Xid (PCI:0000:89:00): 79, pid=7147, GPU has fallen off the bus.
May 25 23:40:57 xmu-NMR kernel: [ 1161.724294] NVRM: GPU 0000:89:00.0: GPU has fallen off the bus.
May 25 23:40:57 xmu-NMR kernel: [ 1161.724295] NVRM: GPU 0000:89:00.0: GPU is on Board .
May 25 23:40:57 xmu-NMR kernel: [ 1161.724324] NVRM: A GPU crash dump has been created. If possible, please run
May 25 23:40:57 xmu-NMR kernel: [ 1161.724324] NVRM: nvidia-bug-report.sh as root to collect this data before
May 25 23:40:57 xmu-NMR kernel: [ 1161.724324] NVRM: the NVIDIA kernel module is unloaded.

nvidia-bug-report.log.gz (3.6 MB)

I’ve been getting those recently too. Super annoying. The last time it happened, I noticed a red LED lit up next to the power connectors on my 3090. Since that suggests the issue could be related to the power supply, I reseated both power cables, on the GPU side and on the PSU side. I’m hoping this fixes it, but I’m not sure yet.

I’m also wondering whether this could be a driver issue, given that it seems to have started recently. I’m running the latest driver, 510.68.02.

Please try limiting the boost clocks with nvidia-smi -lgc at boot to check whether this is a PSU issue. If the Xid 79 errors stop while the clocks are capped, power delivery is the likely cause.
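For anyone unsure what that looks like in practice, here is a rough sketch. The clock values (210 and 1400 MHz) are placeholders I picked for illustration, not recommendations; choose values from the list printed by nvidia-smi -q -d SUPPORTED_CLOCKS. The helper names run and limit_clocks and the DRY_RUN switch are likewise made up for this example.

```shell
#!/bin/sh
# Sketch: cap GPU clocks to test whether power spikes trigger Xid 79.
# MIN_MHZ / MAX_MHZ are placeholder values (assumptions); pick real ones from
#   nvidia-smi -q -d SUPPORTED_CLOCKS
MIN_MHZ="${MIN_MHZ:-210}"
MAX_MHZ="${MAX_MHZ:-1400}"

# Print the command when DRY_RUN=1, otherwise execute it (needs root).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

limit_clocks() {
    # Persistence mode keeps the driver initialized so the lock is not
    # lost when no client is attached to the GPU.
    run nvidia-smi -pm 1
    # Lock the GPU clocks to the range MIN_MHZ,MAX_MHZ.
    run nvidia-smi -lgc "${MIN_MHZ},${MAX_MHZ}"
}

# To undo the lock later: nvidia-smi -rgc
```

To apply it at boot, call limit_clocks as root from whatever startup mechanism you use (a systemd unit or root cron @reboot entry, for example). Setting DRY_RUN=1 first only prints the commands, which is handy for checking the values before touching the card.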