GPU failing, or PCIe riser?

Hello,

I have an RTX 6000 Ada GPU that is about 10 months old. It is connected to my motherboard through a Linkup PCIe 5.0 riser cable. It worked well for many months but has recently started failing at random. When it fails, the fan ramps up to maximum speed and nvidia-smi reports: “Unable to determine the device handle for GPU0000:42:00.0: Unknown Error”

I ran nvidia-bug-report.sh, and the log shows PCIe errors. I tried reseating the riser cable at both the GPU end and the motherboard end, but the error still happens. What could be the problem? Should I get a new riser cable? Is the GPU dying? What tests can I run to determine the root cause?
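In case it helps narrow things down, here is a minimal sketch of how the kernel log could be scanned for the two kinds of messages that seem relevant: NVIDIA Xid events and PCIe AER reports. The function name `scan_kernel_log` and both regular expressions are my own assumptions about the usual message formats, not something taken from the attached bug report, so they may need adjusting for a given kernel or driver version.

```python
import re

# Assumed Xid format, as the NVIDIA kernel driver usually prints it, e.g.
#   NVRM: Xid (PCI:0000:42:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")

# Assumed markers for PCIe Advanced Error Reporting lines in the kernel log.
AER_RE = re.compile(r"\bAER:|PCIe Bus Error", re.IGNORECASE)


def scan_kernel_log(text):
    """Return (xids, aer_lines) found in kernel log text.

    xids is a list of (pci_address, xid_number) tuples;
    aer_lines is a list of raw lines that look like AER reports.
    """
    xids = []
    aer_lines = []
    for line in text.splitlines():
        match = XID_RE.search(line)
        if match:
            xids.append((match.group(1), int(match.group(2))))
        elif AER_RE.search(line):
            aer_lines.append(line.strip())
    return xids, aer_lines
```

The text passed in could be the output of `sudo dmesg` or the decompressed nvidia-bug-report.log. Xid 79 ("GPU has fallen off the bus") together with AER errors on the port the riser is attached to would point at the link (riser, slot, or power) rather than proving the GPU itself is bad, so swapping the riser or testing the card directly in a slot would still be needed to tell them apart.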

Also, when this happens, is there any way to shut the GPU off through software? The GPU is in a remote server that I SSH into. When the GPU fails, I have to find someone on site to reboot the machine (sudo reboot doesn’t work).
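For context, these are the two software-only options I am aware of on Linux, written as an untested sketch. It assumes root access, the standard sysfs PCI interface, and the `/proc/sysrq-trigger` file. The helper names (`pci_remove_path`, `remove_and_rescan`, `force_reboot`) are hypothetical, and I do not know whether removing a device that is already in this failed state will succeed or simply hang.

```python
import os
from pathlib import Path

PCI_RESCAN = Path("/sys/bus/pci/rescan")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def pci_remove_path(bdf):
    """Build the sysfs 'remove' path for a PCI address such as 0000:42:00.0."""
    return Path("/sys/bus/pci/devices") / bdf / "remove"


def remove_and_rescan(bdf):
    """Ask the kernel to detach the device, then rescan the PCI bus (needs root)."""
    pci_remove_path(bdf).write_text("1")
    PCI_RESCAN.write_text("1")


def force_reboot():
    """Immediate reboot through SysRq 'b' (needs root).

    This skips the normal shutdown sequence, so flush filesystem
    buffers first and treat it as a last resort.
    """
    os.sync()
    SYSRQ_TRIGGER.write_text("b")
```

The idea would be to try `remove_and_rescan("0000:42:00.0")` first and fall back to `force_reboot()` only when a normal `sudo reboot` hangs. Because the SysRq reboot does not unmount filesystems or stop services, it carries some risk of data loss.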

Bug report:
nvidia-bug-report.log.gz (2.8 MB)