I have an RTX 6000 Ada GPU that is about 10 months old. It is connected to my motherboard through a Linkup PCIe 5 riser cable. It worked well for many months but has recently started failing at random. When it fails, the fan goes to maximum speed and nvidia-smi reports “Unable to determine the device handle for GPU0000:42:00.0: Unknown Error”.
I ran nvidia-bug-report.sh, and the log shows PCIe errors. I tried reseating the riser cable at both the GPU and the motherboard, but the error still occurs. What could be the problem? Should I get a new riser cable? Is the GPU dying? What tests can I run to determine the root cause?
Also, when this happens, is there any way to shut the GPU off through software? The GPU is in a remote server that I access over SSH. When the GPU goes down, I need to find someone else to reboot the machine (sudo reboot doesn’t work).