Unable to determine the device handle for GPU 0000:2E:00.0: Unknown Error

My Linux machine has two GPUs, and one of them keeps dropping out until I reboot. I can’t figure out what the issue is. As soon as I reboot it works fine, which leads me to believe it’s probably not hardware? I have a 1200 W PSU, so power shouldn’t be an issue.

Any ideas?

Attached is a bug report from it
nvidia-bug-report.log.gz (266.8 KB)

The 3060 is still on, so I doubt it’s a power issue, but it is reporting PCIe lane errors on all lanes. In parallel, you have bound both GPUs to the vfio-pci driver for passthrough; this should not be done. Furthermore, it might be an overheating issue, so please monitor temperatures.
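In case it helps, here is a minimal Python sketch for checking both points: which kernel driver is currently bound to the card, and what temperature nvidia-smi reports. It assumes the standard sysfs layout under /sys/bus/pci/devices and that nvidia-smi is on the PATH. The address 0000:2e:00.0 is the one from the error message, lowercased for sysfs, and the function names are made up for illustration.

```python
import os
import subprocess

SYSFS_PCI = "/sys/bus/pci/devices"  # standard Linux sysfs location


def bound_driver(bdf, sysfs_root=SYSFS_PCI):
    """Return the name of the kernel driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, bdf, "driver")
    if not os.path.islink(link):
        return None  # device missing or no driver bound
    return os.path.basename(os.readlink(link))


def parse_temperatures(csv_text):
    """Parse 'pci.bus_id, name, temperature.gpu' rows from nvidia-smi CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        bus_id, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        try:
            temp_c = int(temp.strip())
        except ValueError:
            temp_c = None  # e.g. a non-numeric placeholder for a failed GPU
        rows.append((bus_id.strip(), name.strip(), temp_c))
    return rows


def query_temperatures():
    """Run nvidia-smi once; return parsed rows, or None if the query fails."""
    cmd = ["nvidia-smi",
           "--query-gpu=pci.bus_id,name,temperature.gpu",
           "--format=csv,noheader,nounits"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_temperatures(result.stdout)


if __name__ == "__main__":
    # Address taken from the error message; adjust for your system.
    print("driver:", bound_driver("0000:2e:00.0"))
    print("temperatures:", query_temperatures())
```

If the driver comes back as vfio-pci on a card you expect the NVIDIA driver to use, that matches the binding problem described above. Once a GPU has dropped out, the nvidia-smi query may fail altogether, in which case the sketch just returns None.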

Thanks @generix! This GPU is connected via an ASUS RS200 ROG Strix Riser Cable. Do you think this could be the issue, or could something be wrong with the GPU itself? I believe I tried swapping them, and I don’t think I had any issues with another GPU on this riser… But it’s been a while, so I might need to reconfirm that to be 100% sure.

I strongly suspect the riser is the issue.

I was worried about that… I don’t have enough room to put both of these directly on the board, so I had to use a riser here. Do you think I should just try to swap that out first?

You might check whether you can lower the PCIe speed/gen in the BIOS as a workaround.
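To confirm that a BIOS change actually took effect, something like the following Python sketch could read the negotiated link speed and width from sysfs. It assumes the kernel exposes current_link_speed, max_link_speed, current_link_width and max_link_width for the device; the helper names are hypothetical, and the address is the one from the error message.

```python
import os

SYSFS_PCI = "/sys/bus/pci/devices"  # standard Linux sysfs location
LINK_ATTRS = ("current_link_speed", "max_link_speed",
              "current_link_width", "max_link_width")


def read_attr(bdf, attr, sysfs_root=SYSFS_PCI):
    """Read one sysfs attribute of a PCI device; None if it is unavailable."""
    try:
        with open(os.path.join(sysfs_root, bdf, attr)) as fh:
            return fh.read().strip()
    except OSError:
        return None


def link_status(bdf, sysfs_root=SYSFS_PCI):
    """Collect the PCIe link attributes for one device into a dict."""
    return {attr: read_attr(bdf, attr, sysfs_root) for attr in LINK_ATTRS}


def speed_gts(text):
    """Extract the GT/s figure from a string such as '8.0 GT/s PCIe'."""
    try:
        return float(text.split()[0])
    except (AttributeError, IndexError, ValueError):
        return None  # missing or non-numeric value


if __name__ == "__main__":
    status = link_status("0000:2e:00.0")  # adjust for your system
    for attr, value in status.items():
        print(f"{attr}: {value}")
    print("current GT/s:", speed_gts(status["current_link_speed"]))
```

For reference, 2.5 GT/s corresponds to Gen1, 5 GT/s to Gen2, 8 GT/s to Gen3 and 16 GT/s to Gen4. GPUs often drop the link speed at idle to save power, so a current speed below the maximum is not by itself a sign of trouble; check it while the card is under load.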