Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

I am using a remote Linux server with one RTX 3060. The computation hangs after an hour or two with the error above.

The system is 11400F + Gigabyte Z490 Ultra (PCIe 4.0) + RTX 3060. I’ve tried reinstalling the system (18.04 & 20.04) and reinstalling the driver (several versions), and neither helped.

I generated two logs: the first while the GPU was still working, the second after the GPU was lost.

Thanks in advance : -)

nvidia-bug-report.gpu_not_lost.log (808.9 KB)
nvidia-bug-report.gpu_lost.log (809.5 KB)

You’re running into XID 79, possibly due to either overheating or PSU problems.
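To confirm which Xid codes the driver reported, the kernel log can be searched directly. Below is a minimal sketch; `xid_summary` is a hypothetical helper name, and the pattern assumes the usual `NVRM: Xid (PCI:...): <code>, ...` message layout, which may differ between driver versions.

```shell
# Summarise NVIDIA Xid codes found in kernel log text read on stdin.
# Prints one line per distinct code in the form "<count> <code>".
# Assumes messages look like: NVRM: Xid (PCI:0000:02:00): 79, ...
xid_summary() {
  grep -oE 'NVRM: Xid \([^)]*\): [0-9]+' \
    | awk '{ print $NF }' \
    | sort -n | uniq -c | awk '{ print $1, $2 }'
}

# Typical use on the affected machine (reading the kernel log usually needs root):
#   sudo dmesg | xid_summary
#   sudo journalctl -k -b -1 | xid_summary   # previous boot, if the journal is persistent
```

Since the GPU only recovers after a reboot, the previous-boot variant is likely the more useful one, provided persistent journaling is enabled.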

Thanks, I’ll replace the PSU first.

I set the power limit to 100 W with nvidia-smi. The temperature stays below 50 °C. The PSU I use is a 600 W Gold unit. Yet the GPU is still lost within ~1 hour…

Any suggestions would be appreciated…

Setting a power-limit doesn’t prevent power spikes due to clock boost. To check if this is a power problem, please try limiting clocks:
sudo nvidia-smi -lgc 300,1500
Furthermore, please make sure the Xserver is disabled and nvidia-persistenced is properly set up and enabled.
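A rough way to check both points from a shell is sketched below. `all_persistent` is a hypothetical helper name, and the systemd unit name `nvidia-persistenced` is an assumption based on how common distro packages ship the daemon.

```shell
# Succeeds only if every GPU line read on stdin reports persistence mode "Enabled"
# (and at least one line was read).
all_persistent() {
  awk '{ n++ } $1 != "Enabled" { bad = 1 } END { exit ((n == 0 || bad) ? 1 : 0) }'
}

# Typical use on the affected machine:
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | all_persistent \
#     && echo "persistence mode is on"
#   systemctl is-active nvidia-persistenced   # unit name is an assumption
#   systemctl get-default                     # multi-user.target: no graphical session at boot
```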

I set the power limit and clock with:

sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc 300,1500
sudo nvidia-smi -pl 100

After 3 hours, the GPU was lost again… Any suggestions?
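It may be worth verifying that the clock lock is actually in effect while the job runs, since the min,max pair has to be passed to `-lgc` as a single argument without a space. The sketch below flags any sampled graphics clock above the cap; `clocks_over_cap` is a hypothetical helper name and the 1500 MHz cap is simply the value used in this thread.

```shell
# Print every sampled graphics clock (MHz, one value per line on stdin)
# that is above the cap given as the first argument.
clocks_over_cap() {
  awk -v cap="$1" '$1 + 0 > cap + 0 { print $1 }'
}

# Typical use on the affected machine (samples every 5 seconds until interrupted):
#   nvidia-smi --query-gpu=clocks.gr --format=csv,noheader,nounits -l 5 | clocks_over_cap 1500
# Keeping a running log also helps to see the last state before the GPU is lost:
#   nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.gr --format=csv -l 5 >> gpu.log
```

If the first command prints nothing during a run, the cap is holding, which makes a boost-related power spike a less likely explanation.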

Previously the system ran smoothly; the issue appeared after I changed the motherboard and CPU (from 3700X + B450 to 11400F + Z490, for PCIe 4.0). Could the motherboard be the cause?

Thanks a lot!

Now that you mention it, the log contains:

nvidia 0000:02:00.0: AER: can't recover (no error_detected callback)

So this looks like a PCIe problem. Please check for a BIOS update and try reseating the card in its slot. If that doesn’t help, try setting PCIe Gen 3 in the BIOS.

It might also be a cooling problem on the mainboard; IIRC, many PCIe Gen 4 mainboards require active cooling.
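To see what link the card has actually negotiated, before and after changing the BIOS setting, something like the sketch below may help. `link_status` is a hypothetical helper name, the parsing assumes the `LnkSta:` line layout printed by `lspci -vv`, and `02:00.0` is the bus address taken from the error message in this thread.

```shell
# Extract the negotiated PCIe link speed and width from `lspci -vv` text on stdin.
# Prints e.g. "16GT/s x16" (16GT/s corresponds to Gen 4, 8GT/s to Gen 3).
link_status() {
  sed -nE 's/.*LnkSta:[[:space:]]*Speed ([0-9.]+GT\/s)[^,]*, Width (x[0-9]+).*/\1 \2/p'
}

# Typical use on the affected machine (root is needed to read link capabilities):
#   sudo lspci -vv -s 02:00.0 | link_status
# The driver reports the same information:
#   nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
# AER messages such as the one quoted above can be listed with:
#   sudo dmesg | grep -i aer
```

The link speed can drop while the GPU is idle, so it is best read while a job is running.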

Issue solved. It turned out to be a motherboard problem.
On MSI Z490 motherboards, an 11th-gen CPU causes XMP to fail.
On Gigabyte Z490 motherboards, an 11th-gen CPU with a GPU can only operate in PCIe 4.0 x8 mode; x16 causes a reboot every hour.
ASUS Z490 motherboards simply don’t support PCIe 4.0.