I am using a remote Linux server with one RTX 3060. The computation hangs after an hour or two with the following error:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU
The system is an 11400F + Gigabyte Z490 Ultra (PCIe 4.0) + RTX 3060. I’ve tried reinstalling the OS (18.04 and 20.04) and reinstalling the driver (several versions), but neither helped.
As for logs, I generated two: the first while the GPU was still working, the second after the GPU was lost.
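For anyone trying to narrow this down, here is a minimal sketch of one way to record the GPU state right up to the moment it is lost. It assumes `nvidia-smi` is on the PATH. The file name `gpu_monitor.log`, the 5-second interval, and the function names are illustrative only and are not part of the two logs mentioned above.

```python
import subprocess
import time

# Fields passed to nvidia-smi's --query-gpu option.
QUERY = "timestamp,temperature.gpu,power.draw,utilization.gpu"


def gpu_is_lost(output: str) -> bool:
    """Return True if the text contains the driver's 'GPU is lost' message."""
    return "GPU is lost" in output


def poll_once() -> str:
    """Run nvidia-smi once and return its stdout and stderr as one string."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    return (result.stdout + result.stderr).strip()


def monitor(log_path: str = "gpu_monitor.log", interval_s: float = 5.0) -> None:
    """Append one reading per interval; stop once the GPU is reported lost."""
    with open(log_path, "a") as log:
        while True:
            line = poll_once()
            log.write(line + "\n")
            log.flush()  # keep the last reading on disk even if the machine hangs
            if gpu_is_lost(line):
                break
            time.sleep(interval_s)
```

Calling `monitor()` in a separate shell before starting the computation should leave the last temperature and power readings before the failure at the end of the log file.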
Thanks in advance :-)