"No GPU found" after trying to fix a performance issue

nvidia-bug-report.log.gz (142.2 KB)

Hello friends. We are running Fortran code on the GPU. However, we noticed that the GPU starts to slow down after about an hour of run time. The GPU utilization falls to around 5% (occasionally going up to 30%).
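
In case it helps with diagnosing the slowdown, here is a minimal sketch of how the utilization drop could be logged over time. It assumes nvidia-smi is on the PATH and Python 3.7+ is available. The function names (parse_samples, sample_gpus, low_utilization, log_forever), the 5-second interval, and the 30% threshold are illustrative choices, not part of our actual setup.

```python
import csv
import io
import subprocess
import time

# Field names accepted by `nvidia-smi --query-gpu`
# (the full list is printed by `nvidia-smi --help-query-gpu`).
FIELDS = ["timestamp", "utilization.gpu", "clocks.sm", "temperature.gpu", "power.draw"]


def parse_samples(csv_text, fields=FIELDS):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(fields, (cell.strip() for cell in row))) for row in reader if row]


def sample_gpus(fields=FIELDS):
    """Take one sample from every GPU the driver can currently see."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_samples(result.stdout, fields)


def low_utilization(samples, threshold=30):
    """Return the samples whose GPU utilization (in percent) is below threshold."""
    flagged = []
    for sample in samples:
        try:
            utilization = int(sample["utilization.gpu"])
        except ValueError:
            # nvidia-smi reports values such as "[N/A]" when a field is unavailable.
            continue
        if utilization < threshold:
            flagged.append(sample)
    return flagged


def log_forever(interval_s=5, threshold=30):
    """Print every sample and mark the ones below the utilization threshold."""
    while True:
        samples = sample_gpus()
        flagged = low_utilization(samples, threshold)
        for sample in samples:
            marker = "LOW" if sample in flagged else "ok"
            print(marker, sample)
        time.sleep(interval_s)
```

Calling log_forever() alongside the Fortran job and redirecting the output to a file would show whether the SM clock, temperature, or power draw changes at the moment utilization falls, which may help separate thermal or power throttling from a problem in the code itself.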

We tried many things to fix this problem, but nothing worked. We then tried changing the “Uncor. ECC” setting from “off” to “on” (based on a discussion on the NVIDIA forum). This required a reboot, as expected. After the reboot, however, the GPU driver did not load properly: both nvidia-smi and nvtop now return “no GPU found”.

I’m attaching the bug report in the hope that one of you kind souls can shed some light on my issue. Any help is really appreciated!

No kind souls who can help me? Please help me, Obi-Wan Kenobi, you’re my only hope.