RTX 2080 Ti - ERR - Xid 61

Dear all,

I use a ZOTAC RTX 2080 Ti GPU to train deep learning models for my PhD research. Since yesterday, I have been having problems that started during inference with a ResNet model in PyTorch. After a long wait during which the code did not execute, nvidia-smi on Linux showed ERR for the fan and power readings. In the nvidia-smi log I also got Xid errors 61 and 45. I rebooted and reinitialized the kernel module, but now nvidia-smi shows “No devices found” even though the GPU still appears in the lspci list.
Is the GPU dead? Could it be a power issue?

Thank you for all your help

Alessandro

Hey there, I’m having some trouble understanding your problem.

  • Could you share the log? Use sudo nvidia-bug-report.sh to generate it. It will be helpful to others and conforms to the community guidelines.
  • Could you share the Linux kernel version (get it using uname -a) and the NVIDIA driver version (it should be in your previous, working nvidia-smi output or in the log generated above)?
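If it helps, here is a minimal Python sketch for collecting both versions in one go. It assumes the NVIDIA kernel module exposes its version at /proc/driver/nvidia/version while it is loaded; the function name collect_versions is just a name made up for this example. If the module is not loaded (as in the “No devices found” state), the driver entry simply stays empty.

```python
import platform
from pathlib import Path

# Assumption: the NVIDIA kernel module publishes its version here while loaded.
NVIDIA_VERSION_FILE = Path("/proc/driver/nvidia/version")


def collect_versions(version_file=NVIDIA_VERSION_FILE):
    """Return the kernel release and the first line of the driver version file.

    The "driver" entry is None when the file is missing or unreadable,
    for example when the nvidia module is not loaded.
    """
    info = {"kernel": platform.release(), "driver": None}
    try:
        lines = version_file.read_text().splitlines()
    except OSError:
        lines = []
    if lines:
        info["driver"] = lines[0].strip()
    return info
```

Calling print(collect_versions()) and pasting the result into a reply should be enough for this question.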

From my initial understanding, we can infer the following:

  1. According to the XID errors document, XID 45 is probably caused by an abnormal application exit. XID 61 is an internal microcontroller issue (breakpoint / warning).
  2. The likely issue is that the driver isn’t being initialized correctly.
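To see which XID codes show up and how often, something like the following Python sketch could be run over saved kernel log text (for example the output of dmesg). It assumes the usual "NVRM: Xid (PCI:...): <code>, ..." line shape; the function name count_xid_events and the sample lines are made up for illustration and are not taken from any attached report.

```python
import re
from collections import Counter

# Matches kernel log lines such as
#   NVRM: Xid (PCI:0000:01:00): 61, ...
# The "PCI:" prefix is treated as optional (assumption: some drivers omit it).
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")


def count_xid_events(log_text):
    """Return a Counter mapping (pci_address, xid_code) to occurrences."""
    events = Counter()
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events[(match.group(1), int(match.group(2)))] += 1
    return events


# Made-up sample lines, only to show the expected input shape.
SAMPLE_LOG = """\
[ 1234.567890] NVRM: Xid (PCI:0000:01:00): 61, pid=2345, 0cec(3098) 00000000 00000000
[ 1240.000001] NVRM: Xid (PCI:0000:01:00): 45, pid=2345, Ch 00000010
[ 1240.000002] usb 1-1: new high-speed USB device number 4 using xhci_hcd
"""
```

Running count_xid_events(SAMPLE_LOG) counts one XID 61 and one XID 45 event on the same device and ignores the unrelated USB line. Seeing which code appears first in a real log can hint at whether XID 45 is just cleanup following the earlier fault.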

Retry driver installation

If you’re using Ubuntu, the Additional Drivers application under Software is an appropriate place to find and install compatible drivers. Try this:

  1. Reboot
  2. Install the xserver-xorg-video-nouveau drivers through the app
  3. Reboot. You will not be able to use NVIDIA drivers yet.
  4. Install the appropriate NVIDIA drivers. You can try the different versions in the menu: after each install, reboot, run nvidia-smi and your app, and see which ones work.

Let’s see if this gets us closer to solving the issue.

Dear Avneesh,

I don’t actually run a graphical server on the GPU, only CUDA. This is the list of things I tried:

  • Reboot
  • NVIDIA drivers 450, 455, 460, 465
  • rmmod and modprobe of nvidia_uvm

I was actually able to run a training job on my GPU for a few minutes, but after 10 minutes of idle it got stuck again with ERR in power and fan, and the GPU is not loaded by the OS.

Here is my nvidia-bug-report log, attached.

Thank you for your efforts!
nvidia-bug-report.log.gz (271.1 KB)

Sorry for the second message, but this is the report taken while nvidia-smi is showing ERR!

nvidia-bug-report.log.gz (395.5 KB)