We have recently purchased two workstations and both of them have 4 x RTX 2080 Ti GPUs attached.
We started experiencing an issue from day one. After we start training our models with Kaldi, we eventually get the following error, and nvidia-smi marks one of the GPUs as Err!
We have the same issue on both workstations. We tried Ubuntu versions 16, 18, and 19, as well as CentOS 7.6. We even upgraded the kernel to 5.1.
We have installed different versions of CUDA and different GPU drivers. We ran GPU Burn and upgraded our BIOS to the latest version. Nothing improved the situation.
We get this error from all GPUs, but only one GPU at a time. It is reproducible: sometimes it takes a few hours, sometimes over 24 hours, but we always hit the error eventually, and we have to restart the workstation to recover.
We installed headless Linux without an X server, but it didn't help either.
From experience, the 2080 Ti's are sensitive to heat. Does this also occur if you're running only two of them with free space in between? Also monitor temperatures using nvidia-smi.
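To log per-GPU temperatures over time, you can poll nvidia-smi and parse its CSV output; here's a minimal sketch (the nvidia-smi query flags are standard, but the parsing helper and sample output are illustrative, not from this thread):

```python
# Poll with:  nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader -l 5
# and feed the CSV lines into a helper like this (hypothetical helper name):

def parse_temps(csv_text):
    """Map GPU index -> temperature (deg C) from nvidia-smi CSV output."""
    temps = {}
    for line in csv_text.strip().splitlines():
        idx, temp = line.split(",")
        temps[int(idx)] = int(temp.strip())
    return temps

# Sample output for a 4-GPU box (made-up numbers for illustration):
sample = "0, 84\n1, 79\n2, 91\n3, 76"
temps = parse_temps(sample)
hottest = max(temps, key=temps.get)
print(f"hottest GPU: {hottest} at {temps[hottest]} C")  # hottest GPU: 2 at 91 C
```

If one card consistently runs much hotter than the others shortly before the Err state appears, that would point toward the thermal explanation.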
We haven't tried this, but we tested with GPU Burn multiple times and couldn't reproduce the issue there. We'll try this too. Thanks.