GPUs give ERR! with NVRM: Xid (PCI:0000:b5:00): 61

Hello,

We have recently purchased two workstations and both of them have 4 x RTX 2080 Ti GPUs attached.

We have had an issue since the first day. At some point after we start training our models with Kaldi, we get the error in the title, and nvidia-smi reports one of the GPUs as ERR!

We have the same issue on both workstations. We tried Ubuntu 16, 18, and 19 as well as CentOS 7.6. We even upgraded the kernel to 5.1 once.

We have installed different versions of CUDA and the GPU drivers. We ran GPU burn and upgraded our BIOS to the latest version. Nothing improved the situation.

We get this error from all of the GPUs, but only one GPU at a time. It is reproducible: sometimes it takes a few hours and sometimes over 24 hours, but we always get it eventually, and we have to restart the workstation to recover.

nvidia-bug-report.log.gz (1.96 MB)
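
Since the error moves between GPUs, it may help to tally which PCI address reports which Xid code over time. Below is a minimal sketch, assuming the kernel log lines look like the one in the title (`NVRM: Xid (PCI:0000:b5:00): 61`); the function name `count_xid_events` and the pipe-from-dmesg usage are illustrative, not part of any NVIDIA tool.

```python
import re
import sys
from collections import Counter

# Assumed line shape, taken from the title: "NVRM: Xid (PCI:0000:b5:00): 61"
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def count_xid_events(log_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    events = Counter()
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events[(match.group(1), int(match.group(2)))] += 1
    return events


if __name__ == "__main__":
    # Hypothetical usage: dmesg | python3 xid_count.py
    for (pci, xid), n in sorted(count_xid_events(sys.stdin.read()).items()):
        print(f"{pci}  Xid {xid}  count={n}")
```

If one PCI address dominates the tally, that points at a particular card or slot; an even spread across all four would suggest something shared, such as power or cooling.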

There are several problems that could cause this:

  1. Please set nvidia-persistenced to start on boot and keep it running continuously.
  2. running cuda and driving an Xserver: https://devtalk.nvidia.com/default/topic/1043126/linux/xid-8-in-various-cuda-deep-learning-applications-for-nvidia-gtx-1080-ti/post/5291377/#5291377
  3. From experience, the 2080 Ti’s are sensitive to heat. Does this also occur if you’re only running with two of them with free space in between? Also monitor temperatures using nvidia-smi.
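
For the temperature monitoring in point 3, here is a rough sketch that polls nvidia-smi and logs per-GPU temperatures with a timestamp, so the last readings before a failure can be checked afterwards. It uses the `--query-gpu=index,temperature.gpu` and `--format=csv,noheader,nounits` options of nvidia-smi; the 5-second interval, the 85 °C warning threshold, and the function names are assumptions chosen for illustration, not recommended values.

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: temp_celsius or None}."""
    temps = {}
    for line in csv_text.splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        index, temp = (field.strip() for field in line.split(",", 1))
        # A failed GPU may report a non-numeric value; record it as None.
        temps[int(index)] = int(temp) if temp.isdigit() else None
    return temps


def log_temperatures(interval_s=5.0, threshold_c=85):
    """Poll nvidia-smi until interrupted (interval and threshold are assumed values)."""
    while True:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        temps = parse_temperatures(out)
        print(time.strftime("%Y-%m-%d %H:%M:%S"), temps, flush=True)
        hot = [i for i, t in temps.items() if t is not None and t >= threshold_c]
        if hot:
            print("  warning: at or above threshold on GPU(s)", hot, flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    log_temperatures()
```

Redirecting the output to a file on disk keeps the history available after the reboot that the error requires.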

Thanks for the quick answer.

I forgot to mention that we had already tried the first two.

  1. Please set nvidia-persistenced to start on boot and keep it running continuously.

We did this once and nothing has changed.

  2. running cuda and driving an Xserver: https://devtalk.nvidia.com/default/topic/1043126/linux/xid-8-in-various-cuda-deep-learning-applications-for-nvidia-gtx-1080-ti/post/5291377/#5291377

We installed a headless Linux without an X server, but it didn’t help either.

  3. From experience, the 2080 Ti’s are sensitive to heat. Does this also occur if you’re only running with two of them with free space in between? Also monitor temperatures using nvidia-smi.

We haven’t tried this yet, but we tested with GPU burn multiple times and couldn’t reproduce the issue there. We’ll try this too. Thanks.