RmInitAdapter failed (repeatedly) for one of two RTX 2080 Ti cards on Ubuntu 18.04

I am setting up a system with two RTX 2080 Ti cards and an NVLink bridge between them. I installed the NVIDIA drivers and CUDA successfully, but although both cards show up in the BIOS, the system only sees one. Here are the diagnostic outputs that people usually ask for:

$ nvidia-smi
Thu Dec 12 14:36:55 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   38C    P8    23W / 250W |    464MiB / 11016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1419      G   /usr/lib/xorg/Xorg                            40MiB |
|    0      1574      G   /usr/bin/gnome-shell                          57MiB |
|    0      1767      G   /usr/lib/xorg/Xorg                           240MiB |
|    0      1904      G   /usr/bin/gnome-shell                         122MiB |
+-----------------------------------------------------------------------------+
$ dmesg | grep NVRM
[    3.699280] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.44  Sun Dec  8 03:38:56 UTC 2019
[    4.446869] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[    4.446891] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   11.010333] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[   11.010363] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   16.047532] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   16.047561] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   37.753016] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   37.753043] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   41.949244] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   41.949262] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   52.873765] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   52.873791] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   56.998022] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   56.998045] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   61.102604] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   61.102656] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   65.246270] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   65.246302] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   69.354837] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   69.354889] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   73.462545] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   73.462576] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[   77.570629] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[   77.570663] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[  162.235977] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  162.236010] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[  169.828936] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  169.828980] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
$ lspci | grep VGA
17:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)

Is this a driver or hardware issue? How can I fix it?

Check the power connections.
Reseat the GPU.
Swap the two GPUs, noting the serial numbers and PCIe bus IDs, to see whether the problem follows the GPU or follows the slot.
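
To make the before-and-after comparison for the swap easier, the NVRM lines from dmesg can be tallied per PCI address and error code. The following is only a minimal sketch, not an NVIDIA tool: the function name `summarize_nvrm_failures` and the regular expression are assumptions based on the log format shown in the question above.

```python
import re
from collections import Counter

# Matches lines such as:
# [    4.446869] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
# The pattern is an assumption derived from the dmesg output posted in this thread.
PATTERN = re.compile(
    r"NVRM: GPU (?P<bus>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! \((?P<code>[^)]+)\)"
)


def summarize_nvrm_failures(dmesg_text):
    """Count RmInitAdapter failures per (PCI address, error code) pair."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match.group("bus"), match.group("code"))] += 1
    return counts
```

Pass it the dmesg output as a string, for example from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, once before and once after the swap. If the failing address moves with the card, that points at the GPU; if it stays at the same address, the slot or its power feed is the more likely suspect.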

Swapping the 2 GPUs leads to this pattern on the monitor:

I suspect this means the GPU I just moved to the top is dead. Any other possibilities?

I had the exact same issue with two RTX 2080 Ti cards, the 440.33.1 driver, and CUDA 10.2.
Removing the NVLink bridge resolved the issue for me.

Unfortunately, removing the NVLink bridge did not change things for me. I am currently arranging to get the card repaired.

Hey,

Were you able to resolve this? Was it a GPU hardware failure or something else?
I’m facing the same issue with only one 2080 Ti, so in my case it is not related to NVLink.

Hi,

Yes, if you see the "space invaders" pattern shown in the imgur link, it’s a hardware failure. See, for example, the TechSpot article "Nvidia addresses failing GeForce RTX 2080 Ti cards".

I got the card replaced.