I am setting up a system with two RTX 2080 Ti cards connected by an NVLink bridge. The NVIDIA driver and CUDA installed successfully, and both cards show up in the BIOS, but the operating system only sees one of them. Here is the diagnostic output that is usually requested for this kind of problem:
$ nvidia-smi
Thu Dec 12 14:36:55 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   38C    P8    23W / 250W |    464MiB / 11016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1419      G   /usr/lib/xorg/Xorg                            40MiB |
|    0      1574      G   /usr/bin/gnome-shell                          57MiB |
|    0      1767      G   /usr/lib/xorg/Xorg                           240MiB |
|    0      1904      G   /usr/bin/gnome-shell                         122MiB |
+-----------------------------------------------------------------------------+
$ dmesg | grep NVRM
[ 3.699280] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.44 Sun Dec 8 03:38:56 UTC 2019
[ 4.446869] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[ 4.446891] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 11.010333] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[ 11.010363] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 16.047532] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 16.047561] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 37.753016] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 37.753043] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 41.949244] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 41.949262] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 52.873765] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 52.873791] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 56.998022] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 56.998045] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 61.102604] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 61.102656] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 65.246270] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 65.246302] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 69.354837] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 69.354889] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 73.462545] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 73.462576] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 77.570629] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 77.570663] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 162.235977] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 162.236010] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
[ 169.828936] NVRM: GPU 0000:17:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 169.828980] NVRM: GPU 0000:17:00.0: rm_init_adapter failed, device minor number 0
$ lspci | grep VGA
17:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
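The dmesg output above is mostly the same pair of lines repeated, so a condensed view may be easier to read. The following is only a sketch: `summarize_nvrm_failures` is a hypothetical helper name of my own, and it assumes the log lines have exactly the `NVRM: GPU <address>: RmInitAdapter failed! (<code>)` shape shown above. It reads kernel log text on stdin and prints one line per distinct PCI address and error code, most frequent first.

```shell
# Hypothetical helper: count RmInitAdapter failures per PCI address and code.
# Reads kernel log text on stdin; prints "<count> <pci-address> <error-code>".
summarize_nvrm_failures() {
    # Keep only the failure lines, reduce each to "address code",
    # then count identical pairs and list the most frequent first.
    grep 'RmInitAdapter failed' |
        sed -E 's/.*GPU ([0-9a-fA-F:.]+): RmInitAdapter failed! \(([^)]*)\).*/\1 \2/' |
        sort | uniq -c | sort -rn
}

# Usage (may need root, depending on kernel.dmesg_restrict):
#   dmesg | summarize_nvrm_failures
```

Applied to the log above, this would group everything under the single address 0000:17:00.0, with twelve occurrences of 0x24:0x65:1185 and two of 0x26:0xffff:1227. The card at 65:00.0, the one nvidia-smi does list, produces no failures at all.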
Is this a driver or hardware issue? How can I fix it?