Nvidia command cannot see second GPU

Hi,

I have a strange problem with the second GPU attached to our workstation. When it was first installed, nvidia-smi could see both GPU cards and we could run PyTorch programs to train models. However, after pausing and relaunching the Python programs several times, nvidia-smi displayed ERROR for the second GPU card, and subsequently the information for the second GPU disappeared altogether.

We ran lspci | grep VGA, which gives:

03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3b:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
d8:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

which shows two Quadro RTX 8000 cards.

But with nvidia-smi, we got the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:D8:00.0 Off |                  Off |
| 33%   24C    P8     5W / 260W |      1MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We observed that nvidia-smi was much slower than usual after the second card could not be detected.
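
For anyone who wants to repeat the comparison above without reading the two listings by eye, here is a minimal sketch that reports NVIDIA VGA devices which lspci enumerates but the driver does not list. The function names are hypothetical. It assumes lspci prints short addresses without a PCI domain (as in the output above) and that nvidia-smi is queried with `--query-gpu=pci.bus_id --format=csv,noheader`.

```python
import re
import subprocess

# Matches lines such as:
# "3b:00.0 VGA compatible controller: NVIDIA Corporation TU102GL ..."
NVIDIA_VGA = re.compile(
    r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])\s+"
    r"VGA compatible controller: NVIDIA",
    re.MULTILINE,
)


def lspci_nvidia_gpus(lspci_text):
    """Return bus addresses (e.g. '3b:00.0') of NVIDIA VGA devices in lspci output."""
    return {addr.lower() for addr in NVIDIA_VGA.findall(lspci_text)}


def smi_gpus(smi_text):
    """Return bus addresses from 'nvidia-smi --query-gpu=pci.bus_id' CSV output."""
    found = set()
    for line in smi_text.splitlines():
        line = line.strip()
        if ":" in line:
            # Drop the leading PCI domain ('00000000:') to match lspci's short form.
            found.add(line.split(":", 1)[1].lower())
    return found


def missing_from_driver(lspci_text, smi_text):
    """Addresses that lspci enumerates but nvidia-smi does not report."""
    return sorted(lspci_nvidia_gpus(lspci_text) - smi_gpus(smi_text))


def run_live_check():
    """Run both tools on the local machine (requires lspci and nvidia-smi)."""
    lspci_out = subprocess.run(
        ["lspci"], capture_output=True, text=True
    ).stdout
    smi_out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout
    return missing_from_driver(lspci_out, smi_out)
```

Calling `run_live_check()` on the affected machine should return the addresses of any cards the driver has dropped. It only detects the mismatch and says nothing about the cause.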

Please see the attached bug report: nvidia-bug-report.log.gz (505.4 KB)

We also tried:

  • Re-installing the system and following the installation instructions for CUDA Toolkit 11.7.
  • Unplugging and re-installing the GPU cards.

Neither of these worked.

Please help us!

The bug report log file indicates that the GPU driver is unable to start the GPU at PCI address 3b:00.0.

lspci indicates the device was configured by the BIOS/OS:

/usr/bin/lspci -d "10de:*" -v -xxx

3b:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Dell TU102GL [Quadro RTX 6000/8000]
	Flags: bus master, fast devsel, latency 0, IRQ 123, NUMA node 0
	Memory at ab000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 38bfe0000000 (64-bit, prefetchable) [size=256M]
	Memory at 38bff0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at 6000 [size=128]
	Expansion ROM at ac080000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
00: de 10 30 1e 07 00 10 00 a1 00 00 03 00 00 80 00
10: 00 00 00 ab 0c 00 00 e0 bf 38 00 00 0c 00 00 f0
20: bf 38 00 00 01 60 00 00 00 00 00 00 28 10 9e 12
30: 00 00 00 00 60 00 00 00 00 00 00 00 ff 01 00 00
40: 28 10 9e 12 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 c3 c9 08 00 00 00 05 78 80 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 10 00 12 00 e1 8d 2c 11
80: 1e 21 10 00 03 3d 45 00 40 01 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 03 00 1f 00 00 00 00 00
b0: 00 00 00 00 09 00 14 01 00 00 13 0b 80 00 00 00
c0: e6 7b 7e c8 00 00 00 00 11 00 05 00 00 00 b9 00
d0: 00 00 ba 00 00 00 00 00 00 00 00 00 28 10 9e 12
e0: 28 10 9e 12 03 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

3b:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
	Subsystem: Dell TU102 High Definition Audio Controller
	Flags: bus master, fast devsel, latency 0, IRQ 121, NUMA node 0
	Memory at ac050000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
00: de 10 f7 10 06 00 10 00 a1 00 03 04 00 00 80 00
10: 00 00 05 ac 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 9e 12
30: 00 00 00 00 60 00 00 00 00 00 00 00 ff 02 00 00
40: 28 10 9e 12 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 03 00 08 00 00 00 05 78 80 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 10 00 02 00 e1 8d 2c 01
80: 1e 29 09 00 03 3d 45 00 43 01 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 00 00 01 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

3b:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
	Subsystem: Dell TU102 USB 3.1 Host Controller
	Flags: fast devsel, IRQ 55, NUMA node 0
	Memory at ac000000 (64-bit, prefetchable) [size=256K]
	Memory at ac040000 (64-bit, prefetchable) [size=64K]
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci
00: de 10 d6 1a 02 04 10 00 a1 30 03 0c 20 00 80 00
10: 0c 00 00 ac 00 00 00 00 00 00 00 00 0c 00 04 ac
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 9e 12
30: 00 00 00 00 68 00 00 00 00 00 00 00 ff 03 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 31 60 00 00 00 00 00 00 05 78 81 00 b8 00 e0 fe
70: 00 00 00 00 00 00 00 00 10 b4 02 00 e0 8d 2c 01
80: 1e 29 19 00 03 3d 45 00 40 00 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 00 00 01 00 00 00 00 00
b0: 00 00 00 00 01 00 43 c8 0b 01 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

3b:00.3 Serial bus controller: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
	Subsystem: Dell TU102 USB Type-C UCSI Controller
	Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 0
	Memory at ac054000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: nvidia-gpu
	Kernel modules: i2c_nvidia_gpu
00: de 10 d7 1a 06 04 10 00 a1 00 80 0c 00 00 80 00
10: 00 40 05 ac 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 9e 12
30: 00 00 00 00 68 00 00 00 00 00 00 00 ff 04 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 05 78 81 00 78 00 e0 fe
70: 00 00 00 00 00 00 00 00 10 b4 02 00 e0 8d 2c 01
80: 1e 29 19 00 03 3d 45 00 40 00 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 00 00 01 00 00 00 00 00
b0: 00 00 00 00 01 00 43 c8 0b 01 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

d8:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Quadro RTX 8000
	Flags: bus master, fast devsel, latency 0, IRQ 126, NUMA node 1
	Memory at ef000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 39ffe0000000 (64-bit, prefetchable) [size=256M]
	Memory at 39fff0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at f0080000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
00: de 10 30 1e 07 04 10 00 a1 00 00 03 00 00 80 00
10: 00 00 00 ef 0c 00 00 e0 ff 39 00 00 0c 00 00 f0
20: ff 39 00 00 01 e0 00 00 00 00 00 00 de 10 9e 12
30: 00 00 00 00 60 00 00 00 00 00 00 00 ff 01 00 00
40: de 10 9e 12 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 c3 c9 08 00 00 00 05 78 81 00 18 01 e0 fe
70: 00 00 00 00 00 00 00 00 10 00 12 00 e1 8d 2c 11
80: 1e 21 10 00 03 3d 46 00 40 01 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 03 00 1f 00 00 00 00 00
b0: 00 00 00 00 09 00 14 01 01 00 13 0b 80 00 00 00
c0: 11 8e 3e d9 00 00 00 00 11 00 05 00 00 00 b9 00
d0: 00 00 ba 00 00 00 00 00 00 00 00 00 de 10 9e 12
e0: de 10 9e 12 03 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

d8:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
	Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller
	Flags: bus master, fast devsel, latency 0, IRQ 122, NUMA node 1
	Memory at f0050000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
00: de 10 f7 10 06 00 10 00 a1 00 03 04 00 00 80 00
10: 00 00 05 f0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 de 10 9e 12
30: 00 00 00 00 60 00 00 00 00 00 00 00 ff 02 00 00
40: de 10 9e 12 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 03 00 08 00 00 00 05 78 80 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 10 00 02 00 e1 8d 2c 01
80: 1e 29 09 00 03 3d 45 00 43 01 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 00 00 01 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

d8:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
	Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller
	Flags: fast devsel, IRQ 57, NUMA node 1
	Memory at f0000000 (64-bit, prefetchable) [size=256K]
	Memory at f0040000 (64-bit, prefetchable) [size=64K]
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci
00: de 10 d6 1a 02 04 10 00 a1 30 03 0c 20 00 80 00
10: 0c 00 00 f0 00 00 00 00 00 00 00 00 0c 00 04 f0
20: 00 00 00 00 00 00 00 00 00 00 00 00 de 10 9e 12
30: 00 00 00 00 68 00 00 00 00 00 00 00 ff 03 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 31 60 00 00 00 00 00 00 05 78 81 00 b8 00 e0 fe
70: 00 00 00 00 00 00 00 00 10 b4 02 00 e0 8d 2c 01
80: 1e 29 19 00 03 3d 46 00 40 00 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 00 00 01 00 00 00 00 00
b0: 00 00 00 00 01 00 43 c8 0b 01 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

d8:00.3 Serial bus controller: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
	Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller
	Flags: bus master, fast devsel, latency 0, IRQ 53, NUMA node 1
	Memory at f0054000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: nvidia-gpu
	Kernel modules: i2c_nvidia_gpu
00: de 10 d7 1a 06 04 10 00 a1 00 80 0c 00 00 80 00
10: 00 40 05 f0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 de 10 9e 12
30: 00 00 00 00 68 00 00 00 00 00 00 00 ff 04 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 05 78 81 00 78 00 e0 fe
70: 00 00 00 00 00 00 00 00 10 b4 02 00 e0 8d 2c 01
80: 1e 29 19 00 03 3d 46 00 40 00 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 00 00 01 00 00 00 00 00
b0: 00 00 00 00 01 00 43 c8 0b 01 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

But the driver is unable to start the device:

  /var/log/dmesg:
[    5.930340] kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[    6.025948] kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  515.65.01  Wed Jul 20 14:00:58 UTC 2022
[    6.064369] kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  515.65.01  Wed Jul 20 13:43:59 UTC 2022
[    6.074221] kernel: [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
[    6.897433] kernel: NVRM: GPU 0000:3b:00.0: RmInitAdapter failed! (0x25:0xffff:1428)
[    6.897579] kernel: NVRM: GPU 0000:3b:00.0: rm_init_adapter failed, device minor number 0
[    6.898608] kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00003b00] Failed to allocate NvKmsKapiDevice
[    6.908882] kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00003b00] Failed to register device
[    6.909612] kernel: [drm] [nvidia-drm] [GPU ID 0x0000d800] Loading driver
[    7.573856] kernel: NVRM: GPU 0000:3b:00.0: RmInitAdapter failed! (0x25:0xffff:1428)
[    7.573941] kernel: NVRM: GPU 0000:3b:00.0: rm_init_adapter failed, device minor number 0
[   12.784922] kernel: NVRM: GPU 0000:3b:00.0: RmInitAdapter failed! (0x23:0x65:1382)
[   12.808434] kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:d8:00.0 on minor 1
[   12.819560] kernel: NVRM: GPU 0000:3b:00.0: rm_init_adapter failed, device minor number 0

Further information isn’t available from the logs.
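
To pull the same failure lines out of a longer log, the following is a minimal sketch that groups `RmInitAdapter failed!` messages by PCI address. The function name is hypothetical and the regular expression is an assumption based only on the message format shown above. It collects the parenthesised codes verbatim and does not attempt to interpret them.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "NVRM: GPU 0000:3b:00.0: RmInitAdapter failed! (0x25:0xffff:1428)"
RM_INIT_FAILED = re.compile(
    r"NVRM: GPU "
    r"(?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def rm_init_failures(log_text):
    """Map each PCI address to the list of RmInitAdapter failure codes logged for it."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = RM_INIT_FAILED.search(line)
        if match:
            failures[match.group("addr")].append(match.group("code"))
    return dict(failures)


def summarize(log_text):
    """Return one human-readable line per failing GPU, sorted by address."""
    return [
        f"{addr}: {len(codes)} failed init attempt(s), codes: {', '.join(codes)}"
        for addr, codes in sorted(rm_init_failures(log_text).items())
    ]
```

Feeding it the text of /var/log/dmesg (or the output of the dmesg command) gives one summary line per failing GPU. A GPU that initialised normally, such as d8:00.0 here, does not appear in the result.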

Try isolating the faulty GPU. Plug one GPU into the system, power it up, and note the behavior. Then power down, remove that GPU, plug the other GPU into the system in exactly the same way (same slot, same aux power dongle), power up, and note the behavior. If one GPU works and the other doesn’t, it is probably a GPU hardware failure.

Quadro RTX cards should carry a warranty for the original purchaser; check whether it has expired.

I probably won’t be able to provide further assistance here.