Second GPU not detected on Ubuntu 18.04.4

Hi everyone!

I looked for an existing topic about this issue but couldn’t find one, so I decided to post my problem here.

First, here is my system:

  • Motherboard: Asus WS X299 SAGE
  • GPUs: 2x NVIDIA Quadro RTX 8000
  • Driver: 450.57
  • CUDA toolkit: Cuda compilation tools, release 10.0, V10.0.130
  • Ubuntu 18.04.4

Here is an example of what I got with nvidia-smi when everything was normal:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:67:00.0 Off |                  Off |
| 33%   32C    P8    11W / 260W |      1MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:68:00.0 Off |                  Off |
| 33%   36C    P8    14W / 260W |     18MiB / 48598MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      1434      G   /usr/lib/xorg/Xorg                 10MiB |
|    1   N/A  N/A      1510      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

Last week my GPUs stopped being detected: when I ran nvidia-smi I got "Unable to determine the device handle for GPU 0000:67:00.0: Unknown Error". I restarted my workstation, but since then only one GPU has been detected. Here is the nvidia-smi output now:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:68:00.0 Off |                  Off |
| 33%   30C    P8     9W / 260W |    336MiB / 48598MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1317      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A      1373      G   /usr/bin/gnome-shell               29MiB |
|    0   N/A  N/A      2385      G   /usr/lib/xorg/Xorg                122MiB |
|    0   N/A  N/A      2518      G   /usr/bin/gnome-shell               95MiB |
|    0   N/A  N/A      2572      G   ...mviewer/tv_bin/TeamViewer       16MiB |
|    0   N/A  N/A     27076      G   gnome-control-center               27MiB |
+-----------------------------------------------------------------------------+
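
For anyone comparing a "before" and "after" nvidia-smi dump like the two above, here is a rough Python sketch that pulls the Bus-Id values out of the text and reports which ones disappeared. The function names (bus_ids, missing_gpus) are made up for this example, and the regex only assumes the 8-hex-digit-domain Bus-Id format visible in the tables above.

```python
import re

# Bus-Id as printed in the nvidia-smi table, e.g. 00000000:67:00.0
# (8 hex digit domain, 2 hex digit bus, 2 hex digit device, 1 digit function).
BUS_ID = re.compile(r"\b[0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7]\b")


def bus_ids(smi_text):
    """Return the sorted, de-duplicated Bus-Id strings found in nvidia-smi text."""
    return sorted({m.group(0).lower() for m in BUS_ID.finditer(smi_text)})


def missing_gpus(before_text, after_text):
    """Return Bus-Ids present in the first dump but absent from the second."""
    return sorted(set(bus_ids(before_text)) - set(bus_ids(after_text)))
```

Fed the two dumps from this post as strings, missing_gpus should single out the 67:00.0 card. If you would rather not parse the table, nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader prints the same Bus-Ids one per line.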

It would be great if you guys could give some help.

Thanks in advance.

Hi! I may have the same problem.


Can you show the output of these commands:

dmidecode --type baseboard

uname -r

lspci -s 67:00.0 -vv

lspci -s 68:00.0 -vv

cat /var/log/syslog | grep 67:

cat /var/log/syslog | grep 68:
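
A quick complementary check, sketched in Python: Linux lists every PCI device the kernel has enumerated as a directory under /sys/bus/pci/devices, named with a 4-hex-digit domain (for example 0000:67:00.0). The helper names below are invented for this example, and the short form without a domain is assumed to mean domain 0000.

```python
import os


def to_sysfs_address(bus_id):
    """Normalize '00000000:67:00.0' (nvidia-smi) or '67:00.0' (lspci) to '0000:67:00.0'."""
    parts = bus_id.strip().lower().split(":")
    if len(parts) == 2:
        # Short lspci form without a domain: assume domain 0000.
        parts = ["0000"] + parts
    if len(parts) != 3:
        raise ValueError(f"unrecognized PCI address: {bus_id!r}")
    domain, bus, dev_fn = parts
    return f"{int(domain, 16):04x}:{bus}:{dev_fn}"


def pci_device_present(bus_id, sysfs_root="/sys/bus/pci/devices"):
    """True if the kernel currently lists this device on the PCI bus."""
    return os.path.exists(os.path.join(sysfs_root, to_sysfs_address(bus_id)))
```

If pci_device_present("67:00.0") is False, the card is missing at the PCI level, so the driver never gets a chance to see it; that points at slot, power, or firmware rather than at the NVIDIA driver itself.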

Hi! Here are the outputs, in order:

  • dmidecode --type baseboard:

# dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: WS X299 SAGE
Version: Rev 1.xx
Serial Number: 191060840100289
Asset Tag: Default string
Features:
Board is a hosting board
Board is replaceable
Location In Chassis: Default string
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0

Handle 0x0030, DMI type 10, 6 bytes
On Board Device Information
Type: Video
Status: Enabled
Description: To Be Filled By O.E.M.

Handle 0x0064, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 1
Bus Address: 0000:00:00.0

Handle 0x0065, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 2
Bus Address: 0000:00:04.0

Handle 0x0066, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 3
Bus Address: 0000:00:04.1

Handle 0x0067, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 4
Bus Address: 0000:00:04.2

Handle 0x0068, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 5
Bus Address: 0000:00:04.3

Handle 0x0069, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 6
Bus Address: 0000:00:04.4

Handle 0x006A, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 7
Bus Address: 0000:00:04.5

Handle 0x006B, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 8
Bus Address: 0000:00:04.6

Handle 0x006C, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 9
Bus Address: 0000:00:04.7

Handle 0x006D, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 10
Bus Address: 0000:00:05.0

Handle 0x006E, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 11
Bus Address: 0000:00:05.2

Handle 0x006F, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 12
Bus Address: 0000:00:05.4

Handle 0x0070, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 13
Bus Address: 0000:00:08.0

Handle 0x0071, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 14
Bus Address: 0000:00:08.1

Handle 0x0072, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 15
Bus Address: 0000:00:08.2

Handle 0x0073, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 16
Bus Address: 0000:00:14.0

Handle 0x0074, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 17
Bus Address: 0000:00:14.2

Handle 0x0075, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 18
Bus Address: 0000:00:16.0

Handle 0x0076, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - SATA
Type: SATA Controller
Status: Enabled
Type Instance: 1
Bus Address: 0000:00:17.0

Handle 0x0077, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 19
Bus Address: 0000:00:1f.0

Handle 0x0078, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 20
Bus Address: 0000:00:1f.2

Handle 0x0079, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Sound
Type: Sound
Status: Enabled
Type Instance: 1
Bus Address: 0000:00:1f.3

Handle 0x007A, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Other
Type: Other
Status: Enabled
Type Instance: 21
Bus Address: 0000:00:1f.4

Handle 0x007B, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard - Ethernet
Type: Ethernet
Status: Enabled
Type Instance: 1
Bus Address: 0000:00:1f.6

  • uname -r:

5.4.0-53-generic

  • lspci -s 67:00.0 -vv:
    nothing returned

  • lspci -s 68:00.0 -vv:

68:00.0 VGA compatible controller: NVIDIA Corporation Device 1e30 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 129e
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 106
NUMA node: 0
Region 0: Memory at d7000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at c0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at d0000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at b000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities:
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

  • cat /var/log/syslog | grep 67::
    nothing returned

  • cat /var/log/syslog | grep 68::
    nothing returned
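
A side note on the two grep commands: "67:" matches any line that happens to contain those three characters, not only PCI addresses, and it misses driver messages that do not print the address at all. Below is a hedged Python sketch of a stricter filter. The function name device_log_lines is made up, the address pattern assumes the usual [domain:]bus:device.function notation, and "NVRM" is used as a keyword because the NVIDIA kernel module prefixes its kernel log messages with it.

```python
import re


def device_log_lines(lines, bus, keywords=("NVRM",)):
    """Keep lines that mention a PCI address on the given bus, or contain a keyword.

    bus is the two-hex-digit bus number as a string, e.g. "67".
    """
    address = re.compile(
        # Optional 4 to 8 hex digit domain, then bus:device.function.
        rf"(?<![0-9a-f:])(?:[0-9a-f]{{4,8}}:)?{re.escape(bus)}:[0-9a-f]{{2}}\.[0-7](?![0-9a-f])",
        re.IGNORECASE,
    )
    return [
        line
        for line in lines
        if address.search(line) or any(word in line for word in keywords)
    ]
```

It can be run over the log with something like device_log_lines(open("/var/log/syslog", errors="replace"), "67"). Reading /var/log/syslog may need root or membership in the adm group, and messages from before the last rotation live in /var/log/syslog.1 and the compressed older files.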

Hm… Your GPU is not visible on the PCI bus. Can you try swapping the two GPUs between their slots?
And show the output of:

lspci -tvv

cat /var/log/syslog

dmidecode --type 0

Hi!
Sorry for the delay. Today I tried a full power off and start (instead of a restart) and the second GPU reappeared, so it seems to be working again now. I don’t know what happened…