5 out of 8 GPUs are not detected with nvidia-smi

Hello All!

I have a server with 8 NVIDIA RTX 2080 Ti GPUs. When I run nvidia-smi, I get the following output:

Fri Mar 31 10:53:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti      On | 00000000:60:00.0 Off |                  N/A |
|  0%   26C    P8                1W / 250W|      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti      On | 00000000:B1:00.0 Off |                  N/A |
|  0%   23C    P8                1W / 250W|      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti      On | 00000000:DB:00.0 Off |                  N/A |
|  0%   23C    P8                1W / 250W|      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Only 3 GPUs are detected, although I have 8 installed.
Below is the output of the bug-report script:
nvidia-bug-report.log (466.6 KB)

Hi @ameenali023!

Check the output of
/usr/bin/lspci -d "10de:*" -v -xxx

You will find that your system only recognizes 3 GPUs to begin with.
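
As a rough illustration of that check, here is a minimal Python sketch that counts the NVIDIA display functions in lspci text output. The function name `nvidia_gpu_addresses` and the shortened sample text are made up for this example; the sample lines follow the format of the lspci output posted in this thread.

```python
import re

# Matches top-level lspci device lines such as
#   "60:00.0 VGA compatible controller: NVIDIA Corporation TU102 ..."
# An optional PCI domain prefix ("0000:") is accepted as well.
# Indented detail lines and hex-dump lines do not match.
_GPU_LINE = re.compile(
    r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7] "
    r"(?:VGA compatible controller|3D controller): NVIDIA",
    re.IGNORECASE,
)


def nvidia_gpu_addresses(lspci_text):
    """Return the PCI addresses of NVIDIA display functions found in lspci text."""
    return [
        line.split(" ", 1)[0]
        for line in lspci_text.splitlines()
        if _GPU_LINE.match(line)
    ]


# Shortened sample in the same format as the output in this thread
# (hypothetical excerpt, only for demonstration).
sample = """\
60:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
\tKernel driver in use: nvidia
00: de 10 04 1e 07 05 10 00 a1 00 00 03 00 00 80 00
60:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
b1:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
db:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
"""

print(nvidia_gpu_addresses(sample))
```

Audio, USB, and UCSI functions of the same card are deliberately not counted, so each physical GPU shows up once. On a live system the text would come from running lspci yourself; that part is left out so the sketch does not depend on the tool being installed.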

  • Did you make sure that the board supports 8 PCIe cards?
  • The Xeon 4210 has only 48 PCIe lanes, which means you can run 8 GPUs at PCIe x4 each at most. You might need to change that in your BIOS. You seem to have a dual-Xeon server, so you might be able to increase to x8, depending on other hardware, such as NVMe SSDs, that uses up PCIe lanes
  • Do you have sufficient power supplied to the GPUs?
  • Try switching them out between slots or simply re-seating them
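
The lane arithmetic from the second bullet can be sketched as follows. This is a simplification that assumes the lanes are split evenly across the GPUs, that no PCIe switch chips sit on the board, and that only the common slot widths x1, x2, x4, x8 and x16 are available. The function name `max_even_link_width` and the `reserved_lanes` parameter are hypothetical and only serve this example.

```python
# Common PCIe slot widths, widest first (assumption: no x12 or x32 slots).
LINK_WIDTHS = (16, 8, 4, 2, 1)


def max_even_link_width(total_lanes, num_devices, reserved_lanes=0):
    """Widest common link width every device can get if lanes are split evenly.

    reserved_lanes models lanes consumed by other hardware (for example
    NVMe SSDs). Returns 0 if there are not enough lanes for an x1 link each.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    per_device = (total_lanes - reserved_lanes) // num_devices
    for width in LINK_WIDTHS:
        if width <= per_device:
            return width
    return 0


# One CPU with 48 lanes shared by 8 GPUs: 48 // 8 = 6, so x4 is the widest fit.
print(max_even_link_width(48, 8))
# Two such CPUs, 96 lanes in total: 96 // 8 = 12, so x8 fits.
print(max_even_link_width(96, 8))
```

How the lanes are actually routed depends on the motherboard, so the result is an upper bound for planning and not a statement about this particular server.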

In general, there is no guarantee that this will work at all, since the 2080 Ti is NOT a server-certified GPU.

I hope this helps.

Thanks for your reply @MarkusHoHo
Below is the output of the command:

60:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Gigabyte Technology Co., Ltd TU102 [GeForce RTX 2080 Ti]
	Flags: bus master, fast devsel, latency 0, IRQ 176, NUMA node 0
	Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 7ffe0000000 (64-bit, prefetchable) [size=256M]
	Memory at 7fff0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at 9000 [size=128]
	Expansion ROM at c5000000 [virtual] [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
lspci: Unable to load libkmod resources: error -12
00: de 10 04 1e 07 05 10 00 a1 00 00 03 00 00 80 00
10: 00 00 00 c4 0c 00 00 e0 ff 07 00 00 0c 00 00 f0
20: ff 07 00 00 01 90 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 60 00 00 00 00 00 00 00 0b 01 00 00

60:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd TU102 High Definition Audio Controller
	Flags: bus master, fast devsel, latency 0, IRQ 173, NUMA node 0
	Memory at c5080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
00: de 10 f7 10 06 01 10 00 a1 00 03 04 08 00 80 00
10: 00 00 08 c5 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 60 00 00 00 00 00 00 00 0a 02 00 00

60:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
	Subsystem: Gigabyte Technology Co., Ltd TU102 USB 3.1 Host Controller
	Flags: fast devsel, IRQ 49, NUMA node 0
	Memory at 7fff2000000 (64-bit, prefetchable) [size=256K]
	Memory at 7fff2040000 (64-bit, prefetchable) [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: xhci_hcd
00: de 10 d6 1a 02 05 10 00 a1 30 03 0c 10 00 80 00
10: 0c 00 00 f2 ff 07 00 00 00 00 00 00 0c 00 04 f2
20: ff 07 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 68 00 00 00 00 00 00 00 0b 03 00 00

60:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd TU102 USB Type-C UCSI Controller
	Flags: bus master, fast devsel, latency 0, IRQ 56, NUMA node 0
	Memory at c5084000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia-gpu
00: de 10 d7 1a 06 05 10 00 a1 00 80 0c 08 00 80 00
10: 00 40 08 c5 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 68 00 00 00 00 00 00 00 0b 04 00 00

b1:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Gigabyte Technology Co., Ltd TU102 [GeForce RTX 2080 Ti]
	Flags: bus master, fast devsel, latency 0, IRQ 177, NUMA node 1
	Memory at ed000000 (32-bit, non-prefetchable) [size=16M]
	Memory at affe0000000 (64-bit, prefetchable) [size=256M]
	Memory at afff0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at ee000000 [virtual] [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
00: de 10 04 1e 07 05 10 00 a1 00 00 03 00 00 80 00
10: 00 00 00 ed 0c 00 00 e0 ff 0a 00 00 0c 00 00 f0
20: ff 0a 00 00 01 e0 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 60 00 00 00 00 00 00 00 0b 01 00 00

b1:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd TU102 High Definition Audio Controller
	Flags: bus master, fast devsel, latency 0, IRQ 174, NUMA node 1
	Memory at ee080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
00: de 10 f7 10 06 01 10 00 a1 00 03 04 08 00 80 00
10: 00 00 08 ee 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 60 00 00 00 00 00 00 00 0a 02 00 00

b1:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
	Subsystem: Gigabyte Technology Co., Ltd TU102 USB 3.1 Host Controller
	Flags: fast devsel, IRQ 51, NUMA node 1
	Memory at afff2000000 (64-bit, prefetchable) [size=256K]
	Memory at afff2040000 (64-bit, prefetchable) [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: xhci_hcd
00: de 10 d6 1a 02 05 10 00 a1 30 03 0c 10 00 80 00
10: 0c 00 00 f2 ff 0a 00 00 00 00 00 00 0c 00 04 f2
20: ff 0a 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 68 00 00 00 00 00 00 00 0b 03 00 00

b1:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd TU102 USB Type-C UCSI Controller
	Flags: bus master, fast devsel, latency 0, IRQ 58, NUMA node 1
	Memory at ee084000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia-gpu
00: de 10 d7 1a 06 05 10 00 a1 00 80 0c 08 00 80 00
10: 00 40 08 ee 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 68 00 00 00 00 00 00 00 0b 04 00 00

db:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Gigabyte Technology Co., Ltd TU102 [GeForce RTX 2080 Ti]
	Flags: bus master, fast devsel, latency 0, IRQ 178, NUMA node 1
	Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
	Memory at bffe0000000 (64-bit, prefetchable) [size=256M]
	Memory at bfff0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at f000 [size=128]
	Expansion ROM at fb000000 [virtual] [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
00: de 10 04 1e 07 05 10 00 a1 00 00 03 00 00 80 00
10: 00 00 00 fa 0c 00 00 e0 ff 0b 00 00 0c 00 00 f0
20: ff 0b 00 00 01 f0 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 60 00 00 00 00 00 00 00 0b 01 00 00

db:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd TU102 High Definition Audio Controller
	Flags: bus master, fast devsel, latency 0, IRQ 175, NUMA node 1
	Memory at fb080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
00: de 10 f7 10 06 01 10 00 a1 00 03 04 08 00 80 00
10: 00 00 08 fb 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 60 00 00 00 00 00 00 00 0a 02 00 00

db:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
	Subsystem: Gigabyte Technology Co., Ltd TU102 USB 3.1 Host Controller
	Flags: fast devsel, IRQ 53, NUMA node 1
	Memory at bfff2000000 (64-bit, prefetchable) [size=256K]
	Memory at bfff2040000 (64-bit, prefetchable) [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: xhci_hcd
00: de 10 d6 1a 02 05 10 00 a1 30 03 0c 10 00 80 00
10: 0c 00 00 f2 ff 0b 00 00 00 00 00 00 0c 00 04 f2
20: ff 0b 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 68 00 00 00 00 00 00 00 0b 03 00 00

db:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd TU102 USB Type-C UCSI Controller
	Flags: bus master, fast devsel, latency 0, IRQ 60, NUMA node 1
	Memory at fb084000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia-gpu
00: de 10 d7 1a 06 05 10 00 a1 00 80 0c 08 00 80 00
10: 00 40 08 fb 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 58 14 d8 3f
30: 00 00 00 00 68 00 00 00 00 00 00 00 0b 04 00 00

Actually, I had all 8 GPUs working before, so the setup is supported and should work, but as of today it no longer shows all 8 GPUs.

That output was already part of nvidia-bug-report.log; that is how I knew that only 3 devices were recognized.
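
The reasoning behind that conclusion can be written down as a small triage sketch in Python. It only covers the rule of thumb used here: if lspci already lists fewer GPUs than are installed, the problem sits below the driver. The "driver" branch is an added assumption for completeness and is not the case in this thread. The function name `triage` is hypothetical.

```python
def triage(expected, lspci_count, smi_count):
    """Roughly locate where GPUs go missing.

    expected    -- number of GPUs physically installed
    lspci_count -- number of GPUs listed by lspci
    smi_count   -- number of GPUs listed by nvidia-smi
    """
    if lspci_count < expected:
        # The PCI bus itself does not see the cards: look at hardware,
        # BIOS settings, power, and seating before touching the driver.
        return "missing at PCI level"
    if smi_count < lspci_count:
        # Assumption: cards visible on the bus but absent from nvidia-smi
        # point to a driver-level problem.
        return "missing at driver level"
    return "all GPUs visible"


# The numbers from this thread: 8 installed, 3 in lspci, 3 in nvidia-smi.
print(triage(8, 3, 3))
```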

If it just changed, then you should check with your server hosting service what might have changed in the setup.

If really nothing was changed (no server OS update, driver updates, BIOS changes, power supply changes, power outages, etc.), then your actual hardware might be affected.

As I suggested, switch out the GPUs or ideally test them in an independent system and see if maybe some are broken.