7 out of 8 GPUs on G291 with Debian 10

On Debian 10, nvidia-smi (version 430.26) sees only 7 of the 8 RTX 2080 Ti GPUs, and dmesg shows the following lines:

[ 5068.393890] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:06:00.0)
[ 5068.393893] NVRM: The system BIOS may have misconfigured your GPU.
[ 5068.393913] nvidia: probe of 0000:06:00.0 failed with error -1
[ 5068.393935] NVRM: The NVIDIA probe routine failed for 1 device(s).

Btw., I have two of these machines, and on both the failing device is the same: PCI:0000:06:00.0.

Hardware info:
description: Rack Mount Chassis
product: G291-Z20-00 (01234567890123456789AB)
vendor: GIGABYTE
version: 0100
serial: GJG1N3421A0005
width: 64 bits
capabilities: smbios-3.1.1 dmi-3.1.1 smp vsyscall32
configuration: chassis=rackmount family=Server sku=01234567890123456789AB uuid=008093DB-74FD-E711-8000-B42E992DDD78
*-core
description: Motherboard
product: MZ21-G20-00
vendor: GIGABYTE
physical id: 0
version: 01000100
serial: JH1N3400011
slot: 01234567890123456789AB
*-firmware
description: BIOS
vendor: GIGABYTE
physical id: 0
version: F03
date: 03/29/2019
size: 64KiB
capacity: 16MiB
capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int14serial int17printer acpi usb biosbootspecification uefi
*-cpu
description: CPU
product: AMD EPYC 7351P 16-Core Processor
vendor: Advanced Micro Devices [AMD]
physical id: 47
bus info: cpu@0
version: AMD EPYC 7351P 16-Core Processor
serial: Unknown
slot: P0
size: 1173MHz
capacity: 2900MHz
width: 64 bits
clock: 100MHz

Could you help me to find the eighth GPU?
Many thanks and best regards!

lspci |grep VGA|grep NVIDIA
05:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
27:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
28:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
43:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
44:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
63:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
64:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)

nvidia-smi
Tue Jul 2 22:33:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26 Driver Version: 430.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208… Off | 00000000:05:00.0 Off | N/A |
| 0% 34C P0 44W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208… Off | 00000000:27:00.0 Off | N/A |
| 0% 33C P0 50W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208… Off | 00000000:28:00.0 Off | N/A |
| 0% 33C P0 52W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208… Off | 00000000:43:00.0 Off | N/A |
| 0% 32C P0 56W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208… Off | 00000000:44:00.0 Off | N/A |
| 0% 33C P0 49W / 250W | 0MiB / 11019MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208… Off | 00000000:63:00.0 Off | N/A |
| 0% 34C P0 43W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208… Off | 00000000:64:00.0 Off | N/A |
| 0% 35C P0 1W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
nvidia-bug-report.log.gz (2.4 MB)
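
To pin down which device is missing without reading the two listings side by side, the bus IDs from lspci can be compared with the Bus-Id column of nvidia-smi. The following is only a minimal sketch: the function names are made up for illustration, and the sample strings are copied from the output in this post (the nvidia-smi sample is reduced to the Bus-Id values).

```python
import re

# Sample data copied from the lspci and nvidia-smi output in this thread.
LSPCI_SAMPLE = """\
05:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
27:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
28:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
43:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
44:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
63:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
64:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
"""

SMI_SAMPLE = """\
00000000:05:00.0
00000000:27:00.0
00000000:28:00.0
00000000:43:00.0
00000000:44:00.0
00000000:63:00.0
00000000:64:00.0
"""

LSPCI_RE = re.compile(
    r"([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) VGA compatible controller: NVIDIA"
)
SMI_RE = re.compile(r"\b[0-9A-Fa-f]{8}:([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7])\b")


def lspci_bus_ids(text):
    """Bus IDs (bus:device.function) of NVIDIA VGA controllers in lspci output."""
    return {m.group(1) for m in map(LSPCI_RE.match, text.splitlines()) if m}


def smi_bus_ids(text):
    """Bus IDs found in nvidia-smi output, with the 8-digit domain prefix dropped."""
    return {m.group(1).lower() for m in SMI_RE.finditer(text)}


def missing_gpus(lspci_text, smi_text):
    """GPUs that are on the PCI bus but were not initialised by the driver."""
    return sorted(lspci_bus_ids(lspci_text) - smi_bus_ids(smi_text))


if __name__ == "__main__":
    print(missing_gpus(LSPCI_SAMPLE, SMI_SAMPLE))
```

On a live system the two sample strings would be replaced by the captured output of the two commands.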

Please check your BIOS for an option like
above 4G decoding
large/64bit BARs
and enable it.
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

“above 4G decoding” is enabled and I attached the nvidia-bug-report.log.gz file to my previous post. I also disabled “IOMMU function” due to a hint from Gigabyte.

Looks like a BIOS bug:

[    0.221562] pci 0000:06:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[    0.221564] pci 0000:06:00.0: BAR 1: trying firmware assignment [mem 0x1ffe0000000-0x1ffefffffff 64bit pref]
[    0.221565] pci 0000:06:00.0: BAR 1: [mem 0x1ffe0000000-0x1ffefffffff 64bit pref] conflicts with PCI Bus 0000:00 [mem 0x191d4000000-0x1ffffffffff window]
[    0.221566] pci 0000:06:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[    0.221568] pci 0000:06:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    0.221569] pci 0000:06:00.0: BAR 3: trying firmware assignment [mem 0x1fff0000000-0x1fff1ffffff 64bit pref]
[    0.221570] pci 0000:06:00.0: BAR 3: [mem 0x1fff0000000-0x1fff1ffffff 64bit pref] conflicts with PCI Bus 0000:00 [mem 0x191d4000000-0x1ffffffffff window]
[    0.221571] pci 0000:06:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
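
As a rough aid for reading such logs, the "failed to assign" lines can be collected and the requested sizes summed per device. This is a sketch only: the regular expression is written against the message format shown above, the function name is hypothetical, and the sample text holds just the two "failed to assign" lines from this excerpt.

```python
import re

# The two "failed to assign" lines from the dmesg excerpt above.
DMESG_SAMPLE = """\
[    0.221566] pci 0000:06:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[    0.221571] pci 0000:06:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
"""

FAILED_RE = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): failed to assign \[mem size (?P<size>0x[0-9a-f]+)"
)


def failed_bars(dmesg_text):
    """Return (device, BAR index, size in bytes) for every failed BAR assignment."""
    return [
        (m.group("dev"), int(m.group("bar")), int(m.group("size"), 16))
        for m in FAILED_RE.finditer(dmesg_text)
    ]


if __name__ == "__main__":
    mib = 1024 * 1024
    for dev, bar, size in failed_bars(DMESG_SAMPLE):
        print(f"{dev} BAR {bar}: {size // mib} MiB could not be placed")
    total = sum(size for _, _, size in failed_bars(DMESG_SAMPLE))
    print(f"total unassigned prefetchable space: {total // mib} MiB")
```

For the excerpt above this comes to 256 MiB for BAR 1 plus 32 MiB for BAR 3, i.e. 288 MiB of 64-bit prefetchable space that the kernel could not place for 0000:06:00.0.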

Please try with kernel parameters
pci=realloc
or
pci=nocrs

With pci=realloc, only 6 out of 8 GPUs are found (2_nvidia-bug-report.log.gz):

[ 16.539286] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 16.550782] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 16.551761] nvidia 0000:05:00.0: enabling device (0000 -> 0002)
[ 16.551871] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:05:00.0)
[ 16.551872] NVRM: The system BIOS may have misconfigured your GPU.
[ 16.551886] nvidia: probe of 0000:05:00.0 failed with error -1
[ 16.551918] nvidia 0000:06:00.0: enabling device (0000 -> 0002)
[ 16.551957] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:06:00.0)
[ 16.551958] NVRM: The system BIOS may have misconfigured your GPU.
[ 16.551966] nvidia: probe of 0000:06:00.0 failed with error -1

With pci=nocrs, there is no network and 7 out of 8 GPUs are found (3_nvidia-bug-report.log.gz).
3_nvidia-bug-report.log.gz (3.92 MB)
2_nvidia-bug-report.log.gz (3.51 MB)

No dice. The BARs just don’t fit in the space the BIOS reserved and marked prefetchable, since all the system devices are also in that space under root bridge 0000:00, where the GPUs 0000:06 and 0000:05 reside.
Looking at the BIOS history, you’re currently on the latest version, F03, in which Gigabyte already increased PCI resources to enable 4 RTX cards to run. So they will have to increase them again to get 8 RTX cards to run.

Gigabyte told me that this configuration runs with, e.g., CentOS. So why not with Debian?

It’s a low-level resource issue, so unless Red Hat has patched their kernel for this, I wouldn’t believe it without seeing it. You could check whether a newer kernel like 5.1 has some kind of mitigation.

I’m already compiling 5.1.

Kernel 5.1.11 also finds only 7 out of 8!

[ 269.328609] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 269.329433] nvidia 0000:05:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 269.429329] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:06:00.0)
[ 269.429331] NVRM: The system BIOS may have misconfigured your GPU.
[ 269.429352] nvidia: probe of 0000:06:00.0 failed with error -1
[ 269.429558] nvidia 0000:27:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 269.529456] nvidia 0000:28:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 269.629291] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 269.729316] nvidia 0000:44:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 269.829176] nvidia 0000:63:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 269.929196] nvidia 0000:64:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 270.028908] NVRM: The NVIDIA probe routine failed for 1 device(s).

Looking at this a bit more deeply, it looks like a plain BIOS bug that begins earlier:
The root bridge 0000:00 has enough space

pci_bus 0000:00: root bus resource [mem 0x191d4000000-0x1ffffffffff window]

but the downstream switch only gets assigned 32bit resources

[    0.158554] pci 0000:00:03.1: PCI bridge to [bus 03-06]
[    0.158557] pci 0000:00:03.1:   bridge window [io  0x2000-0x3fff]
[    0.158559] pci 0000:00:03.1:   bridge window [mem 0xec000000-0xef1fffff]
[    0.158562] pci 0000:00:03.1:   bridge window [mem 0xd0000000-0xe20fffff 64bit pref]

which is only enough for one GPU.
On the other root buses 20, 40, 60, the 64bit window gets assigned to the downstream bridges, e.g.:

[    0.163044] pci_bus 0000:20: root bus resource [mem 0x123a8000000-0x191d3ffffff window]

[    0.222106] pci 0000:25:00.0: PCI bridge to [bus 26-28]
[    0.222107] pci 0000:25:00.0:   bridge window [io  0x5000-0x6fff]
[    0.222110] pci 0000:25:00.0:   bridge window [mem 0xc7000000-0xca0fffff]
[    0.222112] pci 0000:25:00.0:   bridge window [mem 0x19190000000-0x191c20fffff 64bit pref]
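
The "only enough for one GPU" claim can be checked with a little arithmetic. The sketch below takes the two 64bit pref bridge windows quoted above and the BAR 1 / BAR 3 sizes from the earlier "failed to assign" lines (0x10000000 and 0x02000000). It assumes those two BARs are the only large prefetchable regions per GPU and ignores alignment and any other devices behind the bridge, so it is an estimate rather than a model of the kernel's allocator.

```python
MIB = 1024 * 1024

# Prefetchable 64-bit BARs of one GPU, sizes taken from the dmesg lines
# "BAR 1: no space for [mem size 0x10000000 ...]" and
# "BAR 3: no space for [mem size 0x02000000 ...]" earlier in the thread.
GPU_PREF_BARS = {"BAR1": 0x10000000, "BAR3": 0x02000000}
PER_GPU = sum(GPU_PREF_BARS.values())

# 64bit pref bridge windows quoted above (inclusive start and end addresses).
WINDOWS = {
    "0000:00:03.1 [bus 03-06]": (0xD0000000, 0xE20FFFFF),
    "0000:25:00.0 [bus 26-28]": (0x19190000000, 0x191C20FFFFF),
}


def window_size(start, end):
    """Size in bytes of an inclusive address range [start, end]."""
    return end - start + 1


def gpus_that_fit(start, end, per_gpu=PER_GPU):
    """Upper bound on GPUs whose prefetchable BARs fit (alignment ignored)."""
    return window_size(start, end) // per_gpu


if __name__ == "__main__":
    print(f"per GPU: {PER_GPU // MIB} MiB prefetchable")
    for name, (start, end) in WINDOWS.items():
        size = window_size(start, end) // MIB
        print(f"{name}: {size} MiB window, room for {gpus_that_fit(start, end)} GPU(s)")
```

Under these assumptions the window behind 0000:00:03.1 is 289 MiB, which holds one GPU's 288 MiB, while the window behind 0000:25:00.0 is 801 MiB, which holds two.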

new comment from Gigabyte:

Since G291-Z20 BIOS F03 , already increase PCIe resource.
No other BIOS solution for this case. ( 8 GPU on CentOS 7.6 OK, But Debian Failed )

But:

I installed CentOS Linux release 7.6.1810 (Core)
Linux gpu07 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
Fri Jul 5 10:08:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208… Off | 00000000:05:00.0 Off | N/A |
| 0% 40C P0 45W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208… Off | 00000000:27:00.0 Off | N/A |
| 0% 39C P0 62W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208… Off | 00000000:28:00.0 Off | N/A |
| 0% 38C P0 65W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208… Off | 00000000:43:00.0 Off | N/A |
| 0% 38C P0 55W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208… Off | 00000000:44:00.0 Off | N/A |
| 0% 37C P0 42W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208… Off | 00000000:63:00.0 Off | N/A |
| 0% 42C P0 65W / 250W | 0MiB / 10989MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208… Off | 00000000:64:00.0 Off | N/A |
| 0% 41C P0 68W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

[ 69.614998] nvidia 0000:05:00.0: irq 396 for MSI/MSI-X
[ 71.522544] nvidia 0000:27:00.0: irq 397 for MSI/MSI-X
[ 73.187843] nvidia 0000:28:00.0: irq 398 for MSI/MSI-X
[ 74.437220] nvidia 0000:43:00.0: irq 399 for MSI/MSI-X
[ 76.155431] nvidia 0000:44:00.0: irq 400 for MSI/MSI-X
[ 77.413776] nvidia 0000:63:00.0: irq 401 for MSI/MSI-X
[ 79.175896] nvidia 0000:64:00.0: irq 402 for MSI/MSI-X

CentOS still finds only 7 out of 8, even with the newer 418.67 driver:

nvidia-smi
Fri Jul 5 10:27:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208… Off | 00000000:05:00.0 Off | N/A |
| 0% 39C P0 44W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208… Off | 00000000:27:00.0 Off | N/A |
| 0% 37C P0 61W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208… Off | 00000000:28:00.0 Off | N/A |
| 0% 36C P0 64W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208… Off | 00000000:43:00.0 Off | N/A |
| 0% 37C P0 55W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208… Off | 00000000:44:00.0 Off | N/A |
| 0% 36C P0 41W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208… Off | 00000000:63:00.0 Off | N/A |
| 0% 40C P0 63W / 250W | 0MiB / 10989MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208… Off | 00000000:64:00.0 Off | N/A |
| 0% 39C P0 67W / 250W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

nvidia-bug-report.log.gz (3.3 MB)

So it isn’t working with the CentOS kernel, either.

Have you tried swapping the non-starting GPU with another one? If so, it’s time to RMA the G291-Z20 back to Gigabyte.

After the last answer from Gigabyte:

The RD team try added PCIe resources to the max, please download the test BIOS F5a from FTP link.
But RD still not duplicate the any lose GPU problem.

F05a 64bit MMIO resource : 0x2ffffffffff - 0x25590000000 = AA6FFFFFFF
[ 1.088547] pci_bus 0000:00: root bus resource [io 0x0000-0x02ff window]
[ 1.095332] pci_bus 0000:00: root bus resource [io 0x0300-0x03af window]
[ 1.102119] pci_bus 0000:00: root bus resource [io 0x03e0-0x0cf7 window]
[ 1.108904] pci_bus 0000:00: root bus resource [io 0x0d00-0x3fff window]
[ 1.115690] pci_bus 0000:00: root bus resource [mem 0x000c0000-0x000dffff window]
[ 1.123171] pci_bus 0000:00: root bus resource [mem 0xe8000000-0xefffffff window]
[ 1.130651] pci_bus 0000:00: root bus resource [mem 0x25590000000-0x2ffffffffff window]
[ 1.138654] pci_bus 0000:00: root bus resource [bus 00-1f]
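
The figure quoted for F05a can be verified, and compared with the 64-bit root bus window that F03 reported earlier in this thread ([mem 0x191d4000000-0x1ffffffffff window]), with a few lines of arithmetic. This is just a sketch of the calculation; the variable and function names are illustrative.

```python
GIB = 1024 ** 3

# 64-bit root bus 0000:00 windows (inclusive start, end) as printed in dmesg.
F03_WINDOW = (0x191D4000000, 0x1FFFFFFFFFF)   # BIOS F03, from the earlier Debian log
F05A_WINDOW = (0x25590000000, 0x2FFFFFFFFFF)  # test BIOS F05a, from the log above


def window_bytes(window):
    """Number of bytes covered by an inclusive (start, end) range."""
    start, end = window
    return end - start + 1


if __name__ == "__main__":
    start, end = F05A_WINDOW
    # Gigabyte's figure is end - start, i.e. one byte less than the window size.
    print(f"end - start = {end - start:X}")
    for name, win in (("F03", F03_WINDOW), ("F05a", F05A_WINDOW)):
        print(f"{name}: {window_bytes(win) / GIB} GiB")
```

The difference 0x2ffffffffff - 0x25590000000 is indeed 0xAA6FFFFFFF, so the F05a window covers 681.75 GiB, compared with 440.6875 GiB under F03.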

Legacy is still not working, but now nvidia-smi on Debian 10 in UEFI mode finds all 8 GPUs.

The root bus was never the problem, but rather the downstream bus:

[    0.158554] pci 0000:00:03.1: PCI bridge to [bus 03-06]
[    0.158557] pci 0000:00:03.1:   bridge window [io  0x2000-0x3fff]
[    0.158559] pci 0000:00:03.1:   bridge window [mem 0xec000000-0xef1fffff]
[    0.158562] pci 0000:00:03.1:   bridge window [mem 0xd0000000-0xe20fffff 64bit pref]

Did GB add new resources to this (in UEFI mode)?

That’s the change (or one of the changes) in the second update; this bit of information is all I got. UEFI is OK, Legacy is still a problem.

I can confirm that this problem goes away when using UEFI mode (on CentOS).
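
Since the outcome depends on the boot mode, it is worth confirming how a given machine was actually booted. On Linux, the directory /sys/firmware/efi exists only when the kernel was started via UEFI; a minimal check, with a hypothetical function name, might look like this:

```python
import os


def booted_via_uefi(efi_dir="/sys/firmware/efi"):
    """True if the running Linux kernel was booted in UEFI mode.

    The kernel exposes /sys/firmware/efi only on UEFI boots; the path is a
    parameter so the check can be tried against another directory.
    """
    return os.path.isdir(efi_dir)


if __name__ == "__main__":
    print("UEFI" if booted_via_uefi() else "Legacy BIOS (or EFI info not exposed)")
```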