TESLA V100S resource allocation issue that looks like a firmware bug, compared with a working TITAN Xp

We are in the process of upgrading our GPU compute station [HP ProLiant 380p] from a TITAN Xp (which worked fine) to a TESLA V100S.

I hope someone can help me better understand why the BIOS mis-allocates address space for the TESLA during PCI resource allocation on this x86 system:

TITAN Xp – working

21:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 11df
Physical Slot: 5
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 107
NUMA node: 1
Region 0: Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at de000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 8000 [size=128]
[virtual] Expansion ROM at faf00000 [disabled] [size=512K]
Capabilities:
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia

… startup logs…

Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.0: [10de:1b02] type 00 class 0x030000
Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.0: reg 0x10: [mem 0xfb000000-0xfbffffff]
Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.0: reg 0x14: [mem 0xe0000000-0xefffffff 64bit pref]
Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.0: reg 0x1c: [mem 0xde000000-0xdfffffff 64bit pref]
Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.0: reg 0x24: [io 0x8000-0x807f]
Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.1: [10de:10ef] type 00 class 0x040300
Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.1: reg 0x10: [mem 0xfaff0000-0xfaff3fff]
Nov 16 07:36:57 [kraken106] kernel: vgaarb: device added: PCI:0000:21:00.0,decodes=io+mem,owns=none,locks=none
Nov 16 07:36:57 [kraken106] kernel: vgaarb: bridge control possible 0000:21:00.0
Nov 16 07:36:57 [kraken106] kernel: pci 0000:21:00.0: BAR 6: assigned [mem 0xfaf00000-0xfaf7ffff pref]
Nov 16 07:36:57 [kraken106] kernel: pci_bus 0000:21: resource 0 [io 0x8000-0x8fff]
Nov 16 07:36:57 [kraken106] kernel: pci_bus 0000:21: resource 1 [mem 0xfaf00000-0xfbffffff]
Nov 16 07:36:57 [kraken106] kernel: pci_bus 0000:21: resource 2 [mem 0xde000000-0xefffffff 64bit pref]

TESLA V100S – NOT working

21:00.0 3D controller: NVIDIA Corporation Device 1df6 (rev a1)
Subsystem: NVIDIA Corporation Device 13d6
Physical Slot: 5
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 106
NUMA node: 1
Region 0: Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at (64-bit, prefetchable)
Region 3: Memory at f8000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [250 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [258 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Capabilities: [ac0 v1] #23
Kernel modules: nouveau, nvidia_drm, nvidia

… startup logs…

The TESLA is ASKING for 32 GiB (please note the extra f at the end of the end address) instead of 256 MiB, as the TITAN Xp does:
Oct 28 07:43:20 [kraken106] kernel: pci 0000:21:00.0: [10de:1df6] type 00 class 0x030200
Oct 28 07:43:20 [kraken106] kernel: pci 0000:21:00.0: reg 0x10: [mem 0xfb000000-0xfbffffff]
Oct 28 07:43:20 [kraken106] kernel: pci 0000:21:00.0: reg 0x14: [mem 0x00000000-0x7ffffffff 64bit pref]
Oct 28 07:43:20 [kraken106] kernel: pci 0000:21:00.0: reg 0x1c: [mem 0xf8000000-0xf9ffffff 64bit pref]
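
To make the size comparison concrete, here is a small Python sketch (my own illustration, not something that ran on the server). It takes the two reg 0x14 lines from the logs above and computes the size of each inclusive address range. The helper name bar_size_bytes is hypothetical.

```python
import re

# Matches the "[mem 0xSTART-0xEND" part of a kernel "reg" log line.
RANGE_RE = re.compile(r"\[mem 0x([0-9a-f]+)-0x([0-9a-f]+)")

def bar_size_bytes(log_line):
    """Return the size in bytes of the memory range in a kernel 'reg' line."""
    match = RANGE_RE.search(log_line)
    if match is None:
        raise ValueError("no [mem ...] range found")
    start, end = (int(x, 16) for x in match.groups())
    return end - start + 1  # the logged range is inclusive

# The reg 0x14 lines copied from the two startup logs.
titan = "pci 0000:21:00.0: reg 0x14: [mem 0xe0000000-0xefffffff 64bit pref]"
tesla = "pci 0000:21:00.0: reg 0x14: [mem 0x00000000-0x7ffffffff 64bit pref]"

for name, line in (("TITAN Xp", titan), ("TESLA V100S", tesla)):
    size = bar_size_bytes(line)
    print(f"{name}: {size} bytes = {size / 2**20:.0f} MiB")
```

This prints 256 MiB for the TITAN Xp and 32768 MiB (32 GiB) for the TESLA V100S.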

ZOOMING IN on the line where allocation failed, with the hex digits split into groups of four by | for clarity:

[mem 0x|0000|0000-0x|7fff|ffff|f 64bit pref]

There is an extra f in the end address: the start address has 8 hex digits, while the end address has 9! It really looks like a bug!

AFAIK the addresses assigned by the BIOS are stored in 32-bit registers on the device, so I would expect them to be 32-bit numbers, but 0x7ffffffff is a 35-bit number (due to that extra f at the end)!
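
To double-check the digit counting, here is a short Python sketch (again my own illustration). It counts the bits in the end address and shows how the value would split into 32-bit halves. The split assumes that the "64bit" flag in the log means the range is described by a pair of 32-bit registers. That is an assumption on my part and I have not verified it for this card. The helper name split_64bit is hypothetical.

```python
def split_64bit(value):
    """Split a value into its (low, high) 32-bit halves."""
    return value & 0xFFFFFFFF, value >> 32

end = 0x7ffffffff  # end address from the failing reg 0x14 line

# Number of bits needed to represent the end address.
print(end.bit_length())  # 35

# Does it fit in a single 32-bit register?
print(end <= 0xFFFFFFFF)  # False

# How it would be spread over a low/high register pair (assumption).
low, high = split_64bit(end)
print(hex(low), hex(high))  # 0xffffffff 0x7
```

So the value does not fit in one 32-bit register, but it would fit in a low/high pair if the card and BIOS both treat this range as 64-bit.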

See this for more references: https://wiki.osdev.org/PCI#Memory_Mapped_PCI_Configuration_Space_Access

So, the issue is most likely that the firmware of the TESLA card is not entirely compatible with our system.
AFAIK, it should not be compatible with any system, because the registers are 32-bit, not 35 or 64 bits, unless my age is showing and there is something I do not know about the latest hardware developments. :-)