PH402 dual P100 64G RmInitAdapter failed, memory mapping issue?

First and foremost, here is my nvidia-bug-report log:

I installed my Tesla PH402 dual P100 card in a desktop I had around, and I can’t get it detected properly by nvidia-smi or deviceDetect.

I tried Ubuntu 20.04 and CentOS 8 with the latest drivers (460 DKMS): nvidia-smi says no devices are detected when only the Tesla card is installed, and only sees the other GPU (a GT 640) when both are in.

I did a dmesg | grep ':03:' and got some interesting logs, which show that:

  1. The device is detected
  2. Memory is disabled?
  3. There seems to be a problem addressing the memory of the chips
  4. RmInitAdapter failed errors, maybe due to the memory assignment errors?

I guess I am either missing a kernel option or it is a hardware compatibility issue, maybe with the motherboard, which is a little old (Gigabyte Z77X-UD3H).

I also tried in a more recent desktop, but my Asus TUF Gaming X570 Plus refuses to boot at all; it doesn’t even POST with the card plugged in. I tried updating the BIOS to no avail.

I have two of those dual P100 cards and both show the exact same behavior.

 pci 0000:03:00.0: [10de:15fa] type 00 class 0x030200
 pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
 pci 0000:03:00.0: reg 0x14: [mem 0x00000000-0x7ffffffff 64bit pref]
 pci 0000:03:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff 64bit pref]
 pci 0000:03:00.0: enabling Extended Tags
 pci 0000:03:00.0: Enabling HDA controller
 pci 0000:03:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x8 link at 0000:00:01.0 (capable of 126.016 Gb/s with 8 GT/s x16 link)
 pnp 00:00: disabling [mem 0xfed40000-0xfed44fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 system 00:03: [io  0x0454-0x0457] has been reserved
 system 00:03: Plug and Play ACPI device, IDs INT3f0d PNP0c02 (active)
 pnp 00:06: disabling [mem 0xfed1c000-0xfed1ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xfed10000-0xfed17fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xfed18000-0xfed18fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xfed19000-0xfed19fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xf8000000-0xfbffffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xfed20000-0xfed3ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xfed90000-0xfed93fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xfed45000-0xfed8ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xff000000-0xffffffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xfee00000-0xfeefffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pnp 00:06: disabling [mem 0xf2000000-0xf2000fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
 pci 0000:03:00.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
 pci 0000:03:00.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
 pci 0000:03:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
 pci 0000:03:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
 pci 0000:03:00.0: BAR 0: assigned [mem 0xf2000000-0xf2ffffff]
 pci_bus 0000:03: resource 1 [mem 0xf2000000-0xf2ffffff]
 nvidia 0000:03:00.0: enabling device (0000 -> 0002)
 [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 0
 NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
 NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
 NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
 NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
 NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
 NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
 NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
 NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
 NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
 NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
 NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
 NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
 NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
 NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
 NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
 NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
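For context on why the BARs fail to assign: the log above reports BAR 1 as [mem 0x00000000-0x7ffffffff 64bit pref], i.e. a 0x800000000-byte aperture. A quick bit of shell arithmetic (nothing hardware-specific, just converting the hex size) shows how much address space each GPU wants:

```shell
# BAR 1 size from the "failed to assign [mem size 0x800000000 64bit pref]" line
bar1=$((0x800000000))

# Convert to GiB: 34359738368 bytes / 2^30
echo "$((bar1 / (1024 * 1024 * 1024))) GiB per GPU"   # → 32 GiB per GPU

# The PH402 carries two P100 chips, so the board must map twice that
echo "$((2 * bar1 / (1024 * 1024 * 1024))) GiB total" # → 64 GiB total
```

A 32-bit address space tops out at 4 GiB, which is why these apertures can only be placed above 4G, and why a board without that capability has nowhere to put them.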

Here is another bit of log that may be relevant, from nvidia-persistenced (03 and 04 are the P100 chips, 05 is the GT 640):

systemd[1]: Starting NVIDIA Persistence Daemon...
nvidia-persistenced[1111]: Verbose syslog connection opened
nvidia-persistenced[1111]: Started (1111)
nvidia-persistenced[1111]: device 0000:03:00.0 - registered
nvidia-persistenced[1111]: device 0000:03:00.0 - failed to open.
nvidia-persistenced[1111]: device 0000:04:00.0 - registered
nvidia-persistenced[1111]: device 0000:04:00.0 - failed to open.
nvidia-persistenced[1111]: device 0000:05:00.0 - registered
nvidia-persistenced[1111]: device 0000:05:00.0 - persistence mode enabled.
nvidia-persistenced[1111]: device 0000:05:00.0 - NUMA memory onlined.
nvidia-persistenced[1111]: Local RPC services initialized
systemd[1]: Started NVIDIA Persistence Daemon.

Your system BIOS isn’t assigning device memory in a way that allows two 32 GB graphics apertures to fit in the physical address space. These GPUs are designed to go in special server-class motherboards that are tested to work in this configuration, so I think you may just be out of luck trying to use them in normal desktop or workstation systems.

Please try enabling Above 4G Decoding in the BIOS, disabling CSM, and reinstalling the OS in EFI mode. Maybe you’ll get lucky; otherwise, what aplattner said.
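After flipping that BIOS setting, one way to tell whether the BARs were finally placed is to grep the kernel log for the failure lines quoted above. A minimal sketch, run here against a saved copy of the log so it is self-contained (on a live system you would pipe `dmesg` directly into the same grep):

```shell
# Sample lines as they appear in the kernel log above
log='pci 0000:03:00.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
pci 0000:03:00.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
pci 0000:03:00.0: BAR 0: assigned [mem 0xf2000000-0xf2ffffff]'

# Count remaining failures; any non-zero count means a BAR still
# does not fit in the physical address space
printf '%s\n' "$log" | grep -cE 'BAR [0-9]+: (no space|failed to assign)'
# → 2
```

On a fixed system, `dmesg | grep -cE 'BAR [0-9]+: (no space|failed to assign)'` should print 0.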

FYI, I tried booting with pci=realloc and it hangs the boot process.
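For anyone who wants to try the same experiment: on Ubuntu or CentOS the parameter is typically added to the kernel command line through GRUB (paths below are the usual defaults on those distros; adjust for your setup). It did not help on my Z77 board, but it is easy to revert:

```shell
# In /etc/default/grub, append the parameter to the existing line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"

# Then regenerate the GRUB config and reboot:
#   Ubuntu:  sudo update-grub
#   CentOS:  sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```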

I also looked for something along the lines of “Above 4G Decoding” or other memory-related settings, but didn’t find anything in the BIOS.

I guess I have to look for a server that fits GPUs!

At the moment I have two R820s available; maybe I can make it work for testing with a PCIe riser outside the case (and a fan attached to the card, of course).

Thanks for the quick replies!

I got it to work!

Enabling Above 4G Decoding on the Asus TUF Gaming X570-Plus fixed it; no luck on the old board, though.

With a 3D printed fan funnel and an 80mm fan I get very decent cooling at the moment.

For future reference, the first error message I had was “No devices detected” when running nvidia-smi, even though I could see the chips in lspci.

Now, with Above 4G Decoding supported and enabled on the board and CSM disabled, it works fine and I can use my dual PH402 SKU 200 32GB Tesla card in a desktop.
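As a side note, since disabling CSM matters here: a standard way on Linux to confirm the OS really booted through UEFI (and not a CSM/legacy path) is to check for the efi directory in sysfs. This is a generic kernel facility, not anything driver-specific:

```shell
# The kernel only creates this directory when it was booted via UEFI firmware
if [ -d /sys/firmware/efi ]; then
    echo "Booted in UEFI mode"
else
    echo "Booted in legacy BIOS/CSM mode"
fi
```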

I also had to modify the secondary EPS connector that came with the power supply so it fits in the narrow clamp of the Tesla card.

Hopefully this information will help someone in the future.