First of all, here is my nvidia-bug-report log:
I installed my Tesla PH402 dual P100 card in a desktop I had around, and I can't get it recognized by nvidia-smi or deviceDetect.
I tried Ubuntu 20.04 and CentOS 8 with the latest drivers (460, DKMS). With only the Tesla card installed, nvidia-smi says no device is detected;
with a second GPU (GT 640) also installed, nvidia-smi sees only the GT 640.
I ran dmesg | grep ':03:' and got some interesting output, which shows that:
- The device is detected
- Memory is disabled?
- There seems to be a problem assigning address space for the card's memory (BAR 1 and BAR 3 fail to be assigned)
- RmInitAdapter failed errors, maybe due to the memory assignment errors?
I guess there is either a kernel option I am missing or a hardware compatibility issue, maybe with the motherboard, which is a little old (Gigabyte Z77X-UD3H).
I also tried a more recent desktop, but my Asus TUF Gaming X570 Plus refuses to boot at all; it doesn't even POST with the card plugged in. I tried updating the BIOS, to no avail.
I have two of those dual P100 cards and both show the exact same behavior.
pci 0000:03:00.0: [10de:15fa] type 00 class 0x030200
pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
pci 0000:03:00.0: reg 0x14: [mem 0x00000000-0x7ffffffff 64bit pref]
pci 0000:03:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff 64bit pref]
pci 0000:03:00.0: enabling Extended Tags
pci 0000:03:00.0: Enabling HDA controller
pci 0000:03:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x8 link at 0000:00:01.0 (capable of 126.016 Gb/s with 8 GT/s x16 link)
pnp 00:00: disabling [mem 0xfed40000-0xfed44fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
system 00:03: [io 0x0454-0x0457] has been reserved
system 00:03: Plug and Play ACPI device, IDs INT3f0d PNP0c02 (active)
pnp 00:06: disabling [mem 0xfed1c000-0xfed1ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xfed10000-0xfed17fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xfed18000-0xfed18fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xfed19000-0xfed19fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xf8000000-0xfbffffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xfed20000-0xfed3ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xfed90000-0xfed93fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xfed45000-0xfed8ffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xff000000-0xffffffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xfee00000-0xfeefffff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pnp 00:06: disabling [mem 0xf2000000-0xf2000fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x7ffffffff 64bit pref]
pci 0000:03:00.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
pci 0000:03:00.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
pci 0000:03:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
pci 0000:03:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
pci 0000:03:00.0: BAR 0: assigned [mem 0xf2000000-0xf2ffffff]
pci_bus 0000:03: resource 1 [mem 0xf2000000-0xf2ffffff]
nvidia 0000:03:00.0: enabling device (0000 -> 0002)
[drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 0
NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:624)
NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
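The BAR sizes in the log above are easier to judge once converted to readable units. Here is a minimal Python sketch that does this, assuming the dmesg lines keep the format shown above. The names failed_bars and human, and the shortened sample text, are made up for illustration.

```python
import re

# Matches kernel lines such as:
# pci 0000:03:00.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
BAR_FAIL = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): failed to assign "
    r"\[mem size 0x(?P<size>[0-9a-f]+)"
)

def failed_bars(dmesg_text):
    """Return (device, BAR index, size in bytes) for each failed BAR assignment."""
    return [
        (m["dev"], int(m["bar"]), int(m["size"], 16))
        for m in BAR_FAIL.finditer(dmesg_text)
    ]

def human(size):
    """Format a byte count as whole GiB, or whole MiB below 1 GiB."""
    gib = 1 << 30
    mib = 1 << 20
    return f"{size // gib} GiB" if size >= gib else f"{size // mib} MiB"

# Shortened excerpt of the dmesg output (hypothetical variable name).
sample = """\
pci 0000:03:00.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
pci 0000:03:00.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
pci 0000:03:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
"""

for dev, bar, size in failed_bars(sample):
    print(f"{dev} BAR {bar}: {human(size)} could not be mapped")
```

So BAR 1 asks for 32 GiB and BAR 3 for 32 MiB. A 32 GiB region cannot fit below the 4 GiB boundary, so it can only be placed if the firmware and kernel provide a 64-bit MMIO window. Whether this board can do that (often exposed as an "Above 4G Decoding" BIOS setting) is an assumption I can't confirm from the log.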
Here is the other bit of log that may be relevant, from nvidia-persistenced (03 and 04 are the P100 chips and 05 is the GT 640):
systemd[1]: Starting NVIDIA Persistence Daemon...
nvidia-persistenced[1111]: Verbose syslog connection opened
nvidia-persistenced[1111]: Started (1111)
nvidia-persistenced[1111]: device 0000:03:00.0 - registered
nvidia-persistenced[1111]: device 0000:03:00.0 - failed to open.
nvidia-persistenced[1111]: device 0000:04:00.0 - registered
nvidia-persistenced[1111]: device 0000:04:00.0 - failed to open.
nvidia-persistenced[1111]: device 0000:05:00.0 - registered
nvidia-persistenced[1111]: device 0000:05:00.0 - persistence mode enabled.
nvidia-persistenced[1111]: device 0000:05:00.0 - NUMA memory onlined.
nvidia-persistenced[1111]: Local RPC services initialized
systemd[1]: Started NVIDIA Persistence Daemon.
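To see at a glance which devices the daemon could not open, the log above can be summarized per PCI address. This is a rough Python sketch that assumes the "device <address> - <status>" line format shown above; device_status and the sample excerpt are illustrative names, not part of any NVIDIA tool.

```python
import re

# Matches lines such as:
# nvidia-persistenced[1111]: device 0000:03:00.0 - failed to open.
DEVICE_LINE = re.compile(
    r"device (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])"
    r" - (?P<status>.+?)\.?$"
)

def device_status(log_text):
    """Map each PCI address to the list of statuses logged for it, in order."""
    status = {}
    for line in log_text.splitlines():
        m = DEVICE_LINE.search(line)
        if m:
            status.setdefault(m["dev"], []).append(m["status"])
    return status

# Shortened excerpt of the nvidia-persistenced log (hypothetical variable name).
sample = """\
nvidia-persistenced[1111]: device 0000:03:00.0 - registered
nvidia-persistenced[1111]: device 0000:03:00.0 - failed to open.
nvidia-persistenced[1111]: device 0000:04:00.0 - registered
nvidia-persistenced[1111]: device 0000:04:00.0 - failed to open.
nvidia-persistenced[1111]: device 0000:05:00.0 - registered
nvidia-persistenced[1111]: device 0000:05:00.0 - persistence mode enabled.
"""

status = device_status(sample)
failed = [dev for dev, seen in status.items() if "failed to open" in seen]
print("failed to open:", ", ".join(failed))
```

With the excerpt above, the two P100 chips (03 and 04) are the ones that fail to open, while the GT 640 (05) comes up normally.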