Tesla K80 detected on OpenSuse 15.5, but nvidia-smi couldn't communicate with the NVIDIA driver

I understand that I am not running the Tesla K80 in its preferred environment, but I have seen many regular consumers use this card successfully, so I tried it myself.
While the drivers seem to have been installed successfully, they are not being loaded; at least that is how I interpret this output:

> lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
    Subsystem: ASUSTeK Computer Inc. Device 8534
    Kernel driver in use: i915
--
03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
    Subsystem: NVIDIA Corporation Device 106c
    Kernel modules: nouveau, nvidia_drm, nvidia
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
    Subsystem: NVIDIA Corporation Device 106c
    Kernel modules: nouveau, nvidia_drm, nvidia
> nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
> cat /etc/modprobe.d/50-blacklist.conf
blacklist nouveau
  • I am doing everything through tty and SSH, so there is no Xorg.
  • Above 4G Decoding enabled, Secure Boot disabled.
  • Cooling:
    3D-printed holder for two 40 mm Noctua Delta fans running at 9000 rpm. (Yes, it sounds like an aircraft.)
  • Installed with this: download/driverResults.aspx/200643/en-us/ (appended to NVIDIA's URL)

I looked around a bit in the bug report file, and I think the rest of the relevant information can be found there.
nvidia-bug-report.log.gz (224.9 KB)

I have only one idea left about what the problem could be, but testing it would require a new purchase, so I want to rule out any other options first.
I calculated that my setup without the GPU draws 100-160 W at full load (I am not overclocking). I have probably never actually run it at full load, only for my little experiments and Nextcloud. The GPU needs 300 W, so I thought a 500 W PSU would be enough. Since the PSU only had one CPU cable, I bought an adapter for two PCIe cables. Ironically, the two PCIe cables are split from the same cable and thus might still be limited to 150 W. If that is the case, is there a way to disable the second GPU on the card, so I can use the first one until I get a new PSU?
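For reference, my back-of-the-envelope budget as shell arithmetic; the 75 W slot allowance and the 150 W rating per 8-pin PCIe plug are the usual nominal figures, not something I measured:

```shell
# Assumed nominal figures: 300 W K80 TDP, 75 W from the PCIe slot,
# 150 W rating per 8-pin PCIe plug, 160 W worst-case load without the GPU.
base=160; k80=300; psu=500; slot=75; plug=150

echo "PSU headroom:       $(( psu - base - k80 )) W"   # 40 W, tight but positive
echo "needed via 8-pin:   $(( k80 - slot )) W"         # 225 W
echo "one cable delivers: $(( plug )) W"               # the possible bottleneck
```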

There are PCI resource conflicts:

[ 0.245237] pci 0000:00:01.0: BAR 15: no space for [mem size 0xc00000000 64bit pref]
[ 0.245240] pci 0000:00:01.0: BAR 15: failed to assign [mem size 0xc00000000 64bit pref]
[ 0.245242] pci 0000:01:00.0: BAR 15: no space for [mem size 0xc00000000 64bit pref]
[ 0.245243] pci 0000:01:00.0: BAR 15: failed to assign [mem size 0xc00000000 64bit pref]
[ 0.245245] pci 0000:01:00.0: BAR 14: no space for [mem size 0x02000000]
[ 0.245245] pci 0000:01:00.0: BAR 14: failed to assign [mem size 0x02000000]
[ 0.245247] pci 0000:02:08.0: BAR 15: no space for [mem size 0x600000000 64bit pref]
[ 0.245248] pci 0000:02:08.0: BAR 15: failed to assign [mem size 0x600000000 64bit pref]
[ 0.245249] pci 0000:02:10.0: BAR 15: no space for [mem size 0x600000000 64bit pref]
[ 0.245250] pci 0000:02:10.0: BAR 15: failed to assign [mem size 0x600000000 64bit pref]
[ 0.245251] pci 0000:02:08.0: BAR 14: no space for [mem size 0x01000000]
[ 0.245252] pci 0000:02:08.0: BAR 14: failed to assign [mem size 0x01000000]
[ 0.245253] pci 0000:02:10.0: BAR 14: no space for [mem size 0x01000000]
[ 0.245254] pci 0000:02:10.0: BAR 14: failed to assign [mem size 0x01000000]
[ 0.245255] pci 0000:03:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[ 0.245256] pci 0000:03:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
[ 0.245257] pci 0000:03:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[ 0.245258] pci 0000:03:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[ 0.245259] pci 0000:03:00.0: BAR 0: no space for [mem size 0x01000000]
[ 0.245260] pci 0000:03:00.0: BAR 0: failed to assign [mem size 0x01000000]
[ 0.245261] pci 0000:02:08.0: PCI bridge to [bus 03]
[ 0.245269] pci 0000:04:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[ 0.245270] pci 0000:04:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
[ 0.245272] pci 0000:04:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[ 0.245273] pci 0000:04:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[ 0.245274] pci 0000:04:00.0: BAR 0: no space for [mem size 0x01000000]
[ 0.245275] pci 0000:04:00.0: BAR 0: failed to assign [mem size 0x01000000]
[ 0.245276] pci 0000:02:10.0: PCI bridge to [bus 04]
[ 0.245282] pci 0000:01:00.0: PCI bridge to [bus 02-04]
[ 0.245288] pci 0000:00:01.0: PCI bridge to [bus 01-04]
[ 0.245290] pci 0000:00:01.0: bridge window [mem 0xf7d00000-0xf7dfffff]
[ 0.245294] pci 0000:00:1c.0: PCI bridge to [bus 05]
[ 0.245302] pci 0000:00:1c.2: PCI bridge to [bus 06]
[ 0.245303] pci 0000:00:1c.2: bridge window [io 0xe000-0xefff]
[ 0.245306] pci 0000:00:1c.2: bridge window [mem 0xf7c00000-0xf7cfffff]
[ 0.245309] pci 0000:00:1c.2: bridge window [mem 0xf0000000-0xf00fffff 64bit pref]
[ 0.245313] pci 0000:07:00.0: PCI bridge to [bus 08]
[ 0.245328] pci 0000:00:1c.3: PCI bridge to [bus 07-08]
[ 0.245336] pci_bus 0000:00: Some PCI device resources are unassigned, try booting with pci=realloc
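To put those failed windows in perspective (hex sizes copied from the log, converted with plain shell arithmetic):

```shell
# Sizes from the failed assignments above, converted to GiB:
gib=$(( 1024 * 1024 * 1024 ))
echo "BAR 1 per GPU:     $(( 0x400000000 / gib )) GiB"   # 16 GiB each
echo "bridge at 02:xx:   $(( 0x600000000 / gib )) GiB"   # 24 GiB per bridge
echo "window at 00:01.0: $(( 0xc00000000 / gib )) GiB"   # 48 GiB in total
```

That is far more than can ever fit below the 4 GiB boundary, which is why Above 4G decoding must actually take effect.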

Give it a shot and do what the log says: boot with the kernel parameter pci=realloc.
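On openSUSE the usual way is to add the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and regenerate grub.cfg. A sketch, shown here against a scratch copy so nothing real is touched (the sed pattern assumes the default quoted-value layout; check your file first):

```shell
# Demonstrated on a scratch copy of /etc/default/grub:
f=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="splash=silent quiet"' > "$f"

# Prepend pci=realloc inside the existing quoted value:
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&pci=realloc /' "$f"
cat "$f"   # GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc splash=silent quiet"

# On the real system, apply the same edit to /etc/default/grub, then:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
# and reboot.
```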

Also look for a BIOS update:

BIOS Information
Vendor: American Megatrends Inc.
Version: 2902
Release Date: 03/31/2016

is very old.

I tried pci=realloc and pci=realloc=off, but I could not detect any changes.
logs →

ACPI Warning: SystemIO range 0x0000000000001828-0x000000000000182F conflicts with OpRegion 0x0000000000001800-0x000000000000187F (\PMIO) (20210730/utaddress-213)
[	3.904360] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[	3.905864] ACPI Warning: SystemIO range 0x0000000000001C40-0x0000000000001C4F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C7F (\_GPE.GPBX) (20210730/utaddress-213)
[	3.907424] ACPI Warning: SystemIO range 0x0000000000001C40-0x0000000000001C4F conflicts with OpRegion 0x0000000000001C00-0x0000000000001FFF (\GPR) (20210730/utaddress-213)
[	3.908986] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[	3.910532] ACPI Warning: SystemIO range 0x0000000000001C30-0x0000000000001C3F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C7F (\_GPE.GPBX) (20210730/utaddress-213)
[	3.910644] input: PC Speaker as /devices/platform/pcspkr/input/input8
[	3.912148] ACPI Warning: SystemIO range 0x0000000000001C30-0x0000000000001C3F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C3F (\GPRL) (20210730/utaddress-213)
[	3.912153] ACPI Warning: SystemIO range 0x0000000000001C30-0x0000000000001C3F conflicts with OpRegion 0x0000000000001C00-0x0000000000001FFF (\GPR) (20210730/utaddress-213)
[	3.912156] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[	3.912158] ACPI Warning: SystemIO range 0x0000000000001C00-0x0000000000001C2F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C7F (\_GPE.GPBX) (20210730/utaddress-213)
[	3.912162] ACPI Warning: SystemIO range 0x0000000000001C00-0x0000000000001C2F conflicts with OpRegion 0x0000000000001C00-0x0000000000001C3F (\GPRL) (20210730/utaddress-213)
[	3.912165] ACPI Warning: SystemIO range 0x0000000000001C00-0x0000000000001C2F conflicts with OpRegion 0x0000000000001C00-0x0000000000001FFF (\GPR) (20210730/utaddress-213)

On the Linux mailing list they said:

Unless you need to use anything on SMBus (hardware sensors, essentially)
you don’t have to worry about that one. It means that the kernel has
detected that the BIOS may potentially access the SMBus controller which
may conflict with usage of the controller from within the OS.

Could this be a problem? The K80 uses temperature sensors and can shut down above 95 °C, so maybe it shuts down completely if the sensors can't be read?

logs →
nvidia: module verification failed: signature and/or required key missing - tainting kernel

~> mokutil --sb-state
SecureBoot disabled
Platform is in Setup Mode

logs →

2023-06-05T22:28:44.104044+02:00 localhost kernel: [   46.951725][T13740] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
2023-06-05T22:28:44.104045+02:00 localhost kernel: [   46.951729][T13740] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2023-06-05T22:28:44.104046+02:00 localhost kernel: [   46.951729][T13740] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:03:00.0)
2023-06-05T22:28:44.104046+02:00 localhost kernel: [   46.952168][T13740] NVRM: The system BIOS may have misconfigured your GPU.
2023-06-05T22:28:44.104048+02:00 localhost kernel: [   46.952218][T13740] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2023-06-05T22:28:44.104048+02:00 localhost kernel: [   46.952218][T13740] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:04:00.0)
2023-06-05T22:28:44.104049+02:00 localhost kernel: [   46.952220][T13740] NVRM: The system BIOS may have misconfigured your GPU.
2023-06-05T22:28:44.104050+02:00 localhost kernel: [   46.952237][T13740] NVRM: The NVIDIA probe routine failed for 2 device(s).
2023-06-05T22:28:44.104050+02:00 localhost kernel: [   46.952239][T13740] NVRM: None of the NVIDIA devices were initialized.
2023-06-05T22:28:44.104050+02:00 localhost kernel: [   46.952395][T13740] nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
2023-06-05T22:28:52.052043+02:00 localhost kernel: [   54.898702][T20243] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
2023-06-05T22:28:52.052044+02:00 localhost kernel: [   54.898705][T20243] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2023-06-05T22:28:52.052044+02:00 localhost kernel: [   54.898705][T20243] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:03:00.0)
2023-06-05T22:28:52.052045+02:00 localhost kernel: [   54.899029][T20243] NVRM: The system BIOS may have misconfigured your GPU.
2023-06-05T22:28:52.052047+02:00 localhost kernel: [   54.899042][T20243] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2023-06-05T22:28:52.052047+02:00 localhost kernel: [   54.899042][T20243] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:04:00.0)
2023-06-05T22:28:52.052047+02:00 localhost kernel: [   54.899044][T20243] NVRM: The system BIOS may have misconfigured your GPU.
2023-06-05T22:28:52.052048+02:00 localhost kernel: [   54.899056][T20243] NVRM: The NVIDIA probe routine failed for 2 device(s).
2023-06-05T22:28:52.052048+02:00 localhost kernel: [   54.899057][T20243] NVRM: None of the NVIDIA devices were initialized.
2023-06-05T22:28:52.052049+02:00 localhost kernel: [   54.899180][T20243] nvidia-nvlink: Unregistered the Nvlink Core, major device number 238

This sounds like a similar issue to this one:

/tesla-k80-installation-issue/110336

My BIOS sadly can't modify the MMIOBase, so I installed a UEFI shell on a USB stick and ran some commands.

Here is something from memmap:

Type   	Start        	End          	# Pages      	Attributes
MMIO   	00000000F8000000-00000000FBFFFFFF 0000000000004000 8000000000000001
MMIO   	00000000FEC00000-00000000FEC00FFF 0000000000000001 8000000000000001
MMIO   	00000000FED00000-00000000FED03FFF 0000000000000004 8000000000000001
MMIO   	00000000FED1C000-00000000FED1FFFF 0000000000000004 8000000000000001
MMIO   	00000000FEE00000-00000000FEE00FFF 0000000000000001 8000000000000001
MMIO   	00000000FF000000-00000000FFFFFFFF 0000000000001000 8000000000000001

I don't know how, but maybe it is possible to check whether my MMIOBase is under 42 bits. The start values just seem quite large to me.
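As a rough sanity check (start addresses copied from the memmap output above; the 42-bit figure is the limit I was wondering about):

```shell
# Every MMIO range memmap listed starts below 4 GiB:
start=$(( 0xF8000000 ))      # lowest MMIO start from the table above
four_gib=$(( 1 << 32 ))
limit_42bit=$(( 1 << 42 ))   # 4 TiB

[ "$start" -lt "$four_gib" ]    && echo "MMIO sits below 4 GiB"
[ "$start" -lt "$limit_42bit" ] && echo "well under the 42-bit limit"
```

So the values are not "massive" at all in 64-bit terms; the firmware simply reserves no window above the 32-bit boundary.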

ChatGPT thought that dmpstore would be useful, but I could not get any information out of it.
dmpstore2.txt (127.2 KB)

Here is also the output of the pci command:
pci.txt (5.5 KB)

logs →

/sbin/lspci -d "10de:*" -v -xxx

03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation Device 106c
	Flags: fast devsel, IRQ 16
	Memory at <unassigned> (64-bit, prefetchable) [disabled]
	Memory at <unassigned> (64-bit, prefetchable) [disabled]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Kernel modules: nouveau, nvidia_drm, nvidia

The problem really seems to be the memory assignment.

After looking through dmesg I found this:

[    3.369796] raid6: avx2x4   gen() 17284 MB/s
[    3.418833] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[    3.418847] NVRM: request_mem_region failed for 0M @ 0x0. This can
               NVRM: occur when a driver such as rivatv is loaded and claims
               NVRM: ownership of the device's registers.
[    3.420917] nvidia: probe of 0000:03:00.0 failed with error -1
[    3.420967] NVRM: request_mem_region failed for 0M @ 0x0. This can
               NVRM: occur when a driver such as rivatv is loaded and claims
               NVRM: ownership of the device's registers.
[    3.420969] nvidia: probe of 0000:04:00.0 failed with error -1

I also found what I think is the thing that claims the memory:

[    0.193803] pnp 00:00: disabling [mem 0xfed40000-0xfed44fff] because it overlaps 0000:03:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
[    0.193803] pnp 00:00: disabling [mem 0xfed40000-0xfed44fff disabled] because it overlaps 0000:04:00.0 BAR 1 [mem 0x00000000-0x3ffffffff 64bit pref]
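A quick check that those ranges really do collide (start and end values taken from the two pnp lines above; the test itself is just interval arithmetic):

```shell
# PNP region vs. the unassigned BAR 1 range:
a_start=$(( 0xfed40000 )); a_end=$(( 0xfed44fff ))   # pnp 00:00 region
b_start=$(( 0x0 ));        b_end=$(( 0x3ffffffff )) # BAR 1: 16 GiB starting at address 0

# Two intervals overlap iff each one starts before the other ends:
if [ "$a_start" -le "$b_end" ] && [ "$b_start" -le "$a_end" ]; then
    echo "overlap"
fi
```

The overlap only exists because BAR 1 was never assigned and still sits at address 0, where it covers the whole low 16 GiB.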

I think I will open a new issue, because memory assignment is far off from the original title.

I upgraded my memory from 8 GiB to 28 GiB, booted with pci=realloc and pci=nocrs, and ran sudo modprobe nvidia.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.