Could you help me solve a driver problem with an H100 PCIe GPU on Linux?
nvidia-smi always fails with the following error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Host configuration:
Debian 11, kernel 5.15.108-1-pve
Proxmox VE 7.4-16
Motherboard G242-Z10 (rev. 100), BIOS version: M10
AMD EPYC 7763 64-Core Processor
GPU H100 PCIe
cat /etc/modules
knem
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
cat /etc/modprobe.d/pve-blacklist.conf
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
options vfio-pci ids=10de:2331 disable_vga=1
cat /etc/modprobe.d/iommu_unsafe_interrupts.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1
cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"
lspci -nnk
81:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:2331] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1626]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
dmesg -T | grep 000:81:00
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: [10de:2331] type 00 class 0x030200
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x10: [mem 0x28042000000-0x28042ffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x18: [mem 0x24000000000-0x25fffffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x20: [mem 0x28040000000-0x28041ffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: Enabling HDA controller
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x274: [mem 0xf2000000-0xf203ffff]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: VF(n) BAR0 space: [mem 0xf2000000-0xf27fffff] (contains BAR0 for 32 VFs)
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x278: [mem 0x26000000000-0x260ffffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: VF(n) BAR1 space: [mem 0x26000000000-0x27fffffffff 64bit pref] (contains BAR1 for 32 VFs)
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x280: [mem 0x28000000000-0x28001ffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: VF(n) BAR3 space: [mem 0x28000000000-0x2803fffffff 64bit pref] (contains BAR3 for 32 VFs)
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:80:01.1 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: Adding to iommu group 55
[Wed Jul 26 07:54:59 2023] vfio-pci 0000:81:00.0: enabling device (0000 -> 0002)
dmesg | grep -e DMAR -e IOMMU -e remapping
[ 0.000000] Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
[ 2.402392] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.402409] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.402422] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.402431] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.404796] pci 0000:c0:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.404801] pci 0000:80:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.404804] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.404807] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.404810] AMD-Vi: Interrupt remapping enabled
[ 2.405884] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[ 2.405896] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[ 2.405903] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[ 2.405909] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
VM configuration:
Machine: q35
BIOS: OVMF (UEFI)
OS: Ubuntu 20.04/22.04, kernel 5.15.0-78-generic
cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="nouveau.modeset=0 pci=realloc"
I tried installing each of the following drivers in the VM and get the same error every time:
NVIDIA-Linux-x86_64-520.61.05.run
NVIDIA-Linux-x86_64-525.125.06.run
NVIDIA-Linux-x86_64-535.54.03.run
NVIDIA-Linux-x86_64-535.86.05.run
apt install nvidia-driver-535 (535.54.03)
apt install nvidia-driver-535 (535.86.05)
apt install nvidia-driver-525 (525.125.06)
Error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Jul 26 16:12:49 vosk86 kernel: [ 0.620464] pci 0000:01:00.0: [10de:2331] type 00 class 0x030200
Jul 26 16:12:49 vosk86 kernel: [ 0.620633] pci 0000:01:00.0: reg 0x10: [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.620679] pci 0000:01:00.0: reg 0x18: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.620724] pci 0000:01:00.0: reg 0x20: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.620822] pci 0000:01:00.0: Enabling HDA controller
Jul 26 16:12:49 vosk86 kernel: [ 0.621411] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:1c.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
Jul 26 16:12:49 vosk86 kernel: [ 0.813739] pci 0000:01:00.0: can't claim BAR 0 [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:12:49 vosk86 kernel: [ 0.813742] pci 0000:01:00.0: can't claim BAR 2 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:12:49 vosk86 kernel: [ 0.813744] pci 0000:01:00.0: can't claim BAR 4 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:12:49 vosk86 kernel: [ 0.838078] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.838081] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.838084] pci 0000:01:00.0: BAR 4: assigned [mem 0xc000000000-0xc001ffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.838150] pci 0000:01:00.0: BAR 0: assigned [mem 0xc002000000-0xc002ffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 1.774305] nouveau 0000:01:00.0: enabling device (0000 -> 0002)
Jul 26 16:12:49 vosk86 kernel: [ 1.776603] nouveau 0000:01:00.0: unknown chipset (ffffffff)
Jul 26 16:12:49 vosk86 kernel: [ 1.776608] nouveau: probe of 0000:01:00.0 failed with error -12
Jul 26 16:27:00 vosk86 kernel: [ 0.556820] pci 0000:01:00.0: [10de:2331] type 00 class 0x030200
Jul 26 16:27:00 vosk86 kernel: [ 0.556944] pci 0000:01:00.0: reg 0x10: [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.556991] pci 0000:01:00.0: reg 0x18: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.557037] pci 0000:01:00.0: reg 0x20: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.557159] pci 0000:01:00.0: Enabling HDA controller
Jul 26 16:27:00 vosk86 kernel: [ 0.557762] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:1c.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
Jul 26 16:27:00 vosk86 kernel: [ 0.763242] pci 0000:01:00.0: can't claim BAR 0 [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:27:00 vosk86 kernel: [ 0.763247] pci 0000:01:00.0: can't claim BAR 2 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:27:00 vosk86 kernel: [ 0.763250] pci 0000:01:00.0: can't claim BAR 4 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:27:00 vosk86 kernel: [ 0.790043] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.790048] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.790050] pci 0000:01:00.0: BAR 4: assigned [mem 0xc000000000-0xc001ffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.790113] pci 0000:01:00.0: BAR 0: assigned [mem 0xc002000000-0xc002ffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 4.214790] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
Jul 26 16:27:00 vosk86 kernel: [ 4.217328] NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:27:00 vosk86 kernel: [ 4.217332] NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:27:00 vosk86 kernel: [ 4.217675] NVRM: The NVIDIA GPU 0000:01:00.0
Jul 26 16:27:00 vosk86 kernel: [ 4.218145] nvidia: probe of 0000:01:00.0 failed with error -1
Jul 26 16:27:00 vosk86 kernel: [ 4.687366] NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:27:00 vosk86 kernel: [ 4.689802] NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:27:00 vosk86 kernel: [ 4.690253] NVRM: The NVIDIA GPU 0000:01:00.0
Jul 26 16:27:00 vosk86 kernel: [ 4.690787] nvidia: probe of 0000:01:00.0 failed with error -1
Jul 26 16:30:49 vosk86 kernel: [ 0.566179] pci 0000:01:00.0: [10de:2331] type 00 class 0x030200
Jul 26 16:30:49 vosk86 kernel: [ 0.566303] pci 0000:01:00.0: reg 0x10: [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.566465] pci 0000:01:00.0: reg 0x18: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.566512] pci 0000:01:00.0: reg 0x20: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.566641] pci 0000:01:00.0: Enabling HDA controller
Jul 26 16:30:49 vosk86 kernel: [ 0.567245] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:1c.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
Jul 26 16:30:49 vosk86 kernel: [ 0.760403] pci 0000:01:00.0: can't claim BAR 0 [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:30:49 vosk86 kernel: [ 0.760406] pci 0000:01:00.0: can't claim BAR 2 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:30:49 vosk86 kernel: [ 0.760409] pci 0000:01:00.0: can't claim BAR 4 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:30:49 vosk86 kernel: [ 0.784161] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.784165] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.784168] pci 0000:01:00.0: BAR 4: assigned [mem 0xc000000000-0xc001ffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.784238] pci 0000:01:00.0: BAR 0: assigned [mem 0xc002000000-0xc002ffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811045] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811048] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811051] pci 0000:01:00.0: BAR 4: no space for [mem size 0x02000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811053] pci 0000:01:00.0: BAR 4: failed to assign [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811055] pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811057] pci 0000:01:00.0: BAR 0: failed to assign [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842752] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842755] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842757] pci 0000:01:00.0: BAR 4: no space for [mem size 0x02000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842760] pci 0000:01:00.0: BAR 4: failed to assign [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842762] pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842764] pci 0000:01:00.0: BAR 0: failed to assign [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 4.257212] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:30:49 vosk86 kernel: [ 4.260728] nvidia: probe of 0000:01:00.0 failed with error -1
Jul 26 16:30:49 vosk86 kernel: [ 4.779756] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:30:49 vosk86 kernel: [ 4.782376] nvidia: probe of 0000:01:00.0 failed with error -1
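For reference, the sizes of the three BARs can be worked out from the host-side address ranges in the host `dmesg` output (reg 0x10, 0x18 and 0x20). This is only a rough sketch using plain shell arithmetic; `bar_size_mib` is a helper name made up for this post, not an existing tool. BAR 2 comes out at 131072 MiB (128 GiB), which is the same `mem size 0x2000000000` that the guest kernel reports it has no space for.

```shell
# Hypothetical helper: size in MiB of a BAR, given its first and last
# address (inclusive) as 0x-prefixed hex, copied from the host dmesg lines.
bar_size_mib() {
    echo $(( ($2 - $1 + 1) >> 20 ))
}

bar_size_mib 0x28042000000 0x28042ffffff   # BAR 0 (reg 0x10) -> 16
bar_size_mib 0x24000000000 0x25fffffffff   # BAR 2 (reg 0x18) -> 131072
bar_size_mib 0x28040000000 0x28041ffffff   # BAR 4 (reg 0x20) -> 32
```

The 16 MiB and 32 MiB results also agree with the `mem size 0x01000000` and `mem size 0x02000000` values in the guest log for BAR 0 and BAR 4.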