H100 PCIe, NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Could you help me solve a driver problem with an H100 PCIe GPU on Linux?
nvidia-smi always fails with this error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Host configuration:
Debian 11, 5.15.108-1-pve
Proxmox VE 7.4-16
Motherboard G242-Z10 (rev. 100), BIOS version:M10
AMD EPYC 7763 64-Core Processor
GPU H100 PCIe

cat /etc/modules
knem
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

cat /etc/modprobe.d/pve-blacklist.conf
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
options vfio-pci ids=10de:2331 disable_vga=1

cat /etc/modprobe.d/iommu_unsafe_interrupts.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1

cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"

lspci -nnk
81:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:2331] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1626]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

dmesg -T | grep 000:81:00
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: [10de:2331] type 00 class 0x030200
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x10: [mem 0x28042000000-0x28042ffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x18: [mem 0x24000000000-0x25fffffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x20: [mem 0x28040000000-0x28041ffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: Enabling HDA controller
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x274: [mem 0xf2000000-0xf203ffff]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: VF(n) BAR0 space: [mem 0xf2000000-0xf27fffff] (contains BAR0 for 32 VFs)
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x278: [mem 0x26000000000-0x260ffffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: VF(n) BAR1 space: [mem 0x26000000000-0x27fffffffff 64bit pref] (contains BAR1 for 32 VFs)
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: reg 0x280: [mem 0x28000000000-0x28001ffffff 64bit pref]
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: VF(n) BAR3 space: [mem 0x28000000000-0x2803fffffff 64bit pref] (contains BAR3 for 32 VFs)
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:80:01.1 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[Wed Jul 26 07:46:48 2023] pci 0000:81:00.0: Adding to iommu group 55
[Wed Jul 26 07:54:59 2023] vfio-pci 0000:81:00.0: enabling device (0000 -> 0002)
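For reference, the sizes of the three physical-function BARs can be read straight off the host ranges above. The following is a minimal sketch, not part of the original post; `bar_size` and `host_lines` are hypothetical names, and the input strings are copied from the `reg 0x10`, `reg 0x18` and `reg 0x20` lines of the host dmesg.

```python
import re

# Matches the address range in kernel lines such as
# "reg 0x18: [mem 0x24000000000-0x25fffffffff 64bit pref]".
RANGE_RE = re.compile(r"\[mem 0x([0-9a-f]+)-0x([0-9a-f]+)")


def bar_size(line: str) -> int:
    """Return the size in bytes of the memory range printed in a dmesg line."""
    match = RANGE_RE.search(line)
    if match is None:
        raise ValueError("no [mem start-end] range in line")
    start, end = (int(group, 16) for group in match.groups())
    return end - start + 1  # the printed range is inclusive on both ends


# Ranges copied from the host dmesg output in this post.
host_lines = {
    "BAR 0 (reg 0x10)": "reg 0x10: [mem 0x28042000000-0x28042ffffff 64bit pref]",
    "BAR 2 (reg 0x18)": "reg 0x18: [mem 0x24000000000-0x25fffffffff 64bit pref]",
    "BAR 4 (reg 0x20)": "reg 0x20: [mem 0x28040000000-0x28041ffffff 64bit pref]",
}

for name, line in host_lines.items():
    print(f"{name}: {bar_size(line) // 2**20} MiB")
```

BAR 2 is the large one (0x2000000000 bytes, i.e. 128 GiB), so whatever system the card is attached to needs a 64-bit prefetchable window of at least that size.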

dmesg | grep -e DMAR -e IOMMU -e remapping
[ 0.000000] Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
[ 2.402392] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.402409] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.402422] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.402431] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.404796] pci 0000:c0:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.404801] pci 0000:80:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.404804] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.404807] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.404810] AMD-Vi: Interrupt remapping enabled
[ 2.405884] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[ 2.405896] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[ 2.405903] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[ 2.405909] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).

VM
Machine: q35
BIOS: OVMF (UEFI)
OS: Ubuntu 20.04/22.04, Kernel 5.15.0-78-generic

cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="nouveau.modeset=0 pci=realloc"

I installed the following drivers and still get the error:
NVIDIA-Linux-x86_64-520.61.05.run
NVIDIA-Linux-x86_64-525.125.06.run
NVIDIA-Linux-x86_64-535.54.03.run
NVIDIA-Linux-x86_64-535.86.05.run
apt install nvidia-driver-535 (535.54.03)
apt install nvidia-driver-535 (535.86.05)
apt install nvidia-driver-525 (525.125.06)

ERROR
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Jul 26 16:12:49 vosk86 kernel: [ 0.620464] pci 0000:01:00.0: [10de:2331] type 00 class 0x030200
Jul 26 16:12:49 vosk86 kernel: [ 0.620633] pci 0000:01:00.0: reg 0x10: [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.620679] pci 0000:01:00.0: reg 0x18: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.620724] pci 0000:01:00.0: reg 0x20: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.620822] pci 0000:01:00.0: Enabling HDA controller
Jul 26 16:12:49 vosk86 kernel: [ 0.621411] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:1c.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
Jul 26 16:12:49 vosk86 kernel: [ 0.813739] pci 0000:01:00.0: can't claim BAR 0 [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:12:49 vosk86 kernel: [ 0.813742] pci 0000:01:00.0: can't claim BAR 2 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:12:49 vosk86 kernel: [ 0.813744] pci 0000:01:00.0: can't claim BAR 4 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:12:49 vosk86 kernel: [ 0.838078] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.838081] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.838084] pci 0000:01:00.0: BAR 4: assigned [mem 0xc000000000-0xc001ffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 0.838150] pci 0000:01:00.0: BAR 0: assigned [mem 0xc002000000-0xc002ffffff 64bit pref]
Jul 26 16:12:49 vosk86 kernel: [ 1.774305] nouveau 0000:01:00.0: enabling device (0000 -> 0002)
Jul 26 16:12:49 vosk86 kernel: [ 1.776603] nouveau 0000:01:00.0: unknown chipset (ffffffff)
Jul 26 16:12:49 vosk86 kernel: [ 1.776608] nouveau: probe of 0000:01:00.0 failed with error -12
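As an aside on the "available PCIe bandwidth" line in these logs: the figures follow from the link speed, the lane count and the line encoding. The sketch below is only an illustration; the function names are hypothetical, and the truncating integer arithmetic is an assumption chosen because it reproduces the numbers printed in this thread.

```python
def lane_mbps(gt_per_s: float) -> int:
    """Usable Mb/s per lane after encoding overhead (truncated to an integer)."""
    raw_mbps = int(gt_per_s * 1000)
    if gt_per_s <= 5.0:
        # 2.5 and 5 GT/s links use 8b/10b encoding.
        return raw_mbps * 8 // 10
    # 8 GT/s and faster links use 128b/130b encoding.
    return raw_mbps * 128 // 130


def link_mbps(gt_per_s: float, lanes: int) -> int:
    """Usable Mb/s for a whole link of the given width."""
    return lane_mbps(gt_per_s) * lanes


for speed in (2.5, 16.0, 32.0):
    print(f"{speed} GT/s x16: {link_mbps(speed, 16) / 1000:.3f} Gb/s")
```

The 2.5 GT/s value is the reported link speed of the upstream port, and it appears in the host log as well as in the VM, so on its own it does not explain the driver failure.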

Jul 26 16:27:00 vosk86 kernel: [ 0.556820] pci 0000:01:00.0: [10de:2331] type 00 class 0x030200
Jul 26 16:27:00 vosk86 kernel: [ 0.556944] pci 0000:01:00.0: reg 0x10: [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.556991] pci 0000:01:00.0: reg 0x18: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.557037] pci 0000:01:00.0: reg 0x20: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.557159] pci 0000:01:00.0: Enabling HDA controller
Jul 26 16:27:00 vosk86 kernel: [ 0.557762] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:1c.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
Jul 26 16:27:00 vosk86 kernel: [ 0.763242] pci 0000:01:00.0: can't claim BAR 0 [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:27:00 vosk86 kernel: [ 0.763247] pci 0000:01:00.0: can't claim BAR 2 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:27:00 vosk86 kernel: [ 0.763250] pci 0000:01:00.0: can't claim BAR 4 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:27:00 vosk86 kernel: [ 0.790043] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.790048] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.790050] pci 0000:01:00.0: BAR 4: assigned [mem 0xc000000000-0xc001ffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 0.790113] pci 0000:01:00.0: BAR 0: assigned [mem 0xc002000000-0xc002ffffff 64bit pref]
Jul 26 16:27:00 vosk86 kernel: [ 4.214790] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
Jul 26 16:27:00 vosk86 kernel: [ 4.217328] NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:27:00 vosk86 kernel: [ 4.217332] NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:27:00 vosk86 kernel: [ 4.217675] NVRM: The NVIDIA GPU 0000:01:00.0
Jul 26 16:27:00 vosk86 kernel: [ 4.218145] nvidia: probe of 0000:01:00.0 failed with error -1
Jul 26 16:27:00 vosk86 kernel: [ 4.687366] NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:27:00 vosk86 kernel: [ 4.689802] NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:27:00 vosk86 kernel: [ 4.690253] NVRM: The NVIDIA GPU 0000:01:00.0
Jul 26 16:27:00 vosk86 kernel: [ 4.690787] nvidia: probe of 0000:01:00.0 failed with error -1

Jul 26 16:30:49 vosk86 kernel: [ 0.566179] pci 0000:01:00.0: [10de:2331] type 00 class 0x030200
Jul 26 16:30:49 vosk86 kernel: [ 0.566303] pci 0000:01:00.0: reg 0x10: [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.566465] pci 0000:01:00.0: reg 0x18: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.566512] pci 0000:01:00.0: reg 0x20: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.566641] pci 0000:01:00.0: Enabling HDA controller
Jul 26 16:30:49 vosk86 kernel: [ 0.567245] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:1c.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
Jul 26 16:30:49 vosk86 kernel: [ 0.760403] pci 0000:01:00.0: can't claim BAR 0 [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:30:49 vosk86 kernel: [ 0.760406] pci 0000:01:00.0: can't claim BAR 2 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:30:49 vosk86 kernel: [ 0.760409] pci 0000:01:00.0: can't claim BAR 4 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Jul 26 16:30:49 vosk86 kernel: [ 0.784161] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.784165] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.784168] pci 0000:01:00.0: BAR 4: assigned [mem 0xc000000000-0xc001ffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.784238] pci 0000:01:00.0: BAR 0: assigned [mem 0xc002000000-0xc002ffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811045] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811048] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811051] pci 0000:01:00.0: BAR 4: no space for [mem size 0x02000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811053] pci 0000:01:00.0: BAR 4: failed to assign [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811055] pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.811057] pci 0000:01:00.0: BAR 0: failed to assign [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842752] pci 0000:01:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842755] pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842757] pci 0000:01:00.0: BAR 4: no space for [mem size 0x02000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842760] pci 0000:01:00.0: BAR 4: failed to assign [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842762] pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 0.842764] pci 0000:01:00.0: BAR 0: failed to assign [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
Jul 26 16:30:49 vosk86 kernel: [ 4.257212] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:30:49 vosk86 kernel: [ 4.260728] nvidia: probe of 0000:01:00.0 failed with error -1
Jul 26 16:30:49 vosk86 kernel: [ 4.779756] NVRM: BAR0 is 0M @ 0x0 (PCI:0000:01:00.0)
Jul 26 16:30:49 vosk86 kernel: [ 4.782376] nvidia: probe of 0000:01:00.0 failed with error -1
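Because the kernel retries BAR assignment several times in this last boot, it helps to look only at the final state of each BAR. The following is a rough sketch of how that could be extracted from the log; `final_bar_state` is a hypothetical helper, and the sample lines are shortened copies of the messages above (timestamps and hostname dropped).

```python
import re

ASSIGNED_RE = re.compile(r"BAR (\d+): assigned \[mem ")
FAILED_RE = re.compile(r"BAR (\d+): failed to assign")


def final_bar_state(log_lines):
    """Map each BAR index to 'assigned' or 'failed', keeping the last event seen."""
    state = {}
    for line in log_lines:
        if (match := ASSIGNED_RE.search(line)):
            state[int(match.group(1))] = "assigned"
        elif (match := FAILED_RE.search(line)):
            state[int(match.group(1))] = "failed"
    return state


# Shortened copies of the messages from the 16:30:49 boot, in log order.
sample = [
    "pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]",
    "pci 0000:01:00.0: BAR 4: assigned [mem 0xc000000000-0xc001ffffff 64bit pref]",
    "pci 0000:01:00.0: BAR 0: assigned [mem 0xc002000000-0xc002ffffff 64bit pref]",
    "pci 0000:01:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]",
    "pci 0000:01:00.0: BAR 4: failed to assign [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]",
    "pci 0000:01:00.0: BAR 0: failed to assign [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]",
]

print(final_bar_state(sample))
```

For this boot every BAR ends up unassigned, which is consistent with the later "NVRM: BAR0 is 0M @ 0x0" message. In the earlier boots only the 128 GiB BAR 2 failed, so the guest apparently has no 64-bit window large enough for it.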


I am also getting these errors.

I got the same error and solved it by changing the H100 GPU… :(

Do you mean changing the H100's PCIe slot, or something else?

I am also experiencing a similar problem with 4xH100 machines.

Any suggestions?

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Hi, I am attaching my bug report below.
nvidia-bug-report.log (27.6 MB)

Also, everything works fine on an 8xH100 machine; I am getting this issue only on 4xH100 machines.

Wrong driver. It’s a VM on a vGPU system so you need to use the GRID driver.

Hi, I am using the same driver in an 8xH100 VM and it works fine for me. When I try the same in a 4xH100 VM, I get this error.

On the same host?

No, the 8x is on one host and the 4x is on another.

Attaching the bug report after the GRID driver installation:

nvidia-bug-report.log (872.5 KB)

OK, the device ID is 2330, so those are SXM devices. I don’t know whether those can be passed through so easily. Are the GPUs on the other host also SXM, or plain PCIe?

Yes the other host is also SXM

What driver are the GPUs bound to on the host?
sudo lspci -k -d 10de:*
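If there are many GPUs, the driver binding can be summarized from that output with a few lines of script. This is a minimal sketch, not something posted in the thread: `drivers_in_use` is a hypothetical helper, and the sample text is the host `lspci -nnk` output from the first post with the usual tab indentation restored (the indentation was lost when it was pasted).

```python
def drivers_in_use(lspci_output: str) -> dict:
    """Map each PCI address in `lspci -k` output to its bound driver (or None)."""
    drivers = {}
    current = None
    for line in lspci_output.splitlines():
        if not line.strip():
            continue  # blank line between devices
        if not line[0].isspace():
            # An unindented line starts a new device; the address is its first field.
            current = line.split()[0]
            drivers[current] = None
        elif current and "Kernel driver in use:" in line:
            drivers[current] = line.split(":", 1)[1].strip()
    return drivers


# Host output from the first post; the tab indentation is an assumption.
SAMPLE = (
    "81:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:2331] (rev a1)\n"
    "\tSubsystem: NVIDIA Corporation Device [10de:1626]\n"
    "\tKernel driver in use: vfio-pci\n"
    "\tKernel modules: nvidiafb, nouveau\n"
)

print(drivers_in_use(SAMPLE))
```

A device that shows `None` has no driver bound at all, while `vfio-pci` on the host means the card is reserved for passthrough.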

The host doesn’t have any driver installed; it just has the GPU cards attached to it.

However, here is the output after running the command in the VM:

[image]

Also, the 8x host has NVSwitch and the 4x host doesn’t. Can this be the reason?

So I guess the vfio-pci driver is bound to them for simple PCI passthrough; please switch back to the normal driver.

I wouldn’t expect that. OTOH, NVIDIA is getting weird again regarding passthrough restrictions. NVIDIA’s docs are quite ambiguous: their Fabric Manager docs state that H100 SXM5 NVSwitch systems can be passed through, while their AI Enterprise support matrix states that H100 SXM5 GPUs are only supported bare-metal in general.
https://docs.nvidia.com/ai-enterprise/latest/product-support-matrix/index.html#support-matrix__h100-sxm5-bare-metal-only
Whatever that means. I guess you need to contact your vendor or NVIDIA enterprise support. Maybe also check the vGPU forums; you might get an NVIDIA employee to answer that.

I ran into these issues with our recent H100 PCIe and QEMU/KVM virtualization. It turned out to be a buggy GPU VBIOS. An upgrade to the latest version provided by our vendor fixed it.

Hi, I am having the same issue with 2x H100 alongside 1x A5000. I am trying to pass through only the 2x H100s. One H100 is loaded and working fine; the second H100 fails with the following error messages.

[ 0.902604] pci 0000:02:00.0: [10de:2331] type 00 class 0x030200
[ 0.903470] pci 0000:02:00.0: reg 0x10: [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
[ 0.904112] pci 0000:02:00.0: reg 0x18: [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
[ 0.905217] pci 0000:02:00.0: reg 0x20: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
[ 0.906302] pci 0000:02:00.0: Max Payload Size set to 128 (was 256, max 256)
[ 0.908133] pci 0000:02:00.0: Enabling HDA controller
[ 0.909281] pci 0000:02:00.0: 252.048 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x16 link at 0000:00:05.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[ 0.912200] pci 0000:00:05.0: PCI bridge to [bus 02]
[ 1.042145] pci 0000:02:00.0: can't claim BAR 0 [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
[ 1.043585] pci 0000:02:00.0: can't claim BAR 2 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
[ 1.044062] pci 0000:02:00.0: can't claim BAR 4 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window

[ 1.131465] pci 0000:00:05.0: BAR 15: no space for [mem size 0x3000000000 64bit pref]
[ 1.132385] pci 0000:00:05.0: BAR 15: failed to assign [mem size 0x3000000000 64bit pref]
[ 1.133335] pci 0000:00:04.0: PCI bridge to [bus 01]
[ 3.571329] pci 0000:00:04.0: bridge window [mem 0xc000000000-0xe002ffffff 64bit pref]
[ 6.729684] pci 0000:02:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
[ 6.731789] pci 0000:02:00.0: BAR 2: failed to assign [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
[ 6.734253] pci 0000:02:00.0: BAR 4: no space for [mem size 0x02000000 64bit pref]
[ 6.736138] pci 0000:02:00.0: BAR 4: failed to assign [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
[ 6.738573] pci 0000:02:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
[ 6.740466] pci 0000:02:00.0: BAR 0: failed to assign [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]
[ 6.742904] pci 0000:00:05.0: PCI bridge to [bus 02]
[ 6.748646] pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7 window]
[ 6.751044] pci_bus 0000:00: resource 5 [io 0x0d00-0xffff window]
[ 6.752599] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[ 6.754296] pci_bus 0000:00: resource 7 [mem 0x80000000-0xafffffff window]
[ 6.756001] pci_bus 0000:00: resource 8 [mem 0xc0000000-0xfebfffff window]
[ 6.757720] pci_bus 0000:00: resource 9 [mem 0xc000000000-0xe003007fff window]
[ 6.759518] pci_bus 0000:01: resource 2 [mem 0xc000000000-0xe002ffffff 64bit pref]
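The numbers in this log suggest why only one of the two H100s gets its BARs. A quick back-of-the-envelope check, written as a sketch with hypothetical variable names and with all constants copied from the lines above:

```python
def span(start: int, end: int) -> int:
    """Size in bytes of an inclusive address range."""
    return end - start + 1


# Root bus 64-bit window: "pci_bus 0000:00: resource 9 [mem 0xc000000000-0xe003007fff window]"
root_window = span(0xC000000000, 0xE003007FFF)

# Window already given to the first GPU's bridge:
# "pci 0000:00:04.0: bridge window [mem 0xc000000000-0xe002ffffff 64bit pref]"
first_gpu_window = span(0xC000000000, 0xE002FFFFFF)

# The three BAR sizes the kernel reports for one H100 (BAR 2, BAR 4, BAR 0).
gpu_bars = 0x2000000000 + 0x02000000 + 0x01000000

# Request from the second GPU's bridge:
# "pci 0000:00:05.0: BAR 15: no space for [mem size 0x3000000000 64bit pref]"
second_bridge_request = 0x3000000000

free_after_first = root_window - first_gpu_window

print(f"root window:            {root_window / 2**30:.3f} GiB")
print(f"used by first GPU:      {first_gpu_window / 2**30:.3f} GiB")
print(f"left for the second:    {free_after_first} bytes")
print(f"second bridge asks for: {second_bridge_request // 2**30} GiB")
```

The first GPU's bridge window is exactly the sum of one card's BARs, and it leaves only 0x8000 bytes of the root window free, far less than the second bridge requests. So in this VM the 64-bit MMIO window looks sized for roughly one such GPU; whether and how that window can be enlarged depends on the guest firmware and hypervisor configuration, which the log does not show.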

Attached is the bug report:
nvidia-bug-report.log.gz (276.2 KB)