Hello,
I currently have the NVIDIA GH100 H200 SXM card, which has 8 GPUs, and I’m trying to split these GPUs across VMs (2 GPUs per VM) using PCI passthrough, managed by a KVM host. I was able to set everything up and list the 2 GPUs inside the guest system. The problem is that when I install the driver in the guest, it only initializes one GPU correctly; nvidia-smi confirms that it sees only one GPU. Resetting the second GPU didn’t help either.
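For reference, the passed-through devices and their BAR sizes can be listed from sysfs inside the guest with something like this minimal sketch (0x10de is NVIDIA’s PCI vendor ID; the paths assume a standard Linux sysfs layout):

```python
#!/usr/bin/env python3
# Sketch: list PCI devices with NVIDIA's vendor ID (0x10de) and print the
# size of each assigned BAR, to confirm what the guest actually sees.
from pathlib import Path

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    if (dev / "vendor").read_text().strip() != "0x10de":
        continue
    print(dev.name, (dev / "device").read_text().strip())
    # Each line of 'resource' is "start end flags" for one BAR/region.
    for i, line in enumerate((dev / "resource").read_text().splitlines()):
        start, end, _flags = (int(x, 16) for x in line.split())
        if end > start:
            print(f"  region {i}: {(end - start + 1) / 2**30:.2f} GiB")
```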
I have the following kernel logs from boot-up (0000:01:01.0 being the second GPU, the one that fails):
[Fri Nov 15 11:41:15 2024] pci 0000:01:01.0: [10de:2335] type 00 class 0x030200
[Fri Nov 15 11:41:15 2024] pci 0000:01:01.0: reg 0x10: [mem 0x388804000000-0x388804ffffff 64bit pref]
[Fri Nov 15 11:41:15 2024] pci 0000:01:01.0: reg 0x18: [mem 0x380000000000-0x383fffffffff 64bit pref]
[Fri Nov 15 11:41:15 2024] pci 0000:01:01.0: reg 0x20: [mem 0x388800000000-0x388801ffffff 64bit pref]
[Fri Nov 15 11:41:15 2024] pci 0000:01:01.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 0000:00:03.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[Fri Nov 15 11:41:15 2024] pci 0000:01:01.0: can't claim BAR 2 [mem 0x380000000000-0x383fffffffff 64bit pref]: no compatible bridge window
[Fri Nov 15 11:41:15 2024] pci 0000:01:01.0: BAR 2: no space for [mem size 0x4000000000 64bit pref]
[Fri Nov 15 11:41:15 2024] pci 0000:01:01.0: BAR 2: failed to assign [mem 0x380000000000-0x383fffffffff 64bit pref]
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:01.0)
NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:01.0)
[Fri Nov 15 11:41:29 2024] nvidia 0000:01:01.0: firmware: direct-loading firmware nvidia/565.57.01/gsp_ga10x.bin
[Fri Nov 15 11:41:31 2024] resource sanity check: requesting [mem 0x388804700000-0x3888066fffff], which spans more than 0000:01:01.0 [mem 0x388804000000-0x388804ffffff 64bit pref]
[Fri Nov 15 11:41:31 2024] NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x24:0x72:1107)
[Fri Nov 15 11:41:31 2024] NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0
[Fri Nov 15 11:41:31 2024] nvidia 0000:01:01.0: firmware: direct-loading firmware nvidia/565.57.01/gsp_ga10x.bin
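Doing the math on the failing BAR with the addresses from the dmesg lines above, BAR 2 alone wants 256 GiB of 64-bit prefetchable MMIO:

```python
# Quick check using the values from the dmesg lines above:
# BAR 2 of 0000:01:01.0 spans 0x380000000000..0x383fffffffff.
start, end = 0x380000000000, 0x383FFFFFFFFF
size = end - start + 1
print(hex(size), size // 2**30, "GiB")  # -> 0x4000000000, 256 GiB
```

So with 2 GPUs per VM the guest needs at least 2 × 256 GiB of 64-bit MMIO behind the bridge, which seems to be what the “no compatible bridge window” / “no space for [mem size 0x4000000000 64bit pref]” messages are about; the working GPU’s BAR 2 is the same size and was claimed, so it looks like there is room for one such BAR but not two.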
These are the logs of the working GPU (0000:01:02.0):
[Fri Nov 15 11:41:15 2024] pci 0000:01:02.0: [10de:2335] type 00 class 0x030200
[Fri Nov 15 11:41:15 2024] pci 0000:01:02.0: reg 0x10: [mem 0x388805000000-0x388805ffffff 64bit pref]
[Fri Nov 15 11:41:15 2024] pci 0000:01:02.0: reg 0x18: [mem 0x384000000000-0x387fffffffff 64bit pref]
[Fri Nov 15 11:41:15 2024] pci 0000:01:02.0: reg 0x20: [mem 0x388802000000-0x388803ffffff 64bit pref]
[Fri Nov 15 11:41:15 2024] pci 0000:01:02.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 0000:00:03.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[Fri Nov 15 11:41:31 2024] nvidia 0000:01:02.0: firmware: direct-loading firmware nvidia/565.57.01/gsp_ga10x.bin
[Fri Nov 15 11:41:33 2024] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:02.0 on minor 1
lspci in the guest shows both GPUs:
01:01.0 3D controller: NVIDIA Corporation GH100 [H200 SXM 141GB] (rev a1)
01:02.0 3D controller: NVIDIA Corporation GH100 [H200 SXM 141GB] (rev a1)
I’m running the latest Debian 12 and the latest NVIDIA driver (565), with QEMU using the i440fx chipset.
I also tested the H200 card directly on the host, and all GPUs worked correctly there.