It looks like configuring the GPUs and switches according to the Fabric Manager manual will fix our first issue. However, we are still not able to pass through all GPUs and switches to a single VM; at most, we are able to load 4 GPUs.
The lspci -v output (command run inside the VM) looks like this for GPUs that are working in the VM:
04:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
Subsystem: NVIDIA Corporation Device 1463
Physical Slot: 0-3
Flags: bus master, fast devsel, latency 0, IRQ 20
Memory at d2000000 (32-bit, non-prefetchable) [size=16M]
Memory at a000000000 (64-bit, prefetchable) [size=128G]
Memory at c000000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
The GPUs that don't work have one or more "Memory at <ignored>" entries:
0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
Subsystem: NVIDIA Corporation Device 1463
Physical Slot: 0-11
Flags: fast devsel, IRQ 16
Memory at <ignored> (32-bit, non-prefetchable) [disabled]
Memory at <ignored> (64-bit, prefetchable) [disabled]
Memory at <ignored> (64-bit, prefetchable) [disabled]
Capabilities: [60] Power Management version 3
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
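To check every passed-through device at once instead of reading the lspci output by eye, something like the rough Python sketch below could be used. It only scans lspci -v text for the "Memory at <ignored>" pattern shown above and counts the hits per device. The function name find_unassigned_bars and the SAMPLE excerpt are my own illustration (the excerpt is trimmed from the two listings above) and are not part of any NVIDIA or pciutils tool.

```python
import re

# A device header looks like "0c:00.0 3D controller: ..." (optionally with a
# leading PCI domain such as "0000:"). Matching on the address rather than on
# indentation means pasted output with stripped whitespace still parses.
HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+\S")


def find_unassigned_bars(lspci_text):
    """Return {pci_address: number of 'Memory at <ignored>' lines}.

    Devices whose memory regions were all assigned are left out of the result.
    """
    result = {}
    current = None
    for raw in lspci_text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            continue
        if current is not None and line.startswith("Memory at <ignored>"):
            result[current] = result.get(current, 0) + 1
    return result


# Hypothetical sample input: a trimmed copy of the two listings in this post.
SAMPLE = """\
04:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
Memory at d2000000 (32-bit, non-prefetchable) [size=16M]
Memory at a000000000 (64-bit, prefetchable) [size=128G]
Memory at c000000000 (64-bit, prefetchable) [size=32M]
Kernel driver in use: nvidia
0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
Memory at <ignored> (32-bit, non-prefetchable) [disabled]
Memory at <ignored> (64-bit, prefetchable) [disabled]
Memory at <ignored> (64-bit, prefetchable) [disabled]
"""

if __name__ == "__main__":
    for address, count in find_unassigned_bars(SAMPLE).items():
        print(f"{address}: {count} memory region(s) not assigned")
```

To run it against a real guest, replace SAMPLE with the captured output of lspci -v from inside the VM. This only reports which devices have unassigned memory regions; it does not explain why the assignment failed.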
The VM uses the Q35 machine type with UEFI and is booted with these parameters: pci=realloc=off rcutree.rcu_idle_gp_delay=1 mem_encrypt=off pci=nocrs,noearly