I’ve been struggling to create VMs with more than 1 GPU on an HGX A100 (8x A100 80GB).
On the host everything looks fine, but I haven’t been able to create a VM with 4 GPUs passed through. At most, 2 GPUs show up in nvidia-smi.
The host runs Ubuntu 20.04 with kernel 5.4.0-77-generic, and the VM runs the same.
The hypervisor is KVM (QEMU 4.2.1).
I’m passing 4 GPUs and 3 switches to the VM.
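For context on the setup: each GPU and NVSwitch function is bound to vfio-pci on the host before starting the VM. A minimal sketch of that step using the sysfs driver_override mechanism (the PCI address below is a placeholder; substitute the BDFs from lspci on the host):

```shell
# Bind one device to vfio-pci (sketch; 0000:07:00.0 is a placeholder BDF)
modprobe vfio-pci
BDF=0000:07:00.0
echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override
echo "$BDF" > /sys/bus/pci/devices/$BDF/driver/unbind 2>/dev/null || true
echo "$BDF" > /sys/bus/pci/drivers_probe   # rebind using the override
```

Repeat for every GPU and switch function being passed through.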
Attaching 2 logs from nvidia-bug-report.sh. In the first, 2 GPUs show up in nvidia-smi, driver 465.19.01: nvidia-bug-report.log_1.gz (75.3 KB)
In the second, pcie_aspm=off was added to the boot parameters and no GPUs show up in nvidia-smi, driver 470.42.01: nvidia-bug-report.log_2.gz (431.3 KB)
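For reference, pcie_aspm=off was added via GRUB; a sketch assuming the standard Ubuntu 20.04 setup (the "quiet splash" defaults are an assumption, merge with whatever is already there):

```shell
# /etc/default/grub -- append the parameter to the existing line (sketch)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
# then apply and reboot:
#   sudo update-grub && sudo reboot
```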
In the logs where 2 GPUs show up, lspci -d "10de:*" -v -xxx shows this for the 2 failing GPUs:
03:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
Subsystem: NVIDIA Corporation Device 1463
Flags: fast devsel, IRQ 11
Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
Memory at <ignored> (64-bit, prefetchable)
Memory at <ignored> (64-bit, prefetchable)
For the 2 that work it shows:
Flags: fast devsel, IRQ 11
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at 8000000000 (64-bit, prefetchable) [size=128G]
Memory at a000000000 (64-bit, prefetchable) [size=32M]
So our issue seems to be there. I have not been able to get a VM with i440fx working yet; we generally use Q35.
Update: after successfully using i440fx, we are one step further; all 4 GPUs show up in nvidia-smi.
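For reference, the machine type is selected with QEMU's -machine option; `pc` selects the i440fx-based board and `q35` the ICH9-based one (a sketch, not a full command line):

```shell
# i440fx-based machine type:
qemu-system-x86_64 -machine pc,accel=kvm ...
# Q35-based machine type:
qemu-system-x86_64 -machine q35,accel=kvm ...
```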
After installing fabric manager and NVIDIA DCGM, we’re still not able to run any of the CUDA samples:
root@4a:~/cuda-samples/Samples/p2pBandwidthLatencyTest# ./p2pBandwidthLatencyTest
Cuda failure p2pBandwidthLatencyTest.cu:610: 'system not yet initialized'
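The "system not yet initialized" error is what CUDA reports on NVSwitch systems when the fabric manager service isn’t running. A sketch of getting it running on Ubuntu (the package name assumes the 470 driver branch and NVIDIA’s CUDA apt repository being configured):

```shell
# Install fabric manager matching the driver branch (470 here) and start it
sudo apt-get install -y nvidia-fabricmanager-470
sudo systemctl enable --now nvidia-fabricmanager
systemctl status nvidia-fabricmanager    # expect "active (running)"
```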
Edit: after making sure fabric manager is running, we are able to run the samples; however, our P2P transfers don’t utilize the switches. In the following test I passed through all 6 switches: nvidia-bug-report.log_4.gz (1.3 MB)
Fabric manager seems to be loaded correctly, though:
root@4a:~# service nvidia-fabricmanager status
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2021-07-08 20:41:43 UTC; 14min ago
Process: 2742 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
Main PID: 2744 (nv-fabricmanage)
Tasks: 16 (limit: 599434)
Memory: 12.5M
CGroup: /system.slice/nvidia-fabricmanager.service
└─2744 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Jul 08 20:41:33 4a systemd[1]: Starting NVIDIA fabric manager service...
Jul 08 20:41:43 4a nv-fabricmanager[2744]: Successfully configured all the available GPUs and NVSwitches.
Jul 08 20:41:43 4a systemd[1]: Started NVIDIA fabric manager service.
Another update: if we set FABRIC_MODE=0 in /usr/share/nvidia/nvswitch/fabricmanager.cfg with all 6 switches passed through, we are able to leverage full NVLink bandwidth. However, if we pass through only 5 switches (any 5), we can’t start the fabric manager service. (We want to create 2 VMs, each containing 4 GPUs and 3 switches.)
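For reference, the change in question (FABRIC_MODE is documented in the Fabric Manager user guide; 0 is the bare-metal / full-passthrough mode):

```shell
# /usr/share/nvidia/nvswitch/fabricmanager.cfg (excerpt)
FABRIC_MODE=0
# restart the service afterwards:
#   sudo systemctl restart nvidia-fabricmanager
```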
=> This makes sense, since we need to configure GPUs and switches according to the manual; however, an example would be helpful in setting that up. What is specifically unclear is how to:
- Block disabled NVLink connections on each GPU by performing the specified MMIO configuration
- Block disabled NVLink connections on each switch by configuring the MMIO intercept
Additionally, we can’t successfully pass through more than 4 GPUs. This is a log from a VM to which we pass 8 GPUs and all 6 switches: nvidia-bug-report.log_5_VM.gz (1.4 MB)
This info is from the host: lspci_host.txt (1.0 KB) dmesg_host.txt (279.2 KB)
On that VM we are able to use full NVLink bandwidth, but only 4 GPUs show up.
It looks like configuring the GPUs and switches according to the fabric manager manual will fix our first issue; however, we are still not able to pass through all GPUs and switches to a single VM. At most, 4 GPUs load.
lspci -v output looks like this for GPUs that are working in the VM (command run inside the VM):
04:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
Subsystem: NVIDIA Corporation Device 1463
Physical Slot: 0-3
Flags: bus master, fast devsel, latency 0, IRQ 20
Memory at d2000000 (32-bit, non-prefetchable) [size=16M]
Memory at a000000000 (64-bit, prefetchable) [size=128G]
Memory at c000000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
The GPUs that don’t work have one or more "Memory at <ignored>" entries.
It seems that the VM’s MMIO address space was limited to 1024 GB, which was not enough.
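The arithmetic bears this out: per the lspci output above, each GPU exposes a 128 GiB 64-bit prefetchable BAR, so 8 GPUs need 1024 GiB of BAR space alone, which already fills a 1024 GB address space before any other MMIO:

```shell
# 8 GPUs x 128 GiB BAR1 each (size taken from the lspci output above)
echo $(( 8 * 128 ))   # prints 1024 (GiB)
```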
Adding -cpu host to the QEMU command line fixed it; the guest address space increased enough to accommodate all GPUs.
I’ve also set the QEMU parameter -global q35-pcihost.pci-hole64-size=2048G, which is likewise required to make it work.
I also removed pci=nocrs from kernel parameters.
Other info that might help others: the machine type is q35 and the VM boots via UEFI.
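Putting the fixes together, a sketch of the relevant QEMU options (not a complete command line; the vfio-pci host addresses are placeholders, and memory, disk, and UEFI firmware options are omitted):

```shell
qemu-system-x86_64 \
  -machine q35,accel=kvm \
  -cpu host \
  -global q35-pcihost.pci-hole64-size=2048G \
  -device vfio-pci,host=0000:07:00.0 \
  -device vfio-pci,host=0000:08:00.0
# -cpu host: exposes the host's physical address bits to the guest
# -global ...pci-hole64-size: makes the 64-bit PCI hole large enough for all GPU BARs
# repeat -device vfio-pci,host=... for each GPU and NVSwitch being passed through
```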