HGX A100 VM passthrough issues on Ubuntu 20.04

I’ve been struggling to create VMs with more than one GPU on an HGX A100 (8x A100 80GB).
Everything looks fine on the host, but I haven’t been able to create a VM with 4 GPUs passed through; at most I’ve seen 2 GPUs show up in nvidia-smi inside the guest.

The host runs Ubuntu 20.04 with kernel 5.4.0-77-generic; the VM runs the same.
The hypervisor is QEMU/KVM, version 4.2.1.

I’m passing 4 GPUs and 3 switches to the VM.
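
For reference, this is roughly how the devices are handed to qemu on the host (a minimal sketch; the PCI addresses are placeholders rather than our real topology, and disk/network/firmware options are omitted):

    # Bind one GPU or NVSwitch to vfio-pci (repeat for each device to be passed through)
    dev=0000:07:00.0                                      # placeholder address
    echo vfio-pci > /sys/bus/pci/devices/$dev/driver_override
    echo $dev > /sys/bus/pci/devices/$dev/driver/unbind   # detach current driver, if one is bound
    echo $dev > /sys/bus/pci/drivers_probe                # rebind, now to vfio-pci

    # Attach 4 GPUs + 3 NVSwitches to the guest, one -device entry each
    qemu-system-x86_64 -machine q35,accel=kvm -m 512G -smp 32 \
        -device vfio-pci,host=0000:07:00.0 \
        -device vfio-pci,host=0000:0f:00.0 \
        -device vfio-pci,host=0000:47:00.0 \
        -device vfio-pci,host=0000:4e:00.0 \
        -device vfio-pci,host=0000:85:00.0 \
        -device vfio-pci,host=0000:86:00.0 \
        -device vfio-pci,host=0000:87:00.0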

I’m attaching two logs from nvidia-bug-report.sh. In the first situation, 2 GPUs show up in nvidia-smi (driver 465.19.01):
nvidia-bug-report.log_1.gz (75.3 KB)

In the second, I added pcie_aspm=off to the boot parameters and no GPUs show up in nvidia-smi at all (driver 470.42.01):
nvidia-bug-report.log_2.gz (431.3 KB)

In the logs where 2 GPUs show up, lspci -d "10de:*" -v -xxx shows this for the 2 failing GPUs:

03:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
	Subsystem: NVIDIA Corporation Device 1463
	Flags: fast devsel, IRQ 11
	Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
	Memory at <ignored> (64-bit, prefetchable)
	Memory at <ignored> (64-bit, prefetchable)

For the 2 that work it shows:

    Flags: fast devsel, IRQ 11
    Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 8000000000 (64-bit, prefetchable) [size=128G]
    Memory at a000000000 (64-bit, prefetchable) [size=32M]

So our issue seems to be there. I have not been able to get a VM with i440fx working and we generally use Q35.

The kernel says:
Some PCI device resources are unassigned, try booting with pci=realloc
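
In case it helps anyone reproducing this: kernel parameters such as pci=realloc (or the pcie_aspm=off mentioned earlier) are added on Ubuntu roughly like this (generic sketch):

    # Append the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.
    #   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"
    sudo update-grub
    sudo reboot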

After getting i440fx to work, we are one step further: all 4 GPUs show up in nvidia-smi.

After installing fabric manager and NVIDIA DCGM, we’re still not able to run any of the CUDA samples though:

root@4a:~/cuda-samples/Samples/p2pBandwidthLatencyTest# ./p2pBandwidthLatencyTest
Cuda failure p2pBandwidthLatencyTest.cu:610: 'system not yet initialized'

nvidia-bug-report.log_3.gz (1.1 MB)
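
On HGX/DGX A100 systems this ‘system not yet initialized’ error usually means the fabric manager service isn’t running yet (which turned out to be the case here, see the edit below). A quick check/start, assuming the nvidia-fabricmanager package matching the driver branch is installed:

    systemctl status nvidia-fabricmanager
    sudo systemctl enable --now nvidia-fabricmanager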

Edit: after making sure fabric manager is running, we are able to run the samples; however, our P2P transfers don’t utilize the switches. In the following test I passed through all 6 switches:
nvidia-bug-report.log_4.gz (1.3 MB)
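
To see whether NVLink paths are actually available to the P2P traffic, the topology and link state can be compared inside the VM (standard nvidia-smi subcommands, output omitted):

    nvidia-smi topo -m           # GPU-to-GPU matrix; NV# entries indicate NVLink connectivity
    nvidia-smi nvlink --status   # per-GPU NVLink link state and speed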

Fabric manager seems to be loaded correctly though:

root@4a:~# service nvidia-fabricmanager status
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2021-07-08 20:41:43 UTC; 14min ago
    Process: 2742 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
   Main PID: 2744 (nv-fabricmanage)
      Tasks: 16 (limit: 599434)
     Memory: 12.5M
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─2744 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Jul 08 20:41:33 4a systemd[1]: Starting NVIDIA fabric manager service...
Jul 08 20:41:43 4a nv-fabricmanager[2744]: Successfully configured all the available GPUs and NVSwitches.
Jul 08 20:41:43 4a systemd[1]: Started NVIDIA fabric manager service.

Another update: if we set FABRIC_MODE=0 in /usr/share/nvidia/nvswitch/fabricmanager.cfg with all 6 switches passed through, we are able to leverage full NVLink bandwidth (see the config sketch after this list).
However, if we pass through only (any) 5 switches, we can’t start the fabric manager service. (We want to create 2 VMs, each containing 4 GPUs and 3 switches.)
=> This makes sense, since we need to configure the GPUs and switches according to the manual, but an example would be helpful for setting that up. What is specifically unclear is how to:
- Block disabled NVLink connections on each GPU by performing the specified MMIO configuration
- Block disabled NVLink connections on each switch by configuring the MMIO intercept
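
For completeness, the FABRIC_MODE change mentioned above is just the following (the mode values are our reading of the fabric manager user guide; this sketch does not cover the MMIO blocking steps we’re asking about):

    # /usr/share/nvidia/nvswitch/fabricmanager.cfg
    # 0 = bare metal / full passthrough, 1 = shared NVSwitch multitenancy, 2 = vGPU multitenancy
    FABRIC_MODE=0

    # restart the service afterwards
    sudo systemctl restart nvidia-fabricmanager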

Additionally, we can’t successfully pass through more than 4 GPUs. This is a log from a VM to which we pass 8 GPUs and all 6 switches:
nvidia-bug-report.log_5_VM.gz (1.4 MB)
This info is from the host:
lspci_host.txt (1.0 KB)
dmesg_host.txt (279.2 KB)

On that VM we are able to use full NVLink bandwidth, but only 4 GPUs show up.

It looks like configuring the GPUs and switches according to the fabric manager manual will fix our first issue. However, we are still not able to pass through all GPUs and switches to a single VM; at most 4 GPUs come up.

lspci -v output looks like this for the GPUs that are working in the VM (command run inside the VM):

04:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
        Subsystem: NVIDIA Corporation Device 1463
        Physical Slot: 0-3
        Flags: bus master, fast devsel, latency 0, IRQ 20
        Memory at d2000000 (32-bit, non-prefetchable) [size=16M]
        Memory at a000000000 (64-bit, prefetchable) [size=128G]
        Memory at c000000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

The GPUs that don’t work have one or more “Memory at <ignored>” entries:

0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
        Subsystem: NVIDIA Corporation Device 1463
        Physical Slot: 0-11
        Flags: fast devsel, IRQ 16
        Memory at <ignored> (32-bit, non-prefetchable) [disabled]
        Memory at <ignored> (64-bit, prefetchable) [disabled]
        Memory at <ignored> (64-bit, prefetchable) [disabled]
        Capabilities: [60] Power Management version 3
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
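
A quick way to confirm that the guest kernel simply failed to place those BARs is to grep its log for assignment failures (a generic check, not output from the attached reports):

    dmesg | grep -iE 'BAR.*(no space|failed to assign)'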

The VM is running Q35 with UEFI, with boot parameters pci=realloc=off rcutree.rcu_idle_gp_delay=1 mem_encrypt=off pci=nocrs,noearly.

It seems that my guest IO (MMIO) address space was limited to 1024 GB, which was not enough.
Setting the qemu option -cpu host fixed it; the address space was increased enough to accommodate all GPUs.
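
For context: 1024 GB corresponds to the 40 physical address bits qemu advertises to the guest by default, while -cpu host exposes the host CPU’s address width. This can be verified inside the guest (the output line below is illustrative):

    grep -m1 'address sizes' /proc/cpuinfo
    # address sizes : 46 bits physical, 48 bits virtual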

I’ve also set the qemu parameter -global q35-pcihost.pci-hole64-size=2048G, which is also required to make it work.

I also removed pci=nocrs from kernel parameters.

Other info that might help others: the machine type is Q35 and the VM boots with UEFI.
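
Putting the workarounds together, the relevant part of the working invocation looks roughly like this (a sketch; memory/CPU sizes, the disk, and the OVMF paths are examples):

    # Q35 + UEFI (OVMF), host physical address bits, 2048G 64-bit PCI hole
    qemu-system-x86_64 -machine q35,accel=kvm -m 512G -smp 32 \
        -cpu host \
        -global q35-pcihost.pci-hole64-size=2048G \
        -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
        -drive if=pflash,format=raw,file=guest_VARS.fd \
        -device vfio-pci,host=0000:07:00.0 \
        -drive file=guest.qcow2,if=virtio
    # (repeat -device vfio-pci,host=... for every GPU and NVSwitch, as in the earlier sketch)

    # Guest kernel parameters, with pci=nocrs removed:
    # pci=realloc=off rcutree.rcu_idle_gp_delay=1 mem_encrypt=off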
