We are running the following setup:
Ubuntu 22.04 - 5.19.0-41-generic
Dual A100 PCIe cards
NVIDIA driver - 525.85.07
KVM/Libvirt
OpenStack Zed
Both GPUs are set to MIG mode, with the following slices:
1x 3g.40gb
1x 2g.20gb
4x 1g.10gb
Everything works as expected, except that every now and again, when a user destroys or creates a vGPU-backed instance, the following occurs, usually resulting in a hard lockup of the entire hypervisor:
- CPU usage and load average increase significantly.
- dmesg reports a stuck CPU, normally pointing at either the NVIDIA vGPU manager process (nvidia-vgpu-mgr) or something related to the driver.
- We sometimes see the following in the logs:
WARNING: kernel stack frame pointer at 00000000d007fb49 in nvidia-vgpu-mgr:3794759 has bad value 000000000af58e6e
Jul 11 11:16:04 ock00035 kernel: [621872.807705] [nvidia-vgpu-vfio] 0839f86e-4eab-4622-b81c-c77532e569f3: Register read failed. index: 0 offset: 0xfeb80000 status: 0x65 Timeout occured
- libvirt starts using a load of CPU.
At this point the hypervisor usually hard locks. Sometimes we don't see these messages, but the hypervisor still becomes unresponsive.
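To check how often these messages precede a lockup, the kernel log can be scanned for the two signatures quoted above. This is only a rough sketch: the regular expressions are derived from the two log lines in this post, and the names SIGNATURES and scan_kernel_log are made up for illustration. Other driver versions may word these messages differently.

```python
import re

# Patterns built from the two kernel messages quoted in this post.
# SIGNATURES and scan_kernel_log are hypothetical names for illustration.
SIGNATURES = {
    "bad_frame_pointer": re.compile(
        r"kernel stack frame pointer at \S+ in "
        r"(?P<proc>[\w-]+):(?P<pid>\d+) has bad value"
    ),
    "register_read_failed": re.compile(
        r"\[nvidia-vgpu-vfio\] (?P<uuid>[0-9a-f-]{36}): Register read failed\."
        r".*status: (?P<status>0x[0-9a-fA-F]+)"
    ),
}


def scan_kernel_log(lines):
    """Return (signature name, captured fields) for every matching line."""
    hits = []
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                hits.append((name, match.groupdict()))
    return hits
```

It accepts any iterable of lines, so the output of dmesg or journalctl -k saved to a file can be passed in via open(path), and the captured UUID can then be matched against the instance that was being created or destroyed at the time.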
Does anyone have any ideas? I can provide nvidia-debug logs if needed.