Hi,
I have 3 different servers:
HP DL380 Gen10 Plus with Intel Xeon Gold 6254 → 4× NVIDIA A4000
Threadripper 7960X-based server → 2× NVIDIA A5000
Ryzen 9 7900X-based server → 2× NVIDIA A4000
I am currently using the A4000/A5000 in all of them with Ubuntu 24.04 LTS (latest kernel, 6.8.0-49-generic) in full passthrough mode to a Windows VM.
At random, one of the A4000/A5000 cards fails on any of the above servers: when I start the VM it says the PCI device is not available, and after a PCI rescan the GPU has fallen off the bus and stays unavailable until I hard-reboot the server.
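(For reference, by "pci rescan" I mean the standard sysfs sequence below; 0000:41:00.0 is just an example address, not the real one:)
# drop the dead device, then ask the kernel to rescan the bus
echo 1 > /sys/bus/pci/devices/0000:41:00.0/remove
echo 1 > /sys/bus/pci/rescan
Even after this, the GPU stays unavailable until a hard reboot.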
Initially I thought it was due to heat or something similar, but here's what I have tried:
- RAM check (all OK on all servers)
- Swapped GPUs from server to server
- Updated to the latest BIOS/iLO wherever possible
- Enabled/disabled overclocking, SR-IOV, Precision Boost Overdrive, PCIe power limits, and every related setting
- Updated to the latest OS/KVM/kernel
- Enabled/disabled disable_idle_d3, PCIe ASPM power management, and the pstate driver
Here are my settings:
/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt textonly initcall_blacklist=sysfb_init pcie_aspm=off pcie_port_pm=off kvm.ignore_msrs=1 vfio_iommu_type1.allow_unsafe_interrupts=1 vfio-pci.disable_vga=1 vfio-pci.disable_idle_d3=1 vfio-pci.ids=10de:2882,10de:22be"
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf
echo "options vfio-pci ids=10de:2882,10de:22be disable_vga=1 disable_idle_d3=1" > /etc/modprobe.d/vfio.conf
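(Sanity checks I use to confirm the binding took effect after boot; the IDs are the same ones from vfio.conf above:)
# confirm both card types are bound to vfio-pci ("Kernel driver in use: vfio-pci")
lspci -nnk -d 10de:2882
lspci -nnk -d 10de:22be
# list the IOMMU groups to make sure each GPU is cleanly isolated
find /sys/kernel/iommu_groups/ -type l
All of this looks correct on every server.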
Nothing works.
My requirements:
- I run a lot of visualization/simulation and AI inference workloads, so I start and stop VMs roughly 20-40 times a day across all GPUs/servers (a rough version of this cycle is sketched after this list)
- Everything runs fine, and then one day in a given week 1-2 GPUs just crash and fall off the bus, and I have to reboot the entire host to get them back
- CPU load, power, and temperature are all fine according to the logs; it's a datacenter environment with proper cooling, and the GPUs stay in the 60-70 °C range at max load
- The VMs are shut down properly
- I have run different kinds of workloads for hours and everything works fine (gaming, simulation, full memory load, 100% GPU compute), but then on a random subsequent VM boot the PCI device is unavailable
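(Roughly what the daily start/stop cycle looks like; "win-sim" is a placeholder domain name, not my real VM:)
# cycle a VM repeatedly - at some random iteration a boot fails with the PCI error
for i in $(seq 1 40); do
  virsh start win-sim
  sleep 600            # let the workload run inside the guest
  virsh shutdown win-sim
  sleep 60             # give the guest time to shut down cleanly
done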
Please help; I am not sure if this is a common problem. These are all enterprise cards, and it's not as if 1-2 specific cards are at fault: it happens at random on each server. Strangely it's not regular, but it reliably occurs once or twice every week or two.
It's very unstable, and I need to reboot the entire host to recover the GPU.
There is nothing in dmesg or the kernel logs; starting the VM just reports that the PCI device is unavailable.
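(These are the checks that come up empty for me; 0000:41:00.0 is again just an example address:)
# kernel messages from the current and the previous boot, filtered for PCIe/vfio events
journalctl -k -b 0 | grep -iE 'vfio|pcieport|aer|10de'
journalctl -k -b -1 | grep -iE 'vfio|pcieport|aer|10de'
# PCIe link status of the GPU while it is still on the bus
lspci -vv -s 0000:41:00.0 | grep -i lnksta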