NVIDIA A4000 GPUs Falling off Bus/PCI Not Found Randomly

Hi,

I have 3 different servers:

HPE DL380 Gen10 Plus with Intel Xeon Gold 6254 → 4 x NVIDIA A4000
Threadripper 7960X based server → 2 x NVIDIA A5000
Ryzen 7900X based server → 2 x NVIDIA A4000

All of them run Ubuntu 24.04 LTS (latest 6.8.49-generic kernel) with the A4000/A5000 cards in full PCI passthrough mode to a Windows VM.
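
For reference, a quick sketch of how I check the IOMMU grouping on each host (just listing the groups, nothing exotic):

for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/}; n=${n%%/*}
    printf 'IOMMU group %s: ' "$n"
    lspci -nns "${d##*/}"
done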

Randomly, one of the A4000/A5000 cards fails on any of the above servers: when I start the VM, it reports that the PCI device is not available, and even after a PCI rescan the GPU stays off the bus/unavailable until I hard-reboot the server.
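
The rescan attempt looks roughly like this (0000:41:00.0 is only a placeholder for the affected GPU's address from lspci):

echo 1 > /sys/bus/pci/devices/0000:41:00.0/remove
echo 1 > /sys/bus/pci/rescan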

Initially I thought it was due to heat or something similar, but here is what I have tried:

  1. RAM check (all OK on all servers)
  2. Swapped GPUs from server to server
  3. Updated to the latest BIOS/iLO wherever possible
  4. Enabled/disabled overclocking, SR-IOV, Precision Boost Overdrive, PCIe power limits, and every related BIOS setting
  5. Updated to the latest OS/KVM/kernel
  6. Enabled/disabled disable_idle_d3, PCIe ASPM power management, and CPU P-states (verification commands are shown after this list)
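
Quick checks I use to confirm the power-management settings actually took effect (41:00.0 is a placeholder GPU address):

cat /proc/cmdline                               # kernel parameters actually in effect
cat /sys/module/pcie_aspm/parameters/policy     # current ASPM policy
lspci -vv -s 41:00.0 | grep -i aspm             # per-device ASPM/link state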

Here are my settings:

/etc/default/grub (GRUB_CMDLINE_LINUX_DEFAULT):
"quiet amd_iommu=on iommu=pt textonly initcall_blacklist=sysfb_init pcie_aspm=off pcie_port_pm=off kvm.ignore_msrs=1 vfio_iommu_type1.allow_unsafe_interrupts=1 vfio-pci.disable_vga=1 vfio-pci.disable_idle_d3=1 vfio-pci.ids=10de:2882,10de:22be"
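
Applied with the standard Ubuntu steps and verified after a reboot:

update-grub
cat /proc/cmdline        # confirm the parameters above are actually active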

echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf

echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf

echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf

echo "options vfio-pci ids=10de:2882,10de:22be disable_vga=1 disable_idle_d3=1" > /etc/modprobe.d/vfio.conf
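
Followed by regenerating the initramfs and, after a reboot, checking that the cards are bound to vfio-pci:

update-initramfs -u
lspci -nnk -d 10de:      # each GPU/audio function should show "Kernel driver in use: vfio-pci"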

Nothing works.

My requirements:

  1. I run a lot of visualization/simulation and AI inference workloads, so I start and stop VMs roughly 20-40 times a day across all GPUs/servers
  2. Everything runs fine, and then suddenly, once in a week, 1-2 GPUs just crash and fall off the bus, and I need to reboot the entire host to get them back
  3. CPU load, power, and temperature are all fine according to the logs; it is a datacenter environment with proper cooling, and the GPUs stay under 60-70 °C at max load
  4. VMs are shut down properly
  5. I have run different kinds of workloads for hours (gaming, simulation, full memory load, 100% GPU compute) and everything works fine, but then randomly on the next VM boot the PCI device is unavailable (see the logging sketch after this list)
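
Since the failure only shows up at VM start, I am considering logging each GPU's PCIe link state before every boot; a minimal sketch (the log path and hooking it into my VM start script are my own assumptions):

for dev in $(lspci -D -d 10de: | awk '{print $1}'); do
    echo "=== $dev $(date -Is) ===" >> /var/log/gpu-linkstate.log
    lspci -vv -s "$dev" | grep -E 'LnkSta|DevSta' >> /var/log/gpu-linkstate.log
done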

Please help. I am not sure whether this is a common problem. These are all enterprise cards, and it is not as if only 1-2 specific cards are at fault; it is random on each server. Strangely, it is not regular, but it reliably happens once or twice every week or two.

It is very unstable, and I need to reboot the entire host to get the GPU back.

There is nothing in dmesg or the kernel logs; it just says the PCI device is unavailable when starting the VM.
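
For completeness, this is roughly how I look through the previous boot's kernel log for AER/PCIe errors (assuming systemd-journald with persistent storage enabled):

mkdir -p /var/log/journal && systemctl restart systemd-journald   # make the journal persistent
journalctl -k -b -1 | grep -iE 'aer|pcie|vfio'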