Proxmox GPU passthrough crashes host with RTX PRO 6000 and RTX 5090

Hi,

Specs:
Motherboard GENOA2D24G-2L+
CPU: 2x AMD EPYC 9654 96-Core Processor
GPU: 5x RTX PRO 6000 Blackwell and 6x RTX 5090

I have already lost two weeks trying to solve this; here is a short summary of what I ran into and what I have solved so far.

I am using VFIO passthrough in Proxmox 8.2 with RTX PRO 6000 Blackwell and RTX 5090 Blackwell cards, and I cannot get it stable.

  1. When a VM was booted with 2 GPUs, both cards were visible in lspci inside the Linux guest, but only one showed up in nvidia-smi. Switching the VM to OVMF (UEFI) firmware solved that (see the command sketch after this list).
  2. Booting either Windows or Linux with 2 or more GPUs caused crashes on the host:
    CPU soft lockups, vfio device not responding,
    [490431.821151] vfio-pci 0000:81:00.0: Unable to change power state from D3cold to D0, device inaccessible
    Here switching to OVMF (UEFI) also helped, but Linux boots a lot more slowly than with SeaBIOS.
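
For reference, switching an existing VM to the Q35 machine type plus OVMF firmware from the Proxmox CLI looks roughly like this (a sketch only; the VM ID 100 and the storage name local-lvm are placeholders for your own values):

# Sketch: move a VM to Q35 + OVMF and add the EFI vars disk that OVMF needs
qm set 100 --machine q35
qm set 100 --bios ovmf
qm set 100 --efidisk0 local-lvm:1,efitype=4m,pre-enrolled-keys=0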

So far I have fixed the issues around creating the VMs, but there is one more.
If a VM (Windows or Linux) runs for some time and is then closed (I am not sure whether the guest hibernates or shuts down), I get:

[79929.589585] tap12970056i0: entered promiscuous mode
[79929.618943] wanbr: port 3(tap12970056i0) entered blocking state
[79929.618949] wanbr: port 3(tap12970056i0) entered disabled state
[79929.619056] tap12970056i0: entered allmulticast mode
[79929.619260] wanbr: port 3(tap12970056i0) entered blocking state
[79929.619262] wanbr: port 3(tap12970056i0) entered forwarding state
[104065.181539] tap12970056i0: left allmulticast mode
[104065.181689] wanbr: port 3(tap12970056i0) entered disabled state
[104069.337819] vfio-pci 0000:41:00.0: not ready 1023ms after FLR; waiting
[104070.425845] vfio-pci 0000:41:00.0: not ready 2047ms after FLR; waiting
[104072.537878] vfio-pci 0000:41:00.0: not ready 4095ms after FLR; waiting
[104077.018008] vfio-pci 0000:41:00.0: not ready 8191ms after FLR; waiting
[104085.722212] vfio-pci 0000:41:00.0: not ready 16383ms after FLR; waiting
[104102.618637] vfio-pci 0000:41:00.0: not ready 32767ms after FLR; waiting
[104137.947487] vfio-pci 0000:41:00.0: not ready 65535ms after FLR; giving up
[104164.933500] watchdog: BUG: soft lockup - CPU#48 stuck for 27s! [kvm:3713788]
[104164.933536] Modules linked in: ebtable_filter ebtables ip_set sctp wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics nvme_keyring 8021q garp mrp bonding ip6table_filter ip6table_raw ip6_tables xt_conntrack xt_comment softdog xt_tcpudp iptable_filter sunrpc xt_MASQUERADE xt_addrtype iptable_nat nf_nat nf_conntrack binfmt_misc nf_defrag_ipv6 nf_defrag_ipv4 nfnetlink_log libcrc32c nfnetlink iptable_raw intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd dax_hmem cxl_acpi cxl_port rapl cxl_core pcspkr ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ast k10temp ccp ipmi_msghandler joydev input_leds mac_hid zfs(PO) spl(O) vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 mlx5_ib ib_uverbs
[104164.933620]  macsec ib_core hid_generic usbkbd usbmouse cdc_ether usbhid usbnet hid mii mlx5_core mlxfw psample igb xhci_pci tls nvme i2c_algo_bit xhci_pci_renesas crc32_pclmul dca pci_hyperv_intf nvme_core ahci xhci_hcd libahci nvme_auth i2c_piix4
[104164.933651] CPU: 48 PID: 3713788 Comm: kvm Tainted: P           O       6.8.12-11-pve #1
[104164.933654] Hardware name: To Be Filled By O.E.M. GENOA2D24G-2L+/GENOA2D24G-2L+, BIOS 2.06 05/06/2024
[104164.933656] RIP: 0010:pci_mmcfg_read+0xcb/0x110
[104164.933666] Code: 45 31 c9 e9 a2 18 26 00 4c 01 e8 66 8b 00 0f b7 c0 41 89 04 24 eb c9 4c 01 e8 8a 00 0f b6 c0 41 89 04 24 eb bb 4c 01 e8 8b 00 <41> 89 04 24 eb b0 e8 4a 57 11 ff 41 c7 04 24 ff ff ff ff 48 83 c4
[104164.933668] RSP: 0018:ff69a14eeae63890 EFLAGS: 00000286
[104164.933670] RAX: 00000000ffffffff RBX: 0000000004100000 RCX: 0000000000000ffc
[104164.933672] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[104164.933673] RBP: ff69a14eeae638c0 R08: 0000000000000004 R09: ff69a14eeae638e4
[104164.933675] R10: 0000000000000041 R11: ffffffff876a27c0 R12: ff69a14eeae638e4
[104164.933676] R13: 0000000000000ffc R14: 0000000000000000 R15: 0000000000000004
[104164.933677] FS:  0000000000000000(0000) GS:ff188583fb800000(0000) knlGS:0000000000000000
[104164.933679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[104164.933680] CR2: 000076603dbe8d20 CR3: 0000017bd2236003 CR4: 0000000000f71ef0
[104164.933682] PKRU: 55555554
[104164.933683] Call Trace:
[104164.933686]  <IRQ>
[104164.933691]  ? show_regs+0x6d/0x80
[104164.933696]  ? watchdog_timer_fn+0x206/0x290
[104164.933701]  ? __pfx_watchdog_timer_fn+0x10/0x10
[104164.933703]  ? __hrtimer_run_queues+0x105/0x280
[104164.933709]  ? hrtimer_interrupt+0xf6/0x250
[104164.933714]  ? __sysvec_apic_timer_interrupt+0x4e/0x120
[104164.933719]  ? sysvec_apic_timer_interrupt+0x8d/0xd0
[104164.933724]  </IRQ>
[104164.933725]  <TASK>
[104164.933727]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[104164.933732]  ? __pfx_pci_mmcfg_read+0x10/0x10
[104164.933735]  ? pci_mmcfg_read+0xcb/0x110
[104164.933738]  pci_read+0x52/0x90
[104164.933741]  pci_bus_read_config_dword+0x47/0x90
[104164.933746]  pci_read_config_dword+0x27/0x50
[104164.933748]  pci_find_next_ext_capability+0x83/0xe0
[104164.933753]  pci_find_ext_capability+0x12/0x20
[104164.933755]  pci_restore_ltr_state+0x29/0x60
[104164.933759]  pci_restore_state.part.0+0x2d/0x3a0
[104164.933764]  pci_restore_state+0x1e/0x30
[104164.933768]  vfio_pci_core_disable+0x3e2/0x450 [vfio_pci_core]
[104164.933774]  vfio_pci_core_close_device+0x64/0xd0 [vfio_pci_core]
[104164.933779]  vfio_df_close+0x5c/0xb0 [vfio]
[104164.933784]  vfio_df_group_close+0x37/0x80 [vfio]
[104164.933788]  vfio_device_fops_release+0x25/0x50 [vfio]
[104164.933792]  __fput+0xa0/0x2e0
[104164.933797]  ____fput+0xe/0x20
[104164.933799]  task_work_run+0x5e/0xa0
[104164.933804]  do_exit+0x386/0xae0
[104164.933809]  do_group_exit+0x35/0x90
[104164.933812]  __x64_sys_exit_group+0x18/0x20
[104164.933814]  x64_sys_call+0x2001/0x2480
[104164.933818]  do_syscall_64+0x81/0x170
[104164.933824]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933826]  ? xas_find+0x6e/0x1d0
[104164.933831]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933833]  ? next_uptodate_folio+0x93/0x290
[104164.933839]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933841]  ? filemap_map_pages+0x4b8/0x5b0
[104164.933844]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933846]  ? ptep_set_access_flags+0x4a/0x70
[104164.933850]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933852]  ? wp_page_reuse+0x95/0xc0
[104164.933857]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933859]  ? do_wp_page+0x1c5/0xc10
[104164.933862]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933864]  ? __pte_offset_map+0x1c/0x1b0
[104164.933868]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933870]  ? __handle_mm_fault+0xba9/0xf70
[104164.933875]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933877]  ? __count_memcg_events+0x6f/0xe0
[104164.933881]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933883]  ? count_memcg_events.constprop.0+0x2a/0x50
[104164.933886]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933888]  ? handle_mm_fault+0xad/0x380
[104164.933891]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933893]  ? do_user_addr_fault+0x33f/0x660
[104164.933895]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933897]  ? irqentry_exit_to_user_mode+0x7b/0x260
[104164.933901]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933903]  ? irqentry_exit+0x43/0x50
[104164.933905]  ? srso_alias_return_thunk+0x5/0xfbef5
[104164.933907]  ? exc_page_fault+0x94/0x1b0
[104164.933911]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[104164.933912] RIP: 0033:0x79f2c0849409
[104164.933947] Code: Unable to access opcode bytes at 0x79f2c08493df.

This happens exactly after shutting down the VM. I have seen it with both Linux and Windows VMs, and both were using OVMF (UEFI) firmware.
After that the host lags and the GPU is no longer accessible.

The PCIe links are all x16 Gen 5.0, and there are no issues when I use the GPUs directly on the host. What can I do?

RTX PRO 6000 Blackwell 96 GB - VBIOS: 98.02.52.00.02

root@d:/etc/modprobe.d# cat vfio.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1
options kvm ignore_msrs=1 report_ignored_msrs=0
options vfio-pci ids=10de:2bb1,10de:22e8,10de:2b85 disable_vga=1 disable_idle_d3=1

cat blacklist-gpu.conf
blacklist radeon
blacklist nouveau
blacklist nvidia
# Additional NVIDIA related blacklists
blacklist snd_hda_intel
blacklist amd76x_edac
blacklist vga16fb
blacklist rivafb
blacklist nvidiafb
blacklist rivatv

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 vfio-pci.ids=10de:22e8,10de:2b85"
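
For completeness, a quick host-side sanity check that vfio-pci has actually claimed the cards and that the IOMMU grouping is sane (just a sketch using standard tools; 10de is NVIDIA's PCI vendor ID):

# Sketch: the "Kernel driver in use" line should read vfio-pci for every NVIDIA function
lspci -nnk -d 10de: | grep -E "10de|Kernel driver in use"

# Sketch: list every device together with its IOMMU group number
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d#/sys/kernel/iommu_groups/}; g=${g%%/*}
    echo "IOMMU group $g: $(basename "$d")"
done | sort -V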

I have tried all kinds of different kernels; currently I am on 6.8.12-11-pve.

Please help. Could this be a vBIOS issue? Do I need to upgrade the firmware on the GPUs?
Do I need something else? I am really desperate.

I have checked, and it happens on 6 different servers, with RTX 5090s from two different brands as well as the RTX PRO 6000. They are all Blackwell cards, though, so maybe there is an issue specific to them? Could anyone from the NVIDIA dev team take a look at this?

HP Z840
Motherboard: 761510-001 905483-601
CPU: 2x Intel Xeon E5-2699 v4
GPU: 1x RTX PRO 6000 Blackwell

I second this. My Proxmox hypervisor crashes consistently when fine-tuning with the RTX PRO 6000 passed through to a VM (axolotl), but only when fine-tuning. Running the model for inference is fine; something triggers the crash right when the final fine-tuned file is being written back to the hard drive from the GPU's VRAM.

I'm going to watch this thread closely, as I'm dead in the water until this gets resolved. I am not running two GPUs; for me this occurs with just one passed through to the VM.

I am using ZFS, but our boot options (GRUB vs. systemd-boot) look almost 100% the same.

I have another Z840 with almost the same hardware; I migrated the card and the VM over and confirmed the hypervisor crashes there as well while writing the fine-tuned file.

Can you give me a set of commands, or a packaged VM image plus commands, that would let me trigger this state?
Right now I am working blind: I change something in the config and then have to wait a few days until it crashes again (until someone does something that triggers the issue).
Being able to trigger it on demand would be very helpful!
Can you prepare that for me and send it via PM?

I think I can get you most of the way there. Off the top of my head, my deployment is:

ubuntu-22.04.3-desktop-amd64.iso

conda create --name axolotl

conda activate axolotl

Install Python and pip into the environment

cd ~

pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]

Download the example axolotl configs and DeepSpeed configs:

axolotl fetch examples
axolotl fetch deepspeed_configs

Afterwards, upgrade torch to support sm_120:
pip install --upgrade torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu128
sudo apt update && sudo apt install libopenmpi-dev openmpi-bin
pip install mpi4py

# Install the NVIDIA open driver
sudo apt install nvidia-driver-575-open -y
sudo apt install nvidia-utils-575
sudo apt install nvidia-cuda-toolkit -y
sudo update-initramfs -u
sudo reboot

conda activate axolotl
cd ~/axolotl/examples/llama-3
accelerate launch -m axolotl.cli.train fft-8b.yaml

It will download the model and train it on the alpaca example dataset (tatsu-lab/alpaca · Datasets at Hugging Face).
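
While it trains, and especially while it writes the fine-tuned file back to disk, it is worth following the host kernel log in a second shell so you catch the exact moment the FLR/soft-lockup messages from the first post show up (standard tools, nothing axolotl-specific):

# On the Proxmox host, follow kernel messages and filter for the symptoms
journalctl -k -f | grep -Ei "vfio|flr|soft lockup|d3cold"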