Memory error when passing ConnectX-4+ NIC to VM

I’m attempting to pass through a Mellanox ConnectX-4 NIC to a VM and getting memory errors in dmesg:

...
[32278.078269] x86/PAT: CPU 2/KVM:108165 conflicting memory types ea000000-ec000000 uncached-minus<->write-combining
[32278.078271] x86/PAT: memtype_reserve failed [mem 0xea000000-0xebffffff], track uncached-minus, req uncached-minus
[32278.078272] ioremap memtype_reserve failed -16
[32278.082272] x86/PAT: CPU 2/KVM:108165 conflicting memory types ea000000-ec000000 uncached-minus<->write-combining
[32278.082275] x86/PAT: memtype_reserve failed [mem 0xea000000-0xebffffff], track uncached-minus, req uncached-minus
[32278.082277] ioremap memtype_reserve failed -16
[32278.086270] x86/PAT: CPU 2/KVM:108165 conflicting memory types ea000000-ec000000 uncached-minus<->write-combining
[32278.086273] x86/PAT: memtype_reserve failed [mem 0xea000000-0xebffffff], track uncached-minus, req uncached-minus
[32278.086274] ioremap memtype_reserve failed -16
[32278.090268] x86/PAT: CPU 2/KVM:108165 conflicting memory types ea000000-ec000000 uncached-minus<->write-combining
[32278.090270] x86/PAT: memtype_reserve failed [mem 0xea000000-0xebffffff], track uncached-minus, req uncached-minus
[32278.090271] ioremap memtype_reserve failed -16
...

These errors repeat hundreds or thousands of times while the VM is starting. Eventually the VM boots properly and lspci in the guest shows the NIC, but the guest doesn’t load a driver for it, so it’s unusable.
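Since the log floods with near-identical lines, it can help to collapse them into a count per conflicting range. Below is a minimal sketch in Python, assuming the dmesg line format shown above; the function name `summarize_pat_conflicts` is made up for illustration and is not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   x86/PAT: CPU 2/KVM:108165 conflicting memory types
#   ea000000-ec000000 uncached-minus<->write-combining
# (assumes the format seen in the dmesg excerpt above)
CONFLICT_RE = re.compile(
    r"x86/PAT: .*? conflicting memory types "
    r"([0-9a-f]+)-([0-9a-f]+) (\S+)<->(\S+)"
)


def summarize_pat_conflicts(dmesg_text):
    """Count PAT conflict lines per (start, end, type_a, type_b)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CONFLICT_RE.search(line)
        if match:
            start, end, type_a, type_b = match.groups()
            # Addresses are printed as bare hex without a 0x prefix.
            counts[(int(start, 16), int(end, 16), type_a, type_b)] += 1
    return counts
```

Feeding it the saved output of `dmesg` returns a `Counter` keyed by address range and memory-type pair, which makes it easy to see whether every conflict hits the same range or several different ones.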

I’m able to pass through other PCI devices like NVMe SSDs with no dmesg errors; it’s specifically the Mellanox NICs that have problems. I’ve tried both a ConnectX-4 Lx and a ConnectX-6 card. Passthrough of these cards works without errors in other machines; it’s only on this AM5 platform that I’m having issues. I’ve tried using SR-IOV and passing through just one virtual function, and that causes the same errors. I’ve also tried the NIC in different PCIe slots, and the same thing happens. Each port of the NIC is in its own IOMMU group, and I’ve tried passing in each individual port, as well as both ports together, each time getting the same errors.
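For reference, the IOMMU grouping can be checked with a short script. This is a minimal sketch in Python, assuming the usual Linux sysfs layout of `/sys/kernel/iommu_groups/<group>/devices/<PCI address>`; the function name `iommu_groups` is made up for illustration.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses in it.

    Returns an empty dict if the directory is missing, for example
    when the IOMMU is disabled.
    """
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        # Group directories are named by their numeric group ID.
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups
```

A group that contains only the NIC port's own PCI address confirms the port is isolated and can be passed through without dragging other devices along.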

This is with a new MSI X670E ACE motherboard with a 7950X CPU. I’m seeing the issue with kernel versions from 5.15 to 6.2. I tried installing MLNX_EN on Ubuntu and that driver resolved the issue, so it seems like the error is in the mlx5_core driver. Unfortunately I’m unable to install MLNX_EN on my primary OS, Fedora.

Hello nvidia.fxzgc,

Thank you for contacting Nvidia support.

To install the MLNX_EN driver on Fedora, use the following package:

https://www.mellanox.com/downloads/ofed/MLNX_EN-5.4-3.6.8.1/mlnx-en-5.4-3.6.8.1-fc32-x86_64.tgz

The errors are not necessarily caused by our driver. Furthermore, the fact that the errors disappeared after installing the driver may be a coincidence. To be sure, further investigation is needed.

Further investigation requires a support ticket and a support entitlement for your products.

Customers with a support entitlement can create a support ticket by sending an email to EnterpriseSupport@nvidia.com.

Best regards,

Nvidia support