mlx5_core: "Will cause a leak of a command resource" error

We always get the following error. Software and hardware information is attached below.

Aug 30 11:12:53 MM kernel: [323391.181538] mlx5_core 0000:81:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0061 address=0xfca97000 flags=0x0020]
Aug 30 11:12:54 MM kernel: [323392.574617] mlx5_core 0000:81:00.0: wait_func_handle_exec_timeout:1104:(pid 105326): cmd[0]: 2RST_QP(0x50a) No done completion
Aug 30 11:12:54 MM kernel: [323392.574626] mlx5_core 0000:81:00.0: wait_func:1132:(pid 105326): 2RST_QP(0x50a) timeout.Will cause a leak of a command resource
Aug 30 11:12:54 MM kernel: [323392.574638] infiniband mlx5_1: destroy_qp_common:2625:(pid 105326): mlx5_ib: modify QP 0x0002f1 to RESET failed
Aug 30 11:12:54 MM kernel: [323392.574974] mlx5_core 0000:81:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0061 address=0xfc771800 flags=0x0000]
Aug 30 11:12:54 MM kernel: [323392.576051] mlx5_core 0000:81:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0061 address=0xfc771840 flags=0x0000]
Aug 30 11:12:54 MM kernel: [323392.576913] mlx5_core 0000:81:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0061 address=0xfc771980 flags=0x0000]
Aug 30 11:12:55 MM kernel: [323393.274775] mlx5_core 0000:81:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0061 address=0xfca97000 flags=0x0020]
Aug 30 11:12:55 MM kernel: [323393.278706] mlx5_core 0000:81:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0061 address=0xfca97040 flags=0x0020]
Aug 30 11:12:55 MM kernel: [323393.758637] amd_iommu_report_page_fault: 127 callbacks suppressed
Aug 30 11:12:55 MM kernel: [323393.758640] mlx5_core 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0061 address=0xea831400 flags=0x0000]
Aug 30 11:12:55 MM kernel: [323393.759408] mlx5_core 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0061 address=0xea831400 flags=0x0000]

$ lspci | grep -i Mellanox
81:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
81:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
c1:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

# ethtool -i eth3
driver: mlx5_core
version: 5.4-1.0.3
firmware-version: 16.31.1014 (MT_0000000010)
expansion-rom-version:
bus-info: 0000:c1:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic

# uname -a
Linux MM-IDC-CPU-10-50-1-60 5.4.0-84-generic #94~18.04.1-Ubuntu SMP Thu Aug 26 23:17:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Hi,

Thank you for submitting your query on NVIDIA Developer Forum.

Please run "cat /proc/cmdline" and check whether the kernel command line includes the following parameter: “iommu=pt”

This parameter is important on systems with AMD CPUs. If it is missing, please add it to the GRUB configuration, reboot the system, and confirm whether the issue is resolved.
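
The check described above can be scripted. The following is only a minimal sketch, not an official NVIDIA tool: `has_param` is a hypothetical helper name introduced here for illustration, and it simply looks for the parameter as a whole word on the kernel command line.

```shell
#!/bin/sh
# has_param is a hypothetical helper (not part of any driver package).
# It succeeds only if the parameter appears as a complete,
# whitespace-separated word in the given kernel command line.
has_param() {
    param=$1
    cmdline=$2
    for word in $cmdline; do
        if [ "$word" = "$param" ]; then
            return 0
        fi
    done
    return 1
}

# Inspect the command line of the currently running kernel.
if has_param "iommu=pt" "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "iommu=pt is set"
else
    echo "iommu=pt is missing"
fi
```

On Ubuntu, assuming the default GRUB setup, the parameter is usually added to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, followed by "sudo update-grub" and a reboot. After the reboot, running the check again should confirm that the parameter is active.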

Thanks,
Namrata.

There is no iommu=pt on our kernel command line. We will try adding it. Thanks.

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.4.0-84-generic root=/dev/mapper/ubuntu--vg-root ro
