Hello! I have a problem. Today my server with a Mellanox 100G card stopped working.
The messages log contains many entries like these:
Jun 2 06:29:39 138224 kernel: [934519.424191] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.424195] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.424198] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.424199] 00000030: 00 00 00 00 04 00 51 04 0a 00 02 83 b8 63 dc d2
Jun 2 06:29:39 138224 kernel: [934519.424202] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x63, len: 128
Jun 2 06:29:39 138224 kernel: [934519.424203] 00000000: 00 b8 63 0a 00 02 83 05 00 00 00 08 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.424205] 00000010: 00 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.424206] 00000020: 00 00 00 42 00 00 22 00 00 00 00 00 ee 84 50 fe
Jun 2 06:29:39 138224 kernel: [934519.424207] 00000030: 00 00 02 f8 00 00 22 00 00 00 00 00 f4 41 2d 08
Jun 2 06:29:39 138224 kernel: [934519.424208] 00000040: 00 00 02 74 00 00 22 00 00 00 00 00 ef 67 10 00
Jun 2 06:29:39 138224 kernel: [934519.424211] 00000050: e7 d4 00 00 01 01 08 0a 66 55 ef f7 22 a1 a2 2b
Jun 2 06:29:39 138224 kernel: [934519.424214] 00000060: 00 00 0b 50 00 00 22 00 00 00 00 00 f3 4a d0 b5
Jun 2 06:29:39 138224 kernel: [934519.424217] 00000070: ea 5f 40 00 40 06 15 d3 32 07 ee 1a 05 3b 04 25
Jun 2 06:29:39 138224 kernel: [934519.506293] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.506295] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.506300] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.506302] 00000030: 00 00 00 00 30 10 68 02 29 00 02 83 00 6d 2b d2
Jun 2 06:29:39 138224 kernel: [934519.506306] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x6d, len: 64
Jun 2 06:29:39 138224 kernel: [934519.506307] 00000000: 00 00 6d 29 00 02 83 02 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.506308] 00000010: 00 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.506311] 00000020: c7 7d d9 96 0c 42 a1 0a 30 92 08 00 45 00 0a bc
Jun 2 06:29:39 138224 kernel: [934519.506314] 00000030: 28 a9 40 00 40 06 a1 50 32 07 ee 1a b0 3b 95 e5
Jun 2 06:29:39 138224 kernel: [934519.562513] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.562515] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.562517] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.562524] 00000030: 00 00 00 00 30 10 68 02 29 00 02 83 00 00 5d d2
Jun 2 06:29:39 138224 kernel: [934519.562528] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x0, len: 64
Jun 2 06:29:39 138224 kernel: [934519.562532] 00000000: 00 00 00 29 00 02 83 02 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.562533] 00000010: 00 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.562535] 00000020: c7 7d d9 96 0c 42 a1 0a 30 92 08 00 45 00 0b 84
Jun 2 06:29:39 138224 kernel: [934519.562536] 00000030: d1 d5 40 00 40 06 8b b3 32 07 ee 1a 4d de 63 eb
Jun 2 06:29:39 138224 kernel: [934519.630932] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.630934] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.630935] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.630939] 00000030: 00 00 00 00 30 10 68 02 29 00 02 83 00 00 54 d2
Jun 2 06:29:39 138224 kernel: [934519.630945] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x0, len: 64
Jun 2 06:29:39 138224 kernel: [934519.630951] 00000000: 00 00 00 29 00 02 83 02 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.630956] 00000010: 00 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.630957] 00000020: c7 7d d9 96 0c 42 a1 0a 30 92 08 00 45 00 0a bc
Jun 2 06:29:39 138224 kernel: [934519.630958] 00000030: dd 29 40 00 40 06 ea 4a 32 07 ee 1a 55 73 f3 32
Jun 2 06:29:39 138224 kernel: [934519.659629] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.659631] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.659633] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jun 2 06:29:39 138224 kernel: [934519.659635] 00000030: 00 00 00 00 30 10 68 02 29 00 02 83 00 00 ae d2
Jun 2 06:29:39 138224 kernel: [934519.659637] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x0, len: 64
After that, the following appears:
Jun 2 06:29:40 138224 kernel: [934519.921473] ------------[ cut here ]------------
Jun 2 06:29:40 138224 kernel: [934519.921495] WARNING: CPU: 84 PID: 0 at drivers/iommu/iova.c:817 iova_magazine_free_pfns.part.13.cold.23+0x8/0xf
Jun 2 06:29:40 138224 kernel: [934519.921496] Modules linked in: fuse btrfs zstd_compress zstd_decompress xxhash ufs qnx4 hfsplus hfs minix vfat msdos fat jfs xfs dm_mod binfmt_misc msr mst_pciconf(OE) amd64_edac_mod edac_mce_amd kvm_amd kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr aufs(OE) ast ttm drm_kms_helper drm joydev i2c_algo_bit evdev sg ccp rng_core sp5100_tco ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq acpi_cpufreq button tcp_bbr sch_fq bonding lp parport loop ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear hid_generic usbhid hid raid1 md_mod sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ahci xhci_pci libahci
Jun 2 06:29:40 138224 kernel: [934519.921547] xhci_hcd libata mlx5_core(OE) nvme mlxfw(OE) nvme_core scsi_mod usbcore mlx_compat(OE) devlink i2c_piix4 usb_common [last unloaded: mst_pci]
Jun 2 06:29:40 138224 kernel: [934519.921557] CPU: 84 PID: 0 Comm: swapper/84 Tainted: G OE 4.19.0-16-amd64 #1 Debian 4.19.181-1
Jun 2 06:29:40 138224 kernel: [934519.921558] Hardware name: Supermicro AS -2124BT-HNTR/H12DST-B, BIOS 1.1 01/10/2020
Jun 2 06:29:40 138224 kernel: [934519.921560] RIP: 0010:iova_magazine_free_pfns.part.13.cold.23+0x8/0xf
What should I do?