MT28908 Family [ConnectX-6] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x63, len: 128

Hello! I have a problem. Today my server with a Mellanox 100G card stopped working.

The messages log contains many entries like these:

Jun 2 06:29:39 138224 kernel: [934519.424191] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.424195] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.424198] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.424199] 00000030: 00 00 00 00 04 00 51 04 0a 00 02 83 b8 63 dc d2

Jun 2 06:29:39 138224 kernel: [934519.424202] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x63, len: 128

Jun 2 06:29:39 138224 kernel: [934519.424203] 00000000: 00 b8 63 0a 00 02 83 05 00 00 00 08 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.424205] 00000010: 00 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.424206] 00000020: 00 00 00 42 00 00 22 00 00 00 00 00 ee 84 50 fe

Jun 2 06:29:39 138224 kernel: [934519.424207] 00000030: 00 00 02 f8 00 00 22 00 00 00 00 00 f4 41 2d 08

Jun 2 06:29:39 138224 kernel: [934519.424208] 00000040: 00 00 02 74 00 00 22 00 00 00 00 00 ef 67 10 00

Jun 2 06:29:39 138224 kernel: [934519.424211] 00000050: e7 d4 00 00 01 01 08 0a 66 55 ef f7 22 a1 a2 2b

Jun 2 06:29:39 138224 kernel: [934519.424214] 00000060: 00 00 0b 50 00 00 22 00 00 00 00 00 f3 4a d0 b5

Jun 2 06:29:39 138224 kernel: [934519.424217] 00000070: ea 5f 40 00 40 06 15 d3 32 07 ee 1a 05 3b 04 25

Jun 2 06:29:39 138224 kernel: [934519.506293] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.506295] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.506300] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.506302] 00000030: 00 00 00 00 30 10 68 02 29 00 02 83 00 6d 2b d2

Jun 2 06:29:39 138224 kernel: [934519.506306] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x6d, len: 64

Jun 2 06:29:39 138224 kernel: [934519.506307] 00000000: 00 00 6d 29 00 02 83 02 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.506308] 00000010: 00 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.506311] 00000020: c7 7d d9 96 0c 42 a1 0a 30 92 08 00 45 00 0a bc

Jun 2 06:29:39 138224 kernel: [934519.506314] 00000030: 28 a9 40 00 40 06 a1 50 32 07 ee 1a b0 3b 95 e5

Jun 2 06:29:39 138224 kernel: [934519.562513] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.562515] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.562517] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.562524] 00000030: 00 00 00 00 30 10 68 02 29 00 02 83 00 00 5d d2

Jun 2 06:29:39 138224 kernel: [934519.562528] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x0, len: 64

Jun 2 06:29:39 138224 kernel: [934519.562532] 00000000: 00 00 00 29 00 02 83 02 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.562533] 00000010: 00 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.562535] 00000020: c7 7d d9 96 0c 42 a1 0a 30 92 08 00 45 00 0b 84

Jun 2 06:29:39 138224 kernel: [934519.562536] 00000030: d1 d5 40 00 40 06 8b b3 32 07 ee 1a 4d de 63 eb

Jun 2 06:29:39 138224 kernel: [934519.630932] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.630934] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.630935] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.630939] 00000030: 00 00 00 00 30 10 68 02 29 00 02 83 00 00 54 d2

Jun 2 06:29:39 138224 kernel: [934519.630945] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x0, len: 64

Jun 2 06:29:39 138224 kernel: [934519.630951] 00000000: 00 00 00 29 00 02 83 02 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.630956] 00000010: 00 00 00 00 c0 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.630957] 00000020: c7 7d d9 96 0c 42 a1 0a 30 92 08 00 45 00 0a bc

Jun 2 06:29:39 138224 kernel: [934519.630958] 00000030: dd 29 40 00 40 06 ea 4a 32 07 ee 1a 55 73 f3 32

Jun 2 06:29:39 138224 kernel: [934519.659629] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.659631] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.659633] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jun 2 06:29:39 138224 kernel: [934519.659635] 00000030: 00 00 00 00 30 10 68 02 29 00 02 83 00 00 ae d2

Jun 2 06:29:39 138224 kernel: [934519.659637] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x0, len: 64

After that, the following appears:

Jun 2 06:29:40 138224 kernel: [934519.921473] ------------[ cut here ]------------

Jun 2 06:29:40 138224 kernel: [934519.921495] WARNING: CPU: 84 PID: 0 at drivers/iommu/iova.c:817 iova_magazine_free_pfns.part.13.cold.23+0x8/0xf

Jun 2 06:29:40 138224 kernel: [934519.921496] Modules linked in: fuse btrfs zstd_compress zstd_decompress xxhash ufs qnx4 hfsplus hfs minix vfat msdos fat jfs xfs dm_mod binfmt_misc msr mst_pciconf(OE) amd64_edac_mod edac_mce_amd kvm_amd kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr aufs(OE) ast ttm drm_kms_helper drm joydev i2c_algo_bit evdev sg ccp rng_core sp5100_tco ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq acpi_cpufreq button tcp_bbr sch_fq bonding lp parport loop ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear hid_generic usbhid hid raid1 md_mod sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ahci xhci_pci libahci

Jun 2 06:29:40 138224 kernel: [934519.921547] xhci_hcd libata mlx5_core(OE) nvme mlxfw(OE) nvme_core scsi_mod usbcore mlx_compat(OE) devlink i2c_piix4 usb_common [last unloaded: mst_pci]

Jun 2 06:29:40 138224 kernel: [934519.921557] CPU: 84 PID: 0 Comm: swapper/84 Tainted: G OE 4.19.0-16-amd64 #1 Debian 4.19.181-1

Jun 2 06:29:40 138224 kernel: [934519.921558] Hardware name: Supermicro AS -2124BT-HNTR/H12DST-B, BIOS 1.1 01/10/2020

Jun 2 06:29:40 138224 kernel: [934519.921560] RIP: 0010:iova_magazine_free_pfns.part.13.cold.23+0x8/0xf

What should I do?

It is difficult to conclude that the failure happens in the ConnectX-6 code. The WQE DUMP lines are informational messages. Check the log/dmesg and see whether there are any errors related to the mlx5 driver.

The warning itself comes from "drivers/iommu/iova.c:817", which is not Mellanox code.

Make sure you are using the latest Mellanox OFED GA release (v5.3) and the latest firmware. On an AMD platform, if the issue is reproducible, check the tuning guide, including the GRUB configuration: https://www.amd.com/system/files/TechDocs/56224.pdf
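As a rough way to do the log check suggested above, the sketch below drops the WQE DUMP headers and their hex rows from a log file and keeps only the lines that mention mlx5. The function name `mlx5_log_errors` is made up for illustration and is not part of any Mellanox tooling, and the hex-row pattern is inferred from the dump format shown earlier in this thread.

```shell
#!/bin/bash
# Hypothetical helper: print mlx5-related kernel messages from a log file,
# skipping WQE DUMP headers and the 16-byte hex rows that follow them.
# Usage (log path is an assumption): mlx5_log_errors /var/log/messages
mlx5_log_errors() {
    local logfile="$1"
    # First grep removes dump noise, second keeps only mlx5 mentions.
    grep -Ev '[0-9a-f]{8}: ([0-9a-f]{2} ){15}[0-9a-f]{2}$|WQE DUMP' "$logfile" \
        | grep -i 'mlx5'
}
```

The same filter can be fed from a saved `dmesg` output. Whatever remains after filtering is what is worth reading for real driver errors.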

Alexey, thank you for your reply! I read the recommendations and extended my startup script a little. It now looks like this:

#!/bin/bash

set_irq_affinity.sh eth0

mlnx_tune -p HIGH_THROUGHPUT

tuned-adm profile throughput-performance

ethtool -C eth0 adaptive-rx off adaptive-tx off

ethtool -K eth0 lro on

ethtool -G eth0 tx 8192 rx 8192

ethtool -C eth0 rx-usecs 0 rx-frames 10 tx-usecs 16 tx-frames 100

echo "mq-deadline" > /sys/block/nvme1n1/queue/scheduler

echo "mq-deadline" > /sys/block/nvme0n1/queue/scheduler

echo "mq-deadline" > /sys/block/nvme2n1/queue/scheduler

echo "mq-deadline" > /sys/block/nvme3n1/queue/scheduler

echo "mq-deadline" > /sys/block/nvme4n1/queue/scheduler

echo "mq-deadline" > /sys/block/nvme5n1/queue/scheduler
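The six scheduler lines at the end of the script could be collapsed into a loop. This is only a sketch: `set_nvme_scheduler` is a hypothetical name, and the optional second argument (an alternative to /sys/block) exists only so the function can be tried against a scratch directory. Note that the glob matches every nvme*n1 device, which may be more than the six listed in the script.

```shell
#!/bin/bash
# Hypothetical helper: write the given I/O scheduler name to every
# nvme*n1 block device found under the sysfs block directory.
# Usage: set_nvme_scheduler mq-deadline
set_nvme_scheduler() {
    local sched="$1" sysblock="${2:-/sys/block}" dev
    for dev in "$sysblock"/nvme*n1; do
        # Skip if the glob matched nothing or the device has no scheduler file.
        [ -e "$dev/queue/scheduler" ] || continue
        echo "$sched" > "$dev/queue/scheduler"
    done
}
```

Writing to the real sysfs files requires root, as in the original script.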

Spreading the interrupts across the NIC's NUMA node makes things worse, because the node has 4 cores and there are 64 queues. All 4 cores of NUMA node4 immediately become 100% loaded with IRQs, and it only gets worse.

# lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

Address sizes: 43 bits physical, 48 bits virtual

CPU(s): 128

On-line CPU(s) list: 0-127

Thread(s) per core: 1

Core(s) per socket: 64

Socket(s): 2

NUMA node(s): 32

Vendor ID: AuthenticAMD

CPU family: 23

Model: 49

Model name: AMD EPYC 7742 64-Core Processor

Stepping: 0

CPU MHz: 3249.788

CPU max MHz: 2250,0000

CPU min MHz: 1500,0000

BogoMIPS: 4500.17

Virtualization: AMD-V

L1d cache: 32K

L1i cache: 32K

L2 cache: 512K

L3 cache: 16384K

NUMA node0 CPU(s): 0-3

NUMA node1 CPU(s): 4-7

NUMA node2 CPU(s): 8-11

NUMA node3 CPU(s): 12-15

NUMA node4 CPU(s): 16-19

NUMA node5 CPU(s): 20-23

NUMA node6 CPU(s): 24-27

NUMA node7 CPU(s): 28-31

NUMA node8 CPU(s): 32-35

NUMA node9 CPU(s): 36-39

NUMA node10 CPU(s): 40-43

NUMA node11 CPU(s): 44-47

NUMA node12 CPU(s): 48-51

NUMA node13 CPU(s): 52-55

NUMA node14 CPU(s): 56-59

NUMA node15 CPU(s): 60-63

NUMA node16 CPU(s): 64-67

NUMA node17 CPU(s): 68-71

NUMA node18 CPU(s): 72-75

NUMA node19 CPU(s): 76-79

NUMA node20 CPU(s): 80-83

NUMA node21 CPU(s): 84-87

NUMA node22 CPU(s): 88-91

NUMA node23 CPU(s): 92-95

NUMA node24 CPU(s): 96-99

NUMA node25 CPU(s): 100-103

NUMA node26 CPU(s): 104-107

NUMA node27 CPU(s): 108-111

NUMA node28 CPU(s): 112-115

NUMA node29 CPU(s): 116-119

NUMA node30 CPU(s): 120-123

NUMA node31 CPU(s): 124-127

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

cat /sys/class/net/eth0/device/numa_node

4
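The two lookups above (the NIC's NUMA node, then that node's CPU list in lscpu) can be combined into one step. The sketch below reads the same sysfs files. `nic_local_cpus` is a hypothetical name, and the optional second argument (an alternative sysfs root) is only there for trying it outside a live system.

```shell
#!/bin/bash
# Hypothetical helper: print the CPU list of the NUMA node that a
# network interface is attached to.
# Usage: nic_local_cpus eth0
nic_local_cpus() {
    local iface="$1" sysroot="${2:-/sys}" node
    node=$(cat "$sysroot/class/net/$iface/device/numa_node") || return 1
    # A value of -1 means no NUMA node is reported for the device.
    [ "$node" -ge 0 ] || return 1
    cat "$sysroot/devices/system/node/node$node/cpulist"
}
```

With the values posted here (numa_node 4, and NUMA node4 CPU(s): 16-19 in lscpu), the expected result for eth0 would be 16-19, that is, only four local cores for the NIC's 64 queues.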

If I simply run set_irq_affinity.sh eth0,

IRQ load hits 100% on CPU000.

If I run set_irq_affinity_cpulist.sh 64-127 eth0,

IRQ load hits 100% on CPU068 (see the attachment).

If I run set_irq_affinity_cpulist.sh 8-71 eth0,

IRQ load hits 100% on CPU018.

The firmware and driver are the latest versions:
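To see which CPUs actually receive the NIC's interrupts after each of these affinity settings, the per-CPU counters in /proc/interrupts can be summed for the rows that match the driver name. This is a sketch only: `irq_load_per_cpu` is a hypothetical name, and matching on "mlx5" assumes the NIC's interrupt rows carry that string in their description column.

```shell
#!/bin/bash
# Hypothetical helper: sum the interrupt counters per CPU for all rows of
# /proc/interrupts that match a pattern, and print the non-zero totals.
# Usage: irq_load_per_cpu mlx5
irq_load_per_cpu() {
    local pattern="$1" file="${2:-/proc/interrupts}"
    awk -v pat="$pattern" '
        # Header row: one column label (CPU0, CPU1, ...) per online CPU.
        NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
        # Data rows: field 1 is the IRQ number, fields 2..ncpu+1 are counts.
        $0 ~ pat { for (i = 1; i <= ncpu; i++) sum[i] += $(i + 1) }
        END {
            for (i = 1; i <= ncpu; i++)
                if (sum[i] > 0) printf "%s %.0f\n", name[i], sum[i]
        }
    ' "$file"
}
```

The counters are cumulative since boot, so running it twice a few seconds apart and comparing the totals gives a better picture of the current distribution than a single run.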

ethtool -i eth0

driver: mlx5_core

version: 5.3-1.0.0

firmware-version: 20.30.1004 (MT_0000000225)

expansion-rom-version:

bus-info: 0000:41:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: no

supports-register-dump: no

supports-priv-flags: yes

Even with all these optimizations the problem does not go away: at some point one core reaches 100% load and the traffic drops.

Why is the IRQ distribution across the cores so uneven?