ConnectX-4 Lx hw csum failure panics on Ubuntu with kernel 5.6.18, 5.6.19, 5.7.9

I have encountered kernel panics on Ubuntu with kernels 5.6.0-1018-oem (5.6.18), 5.6.0-1020-oem (5.6.19), and 5.7.9.

System: RS500A-E10-RS12U

CPU: 1x AMD EPYC 7502

RAM: 512MB

[316294.820469] mlx5_core 0000:44:00.1 enp68s0f1: Error cqe on cqn 0x816, ci 0xc5, sqn 0x1908, opcode 0xd, syndrome 0x4, vendor syndrome 0x51

[316294.833103] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[316294.833106] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[316294.833110] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[316294.833116] 00000030: 00 00 00 00 04 00 51 04 0e 00 19 08 53 64 dc d2

[316294.833118] WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x364, len: 128

[316294.833120] 00000000: 00 53 64 0e 00 19 08 07 00 00 00 08 00 00 00 00

[316294.833121] 00000010: 00 00 00 00 c0 00 05 a0 00 00 00 00 00 42 00 a3

[316294.833123] 00000020: 8e bf 47 d7 86 14 ad f8 ef 46 08 00 45 00 12 34

[316294.833124] 00000030: 76 d8 40 00 40 06 77 97 c3 a8 4a 4a 5f 67 cc fa

[316294.833126] 00000040: 01 bb d8 2a 5c 7e 3d a0 b0 c5 3e 74 80 18 00 0b

[316294.833127] 00000050: 4c 7b 00 00 01 01 08 0a 63 59 a1 46 00 41 05 b4

[316294.833129] 00000060: 00 00 12 00 00 08 01 01 00 00 00 00 c2 c6 0b 74

[316294.833130] 00000070: 00 00 00 44 00 08 01 01 00 00 00 00 c3 09 6c fc

[316294.833144] mlx5_core 0000:44:00.1 enp68s0f1: ERR CQE on SQ: 0x1908

[316294.996328] enp68s0f1: hw csum failure

[316295.000262] skb len=1500 headroom=78 headlen=1500 tailroom=22

[316295.000262] mac=(64,14) net=(78,40) trans=118

[316295.000262] shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0))

[316295.000262] csum(0x81a5 ip_summed=2 complete_sw=0 valid=0 level=0)

[316295.000262] hash(0x322a7dd7 sw=0 l4=1) proto=0x86dd pkttype=0 iif=0

[316295.029909] dev name=enp68s0f1 feat=0x0x0010a1821fd14ba9

[316295.943994] Hardware name: ASUSTeK COMPUTER INC. RS500A-E10-RS12U/KRPA-U16 Series, BIOS 0703 03/06/2020

[316295.943995] Call Trace:

[316295.943997] <IRQ>

[316295.944002] dump_stack+0x6d/0x9a

[316295.944006] netdev_rx_csum_fault.part.0+0x41/0x45

[316295.944007] __skb_gro_checksum_complete.cold+0xb/0x10

[316295.944009] tcp6_gro_receive+0xdc/0x1c0

[316295.944010] ipv6_gro_receive+0x1dc/0x460

[316295.944012] ? kmem_cache_alloc+0x16d/0x230

[316295.944017] dev_gro_receive+0x2fb/0x690

[316295.996284] ? mlx5e_build_rx_skb+0x38c/0xb60 [mlx5_core]

[316296.010778] napi_gro_receive+0x39/0x140

[316296.010793] mlx5e_handle_rx_cqe+0xa5/0x150 [mlx5_core]

[316296.010808] mlx5e_poll_rx_cq+0x7fe/0x910 [mlx5_core]

[316296.010825] mlx5e_napi_poll+0xda/0x610 [mlx5_core]

[316296.010843] ? mlx5_eq_comp_int+0x149/0x1b0 [mlx5_core]

[316296.010850] net_rx_action+0x13a/0x370

[316296.010859] __do_softirq+0xe1/0x2d6

[316296.010862] irq_exit+0xae/0xb0

[316296.010863] do_IRQ+0x5a/0xf0

[316296.010865] common_interrupt+0xf/0xf

[316296.010866] </IRQ>

[316296.010868] RIP: 0010:cpuidle_enter_state+0xca/0x3e0

[316296.010869] Code: ff e8 aa 7d 7e ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 ea 02 00 00 31 ff e8 2d 01 85 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 3f 02 00 00 49 63 d4 4c 8b 7d d0 4c 2b 7d c8 48 8d

[316296.010870] RSP: 0018:ffff9d84002cfe38 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda

[316296.010872] RAX: ffff91110b62ce00 RBX: ffff9110ac1d1c00 RCX: 000000000000001f

[316296.010872] RDX: 0000000000000000 RSI: 00000000334bfb91 RDI: 0000000000000000

[316296.010873] RBP: ffff9d84002cfe78 R08: 00011fab2ae67109 R09: 00011faebfd6b300

[316296.010873] R10: ffff91110b62bac4 R11: ffff91110b62baa4 R12: 0000000000000002

[316296.010874] R13: ffffffff8f978700 R14: 0000000000000002 R15: ffff9110ac1d1c00

[316296.010876] ? cpuidle_enter_state+0xa6/0x3e0

[316296.010878] cpuidle_enter+0x2e/0x40

[316296.010880] call_cpuidle+0x23/0x40

[316296.010881] do_idle+0x1e7/0x280

[316296.010882] cpu_startup_entry+0x20/0x30

[316296.010885] start_secondary+0x167/0x1c0

[316296.010886] secondary_startup_64+0xa4/0xb0

lspci -v -s 0000:44:00.1

44:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Subsystem: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Flags: bus master, fast devsel, latency 0, IRQ 254, NUMA node 0

Memory at b0000000 (64-bit, prefetchable) [size=32M]

Expansion ROM at b5300000 [disabled] [size=1M]

Capabilities: [60] Express Endpoint, MSI 00

Capabilities: [48] Vital Product Data

Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-

Capabilities: [c0] Vendor Specific Information: Len=18 <?>

Capabilities: [40] Power Management version 3

Capabilities: [100] Advanced Error Reporting

Capabilities: [150] Alternative Routing-ID Interpretation (ARI)

Capabilities: [180] Single Root I/O Virtualization (SR-IOV)

Capabilities: [230] Access Control Services

Kernel driver in use: mlx5_core

Kernel modules: mlx5_core
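For anyone reading the skb dump in the log above: the `csum(...)` line shows `ip_summed=2`, which is `CHECKSUM_COMPLETE` in the kernel's `include/linux/skbuff.h`, i.e. the driver handed the stack a checksum computed by the NIC. Here is a minimal sketch for decoding that line. The function name `parse_csum_line` is my own and purely illustrative; only the numeric-to-name mapping comes from the kernel headers.

```python
import re

# Values of skb->ip_summed as defined in include/linux/skbuff.h
IP_SUMMED = {
    0: "CHECKSUM_NONE",
    1: "CHECKSUM_UNNECESSARY",
    2: "CHECKSUM_COMPLETE",
    3: "CHECKSUM_PARTIAL",
}

def parse_csum_line(line):
    """Parse the 'csum(...)' line of an skb dump into a dict (hypothetical helper)."""
    m = re.search(r"csum\((0x[0-9a-fA-F]+)\s+([^)]*)\)", line)
    if m is None:
        raise ValueError("no csum(...) field found")
    fields = {"csum": int(m.group(1), 16)}
    # The remaining fields are space-separated key=value pairs with integer values
    for pair in m.group(2).split():
        key, value = pair.split("=", 1)
        fields[key] = int(value)
    fields["ip_summed_name"] = IP_SUMMED.get(fields.get("ip_summed"), "unknown")
    return fields

# Line copied from the log in this post
line = "[316295.000262] csum(0x81a5 ip_summed=2 complete_sw=0 valid=0 level=0)"
info = parse_csum_line(line)
print(info["ip_summed_name"], hex(info["csum"]))
```

With `complete_sw=0` the checksum value came from the hardware rather than from a software recomputation, which fits the "hw csum failure" message and the `__skb_gro_checksum_complete` frame in the trace.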

Hi Martin,

Are you using the latest MLNX_OFED driver?

https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed

If not, can you give it a try?

Thanks,

Samer

We have now tested with v5.0-2.1.8 OFED on the stock 5.4.0-40-generic kernel. The error occurred again in the same way, so neither the OFED driver nor the stock kernel makes a difference here.

I have filed the bug on Ubuntu as:

https://bugs.launchpad.net/ubuntu/+source/linux-oem-5.6/+bug/1887723

A more complete syslog output is there:

https://bugs.launchpad.net/ubuntu/+source/linux-oem-5.6/+bug/1887723/+attachment/5392989/+files/syslog.txt

Hi Martin,

Once you get an update from Ubuntu, if you see that the issue is related to Mellanox, please open a support ticket with support@mellanox.com.

Thanks,

Samer

I can confirm that we can no longer reproduce this bug when booting with the “iommu=pt” kernel option. We have x2APIC enabled.
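To verify that a machine actually booted with the option, the running kernel's command line can be read from `/proc/cmdline`. A minimal sketch follows. The helper names `iommu_setting` and `read_cmdline` and the example command line are made up for illustration and are not taken from the affected server.

```python
def iommu_setting(cmdline):
    """Return the value of the last 'iommu=' parameter on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        # Match only the generic 'iommu=' option, not 'amd_iommu=' or 'intel_iommu='
        if token.startswith("iommu="):
            value = token.split("=", 1)[1]
    return value

def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()

# Hypothetical command line, used here instead of the real /proc/cmdline
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro iommu=pt quiet"
print(iommu_setting(example))
```

On a real host, `iommu_setting(read_cmdline())` should return `pt` once the option is active. On Ubuntu the option is usually added to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `update-grub` and a reboot.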