[arm64 server] Frequent "hw csum failure" messages in dmesg and on the console (mlx5/Mellanox)

Issue

  • "hw csum failure" messages appear repeatedly in dmesg and on the console (mlx5/Mellanox NICs)

I tried several Red Hat kernel versions, and the problem persisted on all of them.

The following versions were tested:
RHEL 8.6, RHEL 8.9, RHEL 8.10,
RHEL 9.0, RHEL 9.3, RHEL 9.4

lspci & driver info:

[root@node-1 ~]# lspci -v| grep Mellanox
0000:01:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
0000:02:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
0000:02:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
[root@node-1 ~]# lspci -knn -s 0000:01:00.0
0000:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
        Subsystem: New H3C Technologies Co., Ltd. 620F-B [193d:100a]
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core
[root@node-1 ~]#

[root@node-1 ~]# ethtool -i enp2s0f0
driver: mlx5_core
version: 4.18.0-372.19.1.es8_12_atomic.a
firmware-version: 14.32.1010 (H3C0010110034)
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[root@node-1 ~]#

dmesg log:

[root@node-1 ~]# dmesg  | grep -i "hw csum"
[ 4487.138399] enp2s0f0: hw csum failure
[ 4487.377442] br-storagepub: hw csum failure
[ 5642.879109] br-storagepub: hw csum failure
[14223.286729] br-storage: hw csum failure
[18219.655892] br-storage: hw csum failure
[18295.442785] br-storagepub: hw csum failure
[20296.388030] enp2s0f0: hw csum failure
[20296.678840] br-storagepub: hw csum failure
[21396.368058] ens3f1: hw csum failure

Full output for one occurrence (ens3f1):

[12924.138370] ens3f1: hw csum failure
[12924.141920] skb len=80 headroom=80 headlen=80 tailroom=32
               mac=(66,14) net=(80,20) trans=100
               shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0))
               csum(0xb9f4 ip_summed=2 complete_sw=0 valid=0 level=0)
               hash(0x69762a92 sw=0 l4=1) proto=0x0800 pkttype=3 iif=0
[12924.170687] dev name=ens3f1 feat=0x0x10a1800214514ba9
[12924.175806] skb headroom: 00000000: 58 2c 13 33 b1 6c 9a aa f8 60 5d f4 46 1f d0 f7
[12924.183575] skb headroom: 00000010: 92 50 0d 1e df e1 08 9a a4 7f 6f d8 16 02 25 99
[12924.191340] skb headroom: 00000020: 18 76 5e 38 ae 8f 4b 0e 7b 44 52 9e d7 b2 7f 84
[12924.199108] skb headroom: 00000030: 8e ae fc 3f 04 8b 32 46 27 50 8a ec 65 47 62 0c
[12924.206873] skb headroom: 00000040: 00 00 86 8d da 70 33 4c ea c5 e7 40 f1 44 08 00
[12924.214633] skb linear:   00000000: 45 00 00 50 ad 9a 40 00 40 06 e9 b8 29 a8 28 02
[12924.222398] skb linear:   00000010: 29 a8 28 03 27 13 a5 36 61 75 7b dd 88 d5 4b b1
[12924.230162] skb linear:   00000020: f0 10 14 44 8d c0 00 00 01 01 08 0a 70 fe 43 80
[12924.237928] skb linear:   00000030: 89 07 6d 4c 01 01 05 1a 88 d6 44 91 88 d6 55 89
[12924.245687] skb linear:   00000040: 88 d6 33 99 88 d6 3e e9 88 d6 22 a1 88 d6 2d f1
[12924.253453] skb tailroom: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[12924.261218] skb tailroom: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[12924.268987] CPU: 21 PID: 13231 Comm: etcd Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-372.19.1.el8_6.aarch64 #1
[12924.281602] Hardware name: New H3C Technologies Co., Ltd. H3C UniServer R4970 G7/H3C UniServer R4970 G7, BIOS 7.60.09P91 08/09/2024 11:32
[12924.294125] Call trace:
[12924.296593]  dump_backtrace+0x0/0x158
[12924.300305]  show_stack+0x24/0x30
[12924.303655]  dump_stack+0x5c/0x74
[12924.307012]  netdev_rx_csum_fault.part.30+0x50/0x5c
[12924.311954]  __skb_gro_checksum_complete+0xc0/0xc8
[12924.316807]  tcp4_gro_receive+0xcc/0x1b0
[12924.320781]  inet_gro_receive+0x2a0/0x2f8
[12924.324838]  dev_gro_receive+0x250/0x750
[12924.328809]  napi_gro_receive+0x54/0x1d0
[12924.332779]  mlx5e_handle_rx_cqe+0x3b0/0x848 [mlx5_core]
[12924.338265]  mlx5e_poll_rx_cq+0xe4/0x8e0 [mlx5_core]
[12924.343382]  mlx5e_napi_poll+0x128/0x750 [mlx5_core]
[12924.348493]  __napi_poll+0x44/0x1b8
[12924.352022]  net_rx_action+0x2b4/0x330
[12924.355815]  __do_softirq+0x118/0x320
[12924.359520]  irq_exit_rcu+0x10c/0x120
[12924.363227]  irq_exit+0x14/0x20
[12924.366401]  __handle_domain_irq+0x70/0xc0
[12924.370551]  gic_handle_irq+0xd4/0x178
[12924.374344]  el0_irq_naked+0x50/0x58
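
The csum(...) line in the dump above can be decoded mechanically. The following is a minimal sketch, not part of any official tooling: the helper name decode_csum_line and the IP_SUMMED table are made up for this example, while the numeric values follow the CHECKSUM_* definitions in the kernel's include/linux/skbuff.h.

```python
import re

# ip_summed values as defined in the Linux kernel (include/linux/skbuff.h).
IP_SUMMED = {
    0: "CHECKSUM_NONE",
    1: "CHECKSUM_UNNECESSARY",
    2: "CHECKSUM_COMPLETE",
    3: "CHECKSUM_PARTIAL",
}

# Matches the 'csum(...)' line as it appears in the dump in this post.
CSUM_RE = re.compile(
    r"csum\((0x[0-9a-fA-F]+) ip_summed=(\d+) complete_sw=(\d+) "
    r"valid=(\d+) level=(\d+)\)"
)


def decode_csum_line(line):
    """Parse one 'csum(...)' line into a dict (hypothetical helper)."""
    m = CSUM_RE.search(line)
    if m is None:
        raise ValueError("not a csum(...) line")
    csum, ip_summed, complete_sw, valid, level = m.groups()
    return {
        "csum": int(csum, 16),
        "ip_summed": IP_SUMMED.get(int(ip_summed), "unknown"),
        "complete_sw": bool(int(complete_sw)),
        "valid": bool(int(valid)),
        "level": int(level),
    }


info = decode_csum_line(
    "csum(0xb9f4 ip_summed=2 complete_sw=0 valid=0 level=0)"
)
print(info)
```

Here ip_summed=2 decodes to CHECKSUM_COMPLETE, meaning the driver handed the stack a checksum computed over the packet, and complete_sw=0 indicates that this value was not computed in software. That matches the call trace, where the failure is raised from __skb_gro_checksum_complete while validating the packet against that value.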

My understanding is that "hw csum failure" means the checksum supplied by the NIC hardware did not match the one the kernel computed for the packet. Is that correct?

  1. Does this error indicate a hardware problem?
  2. What is the impact of this error? Does it mean that network packets are being corrupted?
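
One way to approach the second question is to check whether the dumped packet is itself consistent. The sketch below takes the 80 "skb linear" bytes from the dump (IPv4 header plus TCP segment) and verifies the IPv4 header checksum and the TCP checksum in software, using the standard one's-complement sum from RFC 1071. The function names are made up for this example, and it assumes the dump shows the complete packet (skb len=80, nr_frags=0).

```python
import struct

# "skb linear" bytes copied from the dmesg dump in this post.
SKB_LINEAR_HEX = """
45 00 00 50 ad 9a 40 00 40 06 e9 b8 29 a8 28 02
29 a8 28 03 27 13 a5 36 61 75 7b dd 88 d5 4b b1
f0 10 14 44 8d c0 00 00 01 01 08 0a 70 fe 43 80
89 07 6d 4c 01 01 05 1a 88 d6 44 91 88 d6 55 89
88 d6 33 99 88 d6 3e e9 88 d6 22 a1 88 d6 2d f1
"""
SKB_LINEAR = bytes.fromhex("".join(SKB_LINEAR_HEX.split()))


def ones_complement_sum(data):
    """16-bit one's-complement sum (RFC 1071), folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def verify_ipv4_tcp(packet):
    """Return (ip_header_ok, tcp_ok) for a raw IPv4/TCP packet."""
    ihl = (packet[0] & 0x0F) * 4                   # IPv4 header length
    total_len = struct.unpack("!H", packet[2:4])[0]
    # A header that includes a correct checksum field sums to 0xFFFF.
    ip_ok = ones_complement_sum(packet[:ihl]) == 0xFFFF
    segment = packet[ihl:total_len]
    # TCP pseudo-header: src IP, dst IP, zero, protocol, TCP length.
    pseudo = packet[12:20] + struct.pack("!BBH", 0, packet[9], len(segment))
    tcp_ok = ones_complement_sum(pseudo + segment) == 0xFFFF
    return ip_ok, tcp_ok


ip_ok, tcp_ok = verify_ipv4_tcp(SKB_LINEAR)
print("IPv4 header checksum valid:", ip_ok)
print("TCP checksum valid:", tcp_ok)
```

For the bytes in this dump, both checks come out valid. That suggests this particular packet reached the host intact and that the mismatch lies in the checksum value reported by the hardware rather than in the packet data. It covers a single packet only and does not by itself identify the root cause.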