CX3 SR-IOV: TX timeout on queue: 4

I’m running an MCX354A-FCBT on an Epyc 7502 in SR-IOV mode, and the one guest that uses a VF frequently suffers network glitches where connections hang. When these glitches occur, I see console log entries like the following:

[77070.892171] mlx4_en: eth0: TX timeout on queue: 4, QP: 0xb94, CQ: 0xd6, Cons: 0xffffffff, Prod: 0x3d8

[79094.819061] mlx4_en: eth0: TX timeout on queue: 4, QP: 0xb94, CQ: 0xd6, Cons: 0xffffffff, Prod: 0x3d7

Every occurrence has been on queue 4. This card has been in service for quite some time with no issues. I only recently enabled SR-IOV and started using VFs, and the only problems I’ve encountered have been in the one guest that is using a VF.
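For anyone who wants to confirm that the timeouts really do cluster on one queue, here is a rough Python sketch that tallies them per interface and queue. It assumes the log lines have exactly the format shown above; the function name count_tx_timeouts is made up for illustration.

```python
import re
from collections import Counter

# Matches the mlx4_en timeout line in the format seen in the excerpts above.
TX_TIMEOUT_RE = re.compile(
    r"mlx4_en: (?P<iface>\S+): TX timeout on queue: (?P<queue>\d+), "
    r"QP: (?P<qp>0x[0-9a-fA-F]+), CQ: (?P<cq>0x[0-9a-fA-F]+), "
    r"Cons: (?P<cons>0x[0-9a-fA-F]+), Prod: (?P<prod>0x[0-9a-fA-F]+)"
)


def count_tx_timeouts(lines):
    """Return a Counter mapping (interface, queue) to the number of timeouts."""
    counts = Counter()
    for line in lines:
        match = TX_TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("iface"), int(match.group("queue")))] += 1
    return counts


# The two entries quoted above, used as sample input.
sample = [
    "[77070.892171] mlx4_en: eth0: TX timeout on queue: 4, QP: 0xb94, "
    "CQ: 0xd6, Cons: 0xffffffff, Prod: 0x3d8",
    "[79094.819061] mlx4_en: eth0: TX timeout on queue: 4, QP: 0xb94, "
    "CQ: 0xd6, Cons: 0xffffffff, Prod: 0x3d7",
]

for (iface, queue), n in sorted(count_tx_timeouts(sample).items()):
    print(f"{iface} queue {queue}: {n} timeout(s)")
```

The same function can be fed the lines of a saved dmesg dump instead of the inline sample.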

Could this be a bad card? Or a driver bug?

Hi Larkin,

What is the driver version installed on the baremetal & affected guest?

What is the firmware version on the baremetal?

Please make sure to use the latest MLNX_OFED and firmware.

MLNX_OFED 5.0-2.1.8.0:

https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed

Latest firmware for MCX354A-FCBT is v2.42.5000:

https://www.mellanox.com/support/firmware/connectx3ib

Regards,

Chen

I swapped out the card for another of the identical model. It worked error-free for almost 2 weeks. This card is now failing on queue 2.

baremetal driver: should be OFED, mlx4_core: Mellanox ConnectX core driver v5.0-2.1.8

guest driver: should be inbox, mlx4_core: Mellanox ConnectX core driver v4.0-0

baremetal firmware: 2.42.5000

I have not been able to install the OFED driver on the guest, since the guest is running kernel 5.6.13 and the OFED download doesn’t seem to support that.

More detail on the failure:

[98130.587234] ------------[ cut here ]------------

[98130.587246] NETDEV WATCHDOG: enp1s0 (mlx4_core): transmit queue 2 timed out

[98130.587272] WARNING: CPU: 4 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x25c/0x270

[98130.587274] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set rfkill nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter sunrpc snd_hda_codec_generic iTCO_wdt crct10dif_pclmul ledtrig_audio iTCO_vendor_support crc32_pclmul snd_hda_intel snd_intel_dspcfg snd_hda_codec ghash_clmulni_intel snd_hda_core snd_hwdep snd_pcm snd_timer i2c_i801 joydev snd lpc_ich virtio_balloon soundcore ip_tables xfs libcrc32c mlx4_en qxl drm_ttm_helper ttm crc32c_intel drm_kms_helper serio_raw drm mlx4_core virtio_console aacraid virtio_scsi qemu_fw_cfg

[98130.587304] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.6.13-300.fc32.x86_64 #1

[98130.587305] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014

[98130.587307] RIP: 0010:dev_watchdog+0x25c/0x270

[98130.587308] Code: 85 c0 75 e5 eb 9a 4c 89 f7 c6 05 ee b1 f3 00 01 e8 f9 36 fb ff 44 89 e9 4c 89 f6 48 c7 c7 e0 76 42 b2 48 89 c2 e8 0b 4c 80 ff <0f> 0b e9 78 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f

[98130.587309] RSP: 0018:ffffb3b600158e60 EFLAGS: 00010286

[98130.587310] RAX: 000000000000003f RBX: ffff99362af13ec0 RCX: 000000000000083f

[98130.587311] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f

[98130.587311] RBP: ffff99362b2c03dc R08: 0000000000000414 R09: 0000000000000003

[98130.587312] R10: 0000000000000000 R11: 0000000000000001 R12: ffff99362b2c0480

[98130.587312] R13: 0000000000000002 R14: ffff99362b2c0000 R15: ffff99362af13f40

[98130.587315] FS: 0000000000000000(0000) GS:ffff993637d00000(0000) knlGS:0000000000000000

[98130.587315] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[98130.587316] CR2: 0000000002a68000 CR3: 0000000272bc8000 CR4: 0000000000340ee0

[98130.587318] Call Trace:

[98130.587332] <IRQ>

[98130.587337] ? pfifo_fast_enqueue+0x150/0x150

[98130.587340] call_timer_fn+0x2d/0x130

[98130.587342] __run_timers.part.0+0x167/0x240

[98130.587344] ? tick_sched_handle+0x22/0x60

[98130.587345] ? tick_sched_timer+0x38/0x80

[98130.587346] ? tick_sched_do_timer+0x70/0x70

[98130.587347] ? __hrtimer_run_queues+0x128/0x280

[98130.587348] run_timer_softirq+0x26/0x50

[98130.587350] __do_softirq+0xe9/0x2dc

[98130.587354] irq_exit+0xcf/0x110

[98130.587355] smp_apic_timer_interrupt+0x78/0x130

[98130.587357] apic_timer_interrupt+0xf/0x20

[98130.587358] </IRQ>

[98130.587359] RIP: 0010:native_safe_halt+0xe/0x10

[98130.587360] Code: 02 20 48 8b 00 a8 08 75 c4 e9 7b ff ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d c6 52 5d 00 fb f4 90 e9 07 00 00 00 0f 00 2d b6 52 5d 00 f4 c3 cc cc 0f 1f 44 00

[98130.587360] RSP: 0018:ffffb3b600087ee8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13

[98130.587361] RAX: ffffffffb1a332d0 RBX: ffff993636e80000 RCX: 0000000000000001

[98130.587361] RDX: 0000000000000004 RSI: 0000000000000087 RDI: 0000000000000004

[98130.587362] RBP: 0000000000000004 R08: 0000596d2341e0f8 R09: 0000000000000206

[98130.587362] R10: 00000000000003cf R11: 0000000000000004 R12: 0000000000000000

[98130.587363] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

[98130.587364] ? __sched_text_end+0x1/0x1

[98130.587368] default_idle+0x1a/0x140

[98130.587370] do_idle+0x1cb/0x240

[98130.587371] cpu_startup_entry+0x19/0x20

[98130.587374] secondary_startup_64+0xb6/0xc0

[98130.587375] ---[ end trace 4e783c315ea412f8 ]---

[98130.587381] mlx4_en: enp1s0: TX timeout on queue: 2, QP: 0xada, CQ: 0x10d, Cons: 0x2ef1c877, Prod: 0x2ef1cc50

[98130.624760] mlx4_en: enp1s0: Steering Mode 2

[98161.818934] mlx4_en: enp1s0: TX timeout on queue: 2, QP: 0xada, CQ: 0x10d, Cons: 0xffffffff, Prod: 0x1

[98161.856471] mlx4_en: enp1s0: Steering Mode 2

[98193.562584] mlx4_en: enp1s0: TX timeout on queue: 2, QP: 0xada, CQ: 0x10d, Cons: 0xffffffff, Prod: 0x1

[98161.856471] mlx4_en: enp1s0: Steering Mode 2
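The Cons and Prod values in the timeout lines can be compared with a little arithmetic. The sketch below assumes both are free-running 32-bit counters that wrap around, which is my reading and not something the log itself states; the helper name in_flight is made up. I also don't know whether 0xffffffff is a placeholder the driver writes when a ring is reset, so the gap for those lines may not mean anything.

```python
def in_flight(prod, cons):
    """Distance from cons to prod, treating both as wrapping 32-bit counters."""
    return (prod - cons) & 0xFFFFFFFF


# First timeout on queue 2 (timestamp 98130.587381).
print(in_flight(0x2EF1CC50, 0x2EF1C877))

# Later timeouts on queue 2, where Cons reads 0xffffffff.
print(in_flight(0x1, 0xFFFFFFFF))
```

Under that assumption, the first timeout shows several hundred descriptors posted but not yet completed, which would fit a transmit ring that stopped seeing completions.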

The bad queue has an odd interrupt distribution. The interrupt for mlx4-3@0000:01:00.0 is serviced by only a single core (the third count column below), and that core handles none of the other mlx4 interrupts, which are spread across the remaining cores.

48: 130563 87012 0 98399 122422 98439 113135 160861 PCI-MSI 524288-edge mlx4-async@pci:0000:01:00.0

49: 25974603 10358768 0 15163597 13956099 12813314 19764044 19329656 PCI-MSI 524289-edge mlx4-1@0000:01:00.0

50: 16578157 47293793 0 16976161 12804070 21937404 17343603 18904029 PCI-MSI 524290-edge mlx4-2@0000:01:00.0

51: 0 0 2799451354 0 0 0 0 0 PCI-MSI 524291-edge mlx4-3@0000:01:00.0

52: 22101342 15736874 0 35473195 12899071 17306282 15653647 17628464 PCI-MSI 524292-edge mlx4-4@0000:01:00.0

53: 13120912 13195733 0 16532228 43754798 16317650 13354143 19197302 PCI-MSI 524293-edge mlx4-5@0000:01:00.0

54: 15014725 15035047 0 15969351 15202026 27884485 16546071 16131401 PCI-MSI 524294-edge mlx4-6@0000:01:00.0

55: 11877881 14653315 0 17391523 16801545 16305608 33268034 17278115 PCI-MSI 524295-edge mlx4-7@0000:01:00.0

56: 15152890 14510976 0 16195574 14855793 18292044 16535228 30567083 PCI-MSI 524296-edge mlx4-8@0000:01:00.0
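To spot this kind of skew without eyeballing the table, here is a rough Python sketch that flags any interrupt whose count sits almost entirely on one CPU. It assumes rows shaped like the /proc/interrupts output above (IRQ number, per-CPU counts, then a description); the function names and the 0.99 threshold are arbitrary choices for illustration. The reported CPU index is just the zero-based column position, since the header line is not part of the paste.

```python
def parse_irq_row(row):
    """Split one /proc/interrupts-style row into (irq, per_cpu_counts, description)."""
    irq, rest = row.split(":", 1)
    fields = rest.split()
    counts = []
    # The leading purely numeric fields are the per-CPU counters.
    for field in fields:
        if not field.isdigit():
            break
        counts.append(int(field))
    description = " ".join(fields[len(counts):])
    return irq.strip(), counts, description


def single_cpu_irqs(rows, threshold=0.99):
    """Return (irq, cpu_index, description) for IRQs concentrated on one CPU."""
    flagged = []
    for row in rows:
        if ":" not in row:
            continue
        irq, counts, description = parse_irq_row(row)
        total = sum(counts)
        if len(counts) < 2 or total == 0:
            continue
        busiest = max(range(len(counts)), key=counts.__getitem__)
        if counts[busiest] / total >= threshold:
            flagged.append((irq, busiest, description))
    return flagged


# Three rows copied from the table above.
rows = [
    "49: 25974603 10358768 0 15163597 13956099 12813314 19764044 19329656 "
    "PCI-MSI 524289-edge mlx4-1@0000:01:00.0",
    "51: 0 0 2799451354 0 0 0 0 0 PCI-MSI 524291-edge mlx4-3@0000:01:00.0",
    "52: 22101342 15736874 0 35473195 12899071 17306282 15653647 17628464 "
    "PCI-MSI 524292-edge mlx4-4@0000:01:00.0",
]

for irq, cpu, description in single_cpu_irqs(rows):
    print(f"IRQ {irq} ({description}) is handled almost entirely by column {cpu}")
```

On the rows above, only IRQ 51 is flagged, which matches what the table shows.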