ConnectX-5 SR-IOV TX timeout on queue

Hi NVIDIA,
We run Linux VMs in Azure North Europe on ARM-based VM sizes. The platform exposes a ConnectX-5 VF NIC for "accelerated networking".
Some of our deployments are hit by an issue where the network intermittently stalls and Linux logs the following in a loop:

[  228.330181] mlx5_core 58d8:00:02.0 eth1: TX timeout detected
[  228.330248] mlx5_core 58d8:00:02.0 eth1: TX timeout on queue: 1, SQ: 0x19e, CQ: 0xa7a, SQ Cons: 0x17 SQ Prod: 0x28, usecs since last trans: 28410000
[  228.330263] mlx5_core 58d8:00:02.0 eth1: EQ 0x8: Cons = 0x36, irqn = 0x11
[  228.330450] mlx5_core 58d8:00:02.0 eth1: Recovered 2 eqes on EQ 0x8

The timeout always hits the same queue: here it is queue 1, and it remains queue 1 in the logs across subsequent reboots.
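
In case it helps narrow things down, the per-queue counters should confirm that only that send queue stalls while the others keep moving (exact stat names may vary with the driver version):

# ethtool -S eth1 | grep -E 'tx[0-9]+_packets'

During an incident we would expect the counter for the affected queue to stop incrementing while the others continue.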

# ethtool -S eth1 | grep rearm
     ch_eq_rearm: 4
     ch0_eq_rearm: 0
     ch1_eq_rearm: 4
     ch2_eq_rearm: 0
     ch3_eq_rearm: 0
     ch4_eq_rearm: 0
     ch5_eq_rearm: 0
     ch6_eq_rearm: 0
     ch7_eq_rearm: 0
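
Since ch1_eq_rearm is the only non-zero rearm counter and matches the queue from the timeouts (and the "Recovered 2 eqes" message above), this seems to point at missed completion interrupts on that channel rather than a stuck workload. A rough way to watch interrupt delivery, assuming the irqn 0x11 from the log is the Linux IRQ number (17 decimal):

# grep -i mlx5 /proc/interrupts
# cat /proc/irq/17/smp_affinity_list

If the per-CPU count for that vector stops increasing while traffic is queued on the interface, the problem is on the interrupt path rather than in the SQ itself.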

# ethtool -i eth1
driver: mlx5_core
version: 5.14.0-284.62.1.el9_2.aarch64
firmware-version: 16.30.1284 (MSF0000000012)
expansion-rom-version: 
bus-info: 58d8:00:02.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

# devlink health
pci/58d8:00:02.0:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 1200000 auto_recover true auto_dump true
pci/58d8:00:02.0/327680:
  reporter tx
    state healthy error 4 recover 4 grace_period 500 auto_recover true auto_dump true
  reporter rx
    state healthy error 0 recover 0 grace_period 500 auto_recover true auto_dump true
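
In case the tx reporter captured anything useful around those recoveries (error 4 / recover 4 above), its diagnose and dump output can be pulled with devlink as well, e.g.:

# devlink health diagnose pci/58d8:00:02.0/327680 reporter tx
# devlink health dump show pci/58d8:00:02.0/327680 reporter tx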

# uname -a
Linux vm1 5.14.0-284.62.1.el9_2.aarch64 #1 SMP PREEMPT_DYNAMIC Fri Apr 5 15:07:56 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
(we have tried various kernel versions, including the latest, and the issue persists)

The issue goes away after redeploying the VM, with some luck.
It also goes away after running ethtool -L eth1 combined 1 to force a single queue (a sketch for making that persistent across reboots is below).
It does not seem to be linked to the workload either, as the issue has appeared on freshly deployed VMs running on different hypervisors.
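
In case it helps anyone hitting the same thing, a rough way to pin the single-queue workaround across reboots would be a udev rule like the one below (untested sketch; it assumes the VF always enumerates as eth1 and that ethtool is in /usr/sbin):

/etc/udev/rules.d/70-eth1-single-queue.rules:
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth1", RUN+="/usr/sbin/ethtool -L eth1 combined 1"

Obviously this only masks the problem (and gives up the extra queues), so we would much rather understand the root cause.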

Tickets are open with the vendors, but so far nothing has been found on their side and the hypervisor looks healthy.
Any guidance is welcome.

Update: this is apparently an issue with ARM-based Ampere Altra VMs that is being addressed by Azure.
