ConnectX-5 SR-IOV TX timeout on queue

Hi NVIDIA,
We run Linux VMs in Azure North Europe on ARM-based VM sizes. The platform exposes a ConnectX-5 VF NIC for "accelerated networking".
Some of our deployments are hit by an issue where the network intermittently stalls and Linux logs the following in a loop:

[  228.330181] mlx5_core 58d8:00:02.0 eth1: TX timeout detected
[  228.330248] mlx5_core 58d8:00:02.0 eth1: TX timeout on queue: 1, SQ: 0x19e, CQ: 0xa7a, SQ Cons: 0x17 SQ Prod: 0x28, usecs since last trans: 28410000
[  228.330263] mlx5_core 58d8:00:02.0 eth1: EQ 0x8: Cons = 0x36, irqn = 0x11
[  228.330450] mlx5_core 58d8:00:02.0 eth1: Recovered 2 eqes on EQ 0x8

The timeout always hits the same queue: here it is queue 1, and it remains queue 1 in the logs across subsequent reboots.
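
In case it helps narrow things down, the per-queue counters should confirm that only that send queue stalls while the others keep moving (exact stat names may vary with the driver version):

# ethtool -S eth1 | grep -E 'tx[0-9]+_packets'

During an incident we would expect the counter for the affected queue to stop incrementing while the others continue.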

# ethtool -S eth1 | grep rearm
     ch_eq_rearm: 4
     ch0_eq_rearm: 0
     ch1_eq_rearm: 4
     ch2_eq_rearm: 0
     ch3_eq_rearm: 0
     ch4_eq_rearm: 0
     ch5_eq_rearm: 0
     ch6_eq_rearm: 0
     ch7_eq_rearm: 0
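
Since ch1_eq_rearm is the only non-zero rearm counter and matches the queue from the timeouts (and the "Recovered 2 eqes" message above), this seems to point at missed completion interrupts on that channel rather than a stuck workload. A rough way to watch interrupt delivery, assuming the irqn 0x11 from the log is the Linux IRQ number (17 decimal):

# grep -i mlx5 /proc/interrupts
# cat /proc/irq/17/smp_affinity_list

If the per-CPU count for that vector stops increasing while traffic is queued on the interface, the problem is on the interrupt path rather than in the SQ itself.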

# ethtool -i eth1
driver: mlx5_core
version: 5.14.0-284.62.1.el9_2.aarch64
firmware-version: 16.30.1284 (MSF0000000012)
expansion-rom-version: 
bus-info: 58d8:00:02.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

# devlink health
pci/58d8:00:02.0:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 1200000 auto_recover true auto_dump true
pci/58d8:00:02.0/327680:
  reporter tx
    state healthy error 4 recover 4 grace_period 500 auto_recover true auto_dump true
  reporter rx
    state healthy error 0 recover 0 grace_period 500 auto_recover true auto_dump true
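
In case the tx reporter captured anything useful around those recoveries (error 4 / recover 4 above), its diagnose and dump output can be pulled with devlink as well, e.g.:

# devlink health diagnose pci/58d8:00:02.0/327680 reporter tx
# devlink health dump show pci/58d8:00:02.0/327680 reporter tx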

# uname -a
Linux vm1 5.14.0-284.62.1.el9_2.aarch64 #1 SMP PREEMPT_DYNAMIC Fri Apr 5 15:07:56 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
(we have tried various kernel versions, including the latest, and the issue persists)

The issue goes away after redeploying the VM, with some luck.
It also goes away after running ethtool -L eth1 combined 1 to force a single queue (a sketch for making that persistent across reboots is below).
It does not seem to be linked to the workload either, as the issue has appeared on freshly deployed VMs running on different hypervisors.
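
In case it helps anyone hitting the same thing, a rough way to pin the single-queue workaround across reboots would be a udev rule like the one below (untested sketch; it assumes the VF always enumerates as eth1 and that ethtool is in /usr/sbin):

/etc/udev/rules.d/70-eth1-single-queue.rules:
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth1", RUN+="/usr/sbin/ethtool -L eth1 combined 1"

Obviously this only masks the problem (and gives up the extra queues), so we would much rather understand the root cause.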

Tickets are open with the vendors, but so far nothing has been found on their side and the hypervisor looks healthy.
Any guidance is welcome.

Update: this is apparently an issue with ARM-based Ampere Altra VMs that is being addressed by Azure.
