Linux: Soft Lockup in Mellanox driver.

Hello, we are seeing a lot of soft lockups in the Mellanox driver. We are wondering whether this is a known issue or what might be causing it. It often happens after this error:

kernel: [736550.054087] mlx5_core 0000:41:00.1 mcx3p1: Failed to get min RX wqes on Channel[20] RQN[0x26e9] wq cur_sz(1) min_rx_wqes(2)

Jun 30 00:21:57 bcn01-data01 kernel: [736550.054091] mlx5_core 0000:41:00.1 mcx3p1: RX timeout on channel: 20, ICOSQ: 0x26e7 RQ: 0x26e9, CQ: 0x48d

Jun 30 00:21:57 bcn01-data01 kernel: [736550.065995] mlx5_core 0000:41:00.1 mcx3p1: EQ 0x1b: Cons = 0x8ca620b, irqn = 0xc7

RIP always points to the “mlx5e_poll_ico_cq+0xda/0x380” function.

We are using Debian:

Linux host 5.10.0-7-amd64 #1 SMP Debian 5.10.40-1 (2021-05-28) x86_64 GNU/Linux

After this, a soft lockup occurs and the system does not recover until it is rebooted.

Jun 30 00:21:58 bcn01-data01 kernel: [736551.054043] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:


Jun 30 00:21:58 bcn01-data01 kernel: [736551.068617] (detected by 5, t=5255 jiffies, g=162350793, q=15707)

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068619] Sending NMI from CPU 5 to CPUs 20:

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068729] NMI backtrace for cpu 20

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068730] CPU: 20 PID: 0 Comm: swapper/20 Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068731] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.0.3 01/15/2021

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068731] RIP: 0010:mlx5e_poll_ico_cq+0xda/0x380 [mlx5_core]

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068732] Code: 41 c1 c6 08 eb 1c 0f b6 08 84 c9 74 12 80 f9 01 0f 84 e6 00 00 00 80 3d f1 a8 0a 00 00 74 4f 41 89 ec 44 89 e0 21 f0 0f b7 c0 <48> c1 e0 04 48 01 f8 0f b6 50 01 42 8d 2c 22 66 45 39 f4 75 c7 41

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068732] RSP: 0018:ffff9b97806fce70 EFLAGS: 00000206

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068733] RAX: 000000000000000a RBX: ffff88bd9d973480 RCX: 0000000000000000

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068734] RDX: 0000000000000000 RSI: 000000000000007f RDI: ffff88bcd9e48000

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068734] RBP: 000000000000000a R08: 0000000000000001 R09: ffff88db7f32cb80

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068734] R10: 0000000000000048 R11: ffffffffa66060c0 R12: 000000000000000a

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068734] R13: ffff88bce93f4040 R14: 0000000000000001 R15: 0000000000000000

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068735] FS: 0000000000000000(0000) GS:ffff88db7f300000(0000) knlGS:0000000000000000

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068735] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068735] CR2: 000000c05a23b000 CR3: 0000000179f68000 CR4: 0000000000350ee0

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068735] Call Trace:

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068736] <IRQ>

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068736] mlx5e_napi_poll+0xe9/0x670 [mlx5_core]

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068736] ? mlx5e_completion_event+0x3c/0x40 [mlx5_core]

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068736] net_rx_action+0x145/0x3e0

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] __do_softirq+0xc5/0x275

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] asm_call_irq_on_stack+0x12/0x20

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] </IRQ>

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] do_softirq_own_stack+0x37/0x40

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] irq_exit_rcu+0x8e/0xc0

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068738] common_interrupt+0x74/0x130

Jun 30 00:21:58 bcn01-data01 kernel: [736551.068738] asm_common_interrupt+0x1e/0x40

Hi,

Nothing known comes up for these log messages. First, make sure you are using the latest HCA firmware.
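To see which firmware the adapter currently runs, the output of `ethtool -i <interface>` can be read. Below is a minimal Python sketch, assuming `ethtool` is installed and that the interface is named mcx3p1 as in the log above; the helper names `parse_ethtool_info` and `driver_info` are made up for this example.

```python
import subprocess


def parse_ethtool_info(text: str) -> dict:
    """Parse the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as a PCI
        # address (which contain colons themselves) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(iface: str) -> dict:
    """Run `ethtool -i <iface>` and return its fields (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-i", iface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ethtool_info(result.stdout)
```

For example, `driver_info("mcx3p1").get("firmware-version")` should return the firmware string reported by the driver, which can then be compared with the latest release for that card.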

If you are not using Mellanox OFED, raise the issue with your OS vendor if you run an official vendor kernel, or on the kernel forum if you run a kernel from kernel.org.

If you are using an AMD CPU, double-check that you have iommu=pt in the GRUB configuration.
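A quick way to confirm that the running kernel was actually booted with that parameter is to look at /proc/cmdline. Here is a minimal Python sketch; the function name `has_iommu_pt` is made up for this example.

```python
from pathlib import Path


def has_iommu_pt(cmdline: str) -> bool:
    """Return True if the kernel command line contains the exact token iommu=pt."""
    # Compare whole whitespace-separated tokens so that other parameters
    # (for example amd_iommu=on) are not mistaken for iommu=pt.
    return "iommu=pt" in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print("iommu=pt present:", has_iommu_pt(proc_cmdline.read_text()))
```

If it is missing, on Debian the parameter is normally added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot.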

If you are using Mellanox OFED and the latest firmware, you can open an official support ticket; however, you or your organization must have a valid support contract with Nvidia.