Hello, we are seeing a lot of soft lockups in the Mellanox driver (mlx5_core). We are wondering whether this is a known issue, or what might be causing it. It often happens right after this error:
kernel: [736550.054087] mlx5_core 0000:41:00.1 mcx3p1: Failed to get min RX wqes on Channel[20] RQN[0x26e9] wq cur_sz(1) min_rx_wqes(2)
Jun 30 00:21:57 bcn01-data01 kernel: [736550.054091] mlx5_core 0000:41:00.1 mcx3p1: RX timeout on channel: 20, ICOSQ: 0x26e7 RQ: 0x26e9, CQ: 0x48d
Jun 30 00:21:57 bcn01-data01 kernel: [736550.065995] mlx5_core 0000:41:00.1 mcx3p1: EQ 0x1b: Cons = 0x8ca620b, irqn = 0xc7
The RIP always points to the “mlx5e_poll_ico_cq+0xda/0x380” function.
We are using Debian:
Linux host 5.10.0-7-amd64 #1 SMP Debian 5.10.40-1 (2021-05-28) x86_64 GNU/Linux
After this error, a soft lockup occurs and the system does not recover until it is rebooted:
Jun 30 00:21:58 bcn01-data01 kernel: [736551.054043] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068617] (detected by 5, t=5255 jiffies, g=162350793, q=15707)
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068619] Sending NMI from CPU 5 to CPUs 20:
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068729] NMI backtrace for cpu 20
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068730] CPU: 20 PID: 0 Comm: swapper/20 Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068731] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.0.3 01/15/2021
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068731] RIP: 0010:mlx5e_poll_ico_cq+0xda/0x380 [mlx5_core]
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068732] Code: 41 c1 c6 08 eb 1c 0f b6 08 84 c9 74 12 80 f9 01 0f 84 e6 00 00 00 80 3d f1 a8 0a 00 00 74 4f 41 89 ec 44 89 e0 21 f0 0f b7 c0 <48> c1 e0 04 48 01 f8 0f b6 50 01 42 8d 2c 22 66 45 39 f4 75 c7 41
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068732] RSP: 0018:ffff9b97806fce70 EFLAGS: 00000206
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068733] RAX: 000000000000000a RBX: ffff88bd9d973480 RCX: 0000000000000000
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068734] RDX: 0000000000000000 RSI: 000000000000007f RDI: ffff88bcd9e48000
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068734] RBP: 000000000000000a R08: 0000000000000001 R09: ffff88db7f32cb80
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068734] R10: 0000000000000048 R11: ffffffffa66060c0 R12: 000000000000000a
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068734] R13: ffff88bce93f4040 R14: 0000000000000001 R15: 0000000000000000
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068735] FS: 0000000000000000(0000) GS:ffff88db7f300000(0000) knlGS:0000000000000000
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068735] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068735] CR2: 000000c05a23b000 CR3: 0000000179f68000 CR4: 0000000000350ee0
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068735] Call Trace:
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068736] <IRQ>
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068736] mlx5e_napi_poll+0xe9/0x670 [mlx5_core]
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068736] ? mlx5e_completion_event+0x3c/0x40 [mlx5_core]
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068736] net_rx_action+0x145/0x3e0
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] __do_softirq+0xc5/0x275
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] asm_call_irq_on_stack+0x12/0x20
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] </IRQ>
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] do_softirq_own_stack+0x37/0x40
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068737] irq_exit_rcu+0x8e/0xc0
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068738] common_interrupt+0x74/0x130
Jun 30 00:21:58 bcn01-data01 kernel: [736551.068738] asm_common_interrupt+0x1e/0x40
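
For anyone triaging similar logs, here is a minimal Python sketch that pulls the channel number out of each "RX timeout" line and the CPU number out of each "NMI backtrace" line, so the two can be compared. It is not part of the original report: the function name `summarize` and both regular expressions are assumptions based only on the message formats shown above, and other kernel or driver versions may word these messages differently. In the trace above both values are 20.

```python
import re

# Assumed patterns, derived only from the log lines quoted in this post.
RX_TIMEOUT = re.compile(r"RX timeout on channel: (\d+),")
NMI_CPU = re.compile(r"NMI backtrace for cpu (\d+)")


def summarize(lines):
    """Return (channels, cpus) seen in the given kernel log lines.

    channels: channel numbers from mlx5 "RX timeout" messages
    cpus:     CPU numbers from "NMI backtrace for cpu N" messages
    """
    channels, cpus = [], []
    for line in lines:
        m = RX_TIMEOUT.search(line)
        if m:
            channels.append(int(m.group(1)))
        m = NMI_CPU.search(line)
        if m:
            cpus.append(int(m.group(1)))
    return channels, cpus


# Two lines copied from the log above.
sample = [
    "Jun 30 00:21:57 bcn01-data01 kernel: [736550.054091] mlx5_core 0000:41:00.1 "
    "mcx3p1: RX timeout on channel: 20, ICOSQ: 0x26e7 RQ: 0x26e9, CQ: 0x48d",
    "Jun 30 00:21:58 bcn01-data01 kernel: [736551.068729] NMI backtrace for cpu 20",
]

channels, cpus = summarize(sample)
print("RX timeout channels:", channels)
print("NMI backtrace CPUs:", cpus)
```

To run it over a whole log file, pass an open file object instead of `sample`, for example `summarize(open(path))` with `path` pointing at a saved copy of the kernel log.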