Hi!
I have a host with two Mellanox ConnectX-6 Dx NICs in a bond, with an eBPF/XDP program attached. My program just returns all packets back to the port via XDP_TX.
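For reference, a minimal sketch of a program of this kind could look like the following. This is not the exact program running on the host. The function name `xdp_tx_all` is a placeholder, and the `SEC()` macro is defined locally as a stand-in for the one from libbpf's `bpf_helpers.h`. The sketch does not touch the packet at all (no MAC swap), which may differ from a real reflector.

```c
#include <linux/bpf.h>   /* struct xdp_md, XDP_TX */

/* Local stand-in for the SEC() macro from libbpf's bpf_helpers.h,
 * defined here so the sketch has no libbpf dependency (assumption). */
#ifndef SEC
#define SEC(name) __attribute__((section(name), used))
#endif

/* Placeholder program name: sends every received frame back out of
 * the port it arrived on, without inspecting or modifying it. */
SEC("xdp")
int xdp_tx_all(struct xdp_md *ctx)
{
    (void)ctx;       /* packet contents are not inspected */
    return XDP_TX;   /* transmit the frame back out of the receiving port */
}

/* License string placed in the "license" section of the object file. */
char LICENSE[] SEC("license") = "GPL";
```

Something like `clang -O2 -g -target bpf -c xdp_tx_all.c -o xdp_tx_all.o` should build it (the file names are placeholders).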
Sometimes the NICs go down with the message:
ACCESS_REG: canceled on out of queue timeout (screenshot attached)
After that, only a hard reset via mlxconf brings the NICs back to work.
I tried different driver versions and updated the firmware to version 20.43.1014.
The log from dmesg looks like this:
[Wed Dec 4 16:14:13 2024] ------------[ cut here ]------------
[Wed Dec 4 16:14:13 2024] NETDEV WATCHDOG: ens4np0 (mlx5_core): transmit queue 15 timed out 16000 ms
[Wed Dec 4 16:14:13 2024] WARNING: CPU: 108 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] Modules linked in: nf_tables nfnetlink vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd cuse 8021q garp mrp stp llc bonding rfkill vfat fat amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm wmi_bmof irqbypass rapl acpi_cpufreq pcspkr ipmi_ssif mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ipmi_si ipmi_devintf ptdma i2c_piix4 k10temp ipmi_msghandler joydev auth_rpcgss fuse drm sunrpc xfs libcrc32c mlx5_ib ib_uverbs ib_core sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mlx5_core ahci mlxfw libahci ixgbe psample libata mdio tls ccp dca pci_hyperv_intf sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod xpmem(OE)
[Wed Dec 4 16:14:13 2024] CPU: 108 PID: 0 Comm: swapper/108 Kdump: loaded Tainted: G OE ------- --- 5.14.0-427.31.1.el9_4.x86_64 #1
[Wed Dec 4 16:14:13 2024] Hardware name: Lenovo ThinkSystem SR665/7D2VCTOLWW, BIOS D8E132H-3.11 09/05/2023
[Wed Dec 4 16:14:13 2024] RIP: 0010:dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] Code: ff ff ff 4c 89 e7 c6 05 ad d2 6c 01 01 e8 93 37 fa ff 45 89 f8 44 89 f1 4c 89 e6 48 89 c2 48 c7 c7 28 aa 3f 89 e8 ab 62 6a ff <0f> 0b e9 2e ff ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
[Wed Dec 4 16:14:13 2024] RSP: 0018:ffffb2214e42cea0 EFLAGS: 00010286
[Wed Dec 4 16:14:13 2024] RAX: 0000000000000000 RBX: ffff900216b00488 RCX: 0000000000000000
[Wed Dec 4 16:14:13 2024] RDX: ffff9040cf52d780 RSI: ffff9040cf520840 RDI: 0000000000000300
[Wed Dec 4 16:14:13 2024] RBP: ffff900217801680 R08: 80000000ffff89c6 R09: 0000000000ffff0a
[Wed Dec 4 16:14:13 2024] R10: 0000000000000004 R11: 000000000000004c R12: ffff900216b00000
[Wed Dec 4 16:14:13 2024] R13: ffff900216b003dc R14: 000000000000000f R15: 0000000000003e80
[Wed Dec 4 16:14:13 2024] FS: 0000000000000000(0000) GS:ffff9040cf500000(0000) knlGS:0000000000000000
[Wed Dec 4 16:14:13 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Dec 4 16:14:13 2024] CR2: 00007f8cd72984e0 CR3: 00000046f6010004 CR4: 0000000000770ee0
[Wed Dec 4 16:14:13 2024] PKRU: 55555554
[Wed Dec 4 16:14:13 2024] Call Trace:
[Wed Dec 4 16:14:13 2024]
[Wed Dec 4 16:14:13 2024] ? show_trace_log_lvl+0x1c4/0x2df
[Wed Dec 4 16:14:13 2024] ? show_trace_log_lvl+0x1c4/0x2df
[Wed Dec 4 16:14:13 2024] ? call_timer_fn+0x24/0x130
[Wed Dec 4 16:14:13 2024] ? dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] ? __warn+0x81/0x110
[Wed Dec 4 16:14:13 2024] ? dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] ? report_bug+0x10a/0x140
[Wed Dec 4 16:14:13 2024] ? handle_bug+0x3c/0x70
[Wed Dec 4 16:14:13 2024] ? exc_invalid_op+0x14/0x70
[Wed Dec 4 16:14:13 2024] ? asm_exc_invalid_op+0x16/0x20
[Wed Dec 4 16:14:13 2024] ? dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] ? __pfx_dev_watchdog+0x10/0x10
[Wed Dec 4 16:14:13 2024] ? __pfx_dev_watchdog+0x10/0x10
[Wed Dec 4 16:14:13 2024] call_timer_fn+0x24/0x130
[Wed Dec 4 16:14:13 2024] __run_timers.part.0+0x1ee/0x280
[Wed Dec 4 16:14:13 2024] ? __pfx_tick_sched_timer+0x10/0x10
[Wed Dec 4 16:14:13 2024] ? __hrtimer_run_queues+0x139/0x2c0
[Wed Dec 4 16:14:13 2024] ? ktime_get+0x35/0xa0
[Wed Dec 4 16:14:13 2024] run_timer_softirq+0x26/0x50
[Wed Dec 4 16:14:13 2024] __do_softirq+0xc7/0x2ac
[Wed Dec 4 16:14:13 2024] __irq_exit_rcu+0xa1/0xc0
[Wed Dec 4 16:14:13 2024] sysvec_apic_timer_interrupt+0x72/0x90
[Wed Dec 4 16:14:13 2024]
[Wed Dec 4 16:14:13 2024]
[Wed Dec 4 16:14:13 2024] asm_sysvec_apic_timer_interrupt+0x16/0x20
[Wed Dec 4 16:14:13 2024] RIP: 0010:cpuidle_enter_state+0xca/0x430
[Wed Dec 4 16:14:13 2024] Code: 4f ff 65 8b 3d 53 cd 5a 77 e8 62 4a 4f ff 49 89 c5 0f 1f 44 00 00 31 ff e8 e3 64 4e ff 45 84 ff 0f 85 2a 01 00 00 fb 45 85 f6 <0f> 88 2c 01 00 00 49 63 d6 4c 2b 2c 24 48 8d 04 52 48 8d 04 82 49
[Wed Dec 4 16:14:13 2024] RSP: 0018:ffffb221409a7e80 EFLAGS: 00000202
[Wed Dec 4 16:14:13 2024] RAX: ffff9040cf532ec0 RBX: 0000000000000002 RCX: 000000000000001f
[Wed Dec 4 16:14:13 2024] RDX: 0000000000000000 RSI: 00000000401ec933 RDI: 0000000000000000
[Wed Dec 4 16:14:13 2024] RBP: ffff9001fb26ac00 R08: 00000203ce74fd71 R09: 0000000000000001
[Wed Dec 4 16:14:13 2024] R10: 00000000003d0827 R11: 0000000000198712 R12: ffffffff89ec6500
[Wed Dec 4 16:14:13 2024] R13: 00000203ce74fd71 R14: 0000000000000002 R15: 0000000000000000
[Wed Dec 4 16:14:13 2024] cpuidle_enter+0x29/0x40
[Wed Dec 4 16:14:13 2024] cpuidle_idle_call+0xfa/0x160
[Wed Dec 4 16:14:13 2024] do_idle+0x7b/0xe0
[Wed Dec 4 16:14:13 2024] cpu_startup_entry+0x19/0x20
[Wed Dec 4 16:14:13 2024] start_secondary+0x13f/0x170
[Wed Dec 4 16:14:13 2024] secondary_startup_64_no_verify+0xe4/0xeb
[Wed Dec 4 16:14:13 2024]
[Wed Dec 4 16:14:13 2024] ---[ end trace 8b05639dda3dea67 ]---
My packet rate is 60 Mpps and the CPU load is 30%.