Mellanox ConnectX-6 Dx problem: out of queue timeout

Hi!
I have a host with 2 Mellanox ConnectX-6 Dx NICs in a bond, with an eBPF/XDP program attached. My program just returns all packets back out the same port via XDP_TX.
Sometimes the NICs go down with the message (screenshot attached):
ACCESS_REG: canceled on out of queue timeout


After that, only a hard reset via mlxconf brings the NICs back to work.
I have tried different driver versions and updated the FW to firmware version 20.43.1014.
The dmesg log looks like this:
[Wed Dec 4 16:14:13 2024] ------------[ cut here ]------------
[Wed Dec 4 16:14:13 2024] NETDEV WATCHDOG: ens4np0 (mlx5_core): transmit queue 15 timed out 16000 ms
[Wed Dec 4 16:14:13 2024] WARNING: CPU: 108 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] Modules linked in: nf_tables nfnetlink vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd cuse 8021q garp mrp stp llc bonding rfkill vfat fat amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm wmi_bmof irqbypass rapl acpi_cpufreq pcspkr ipmi_ssif mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ipmi_si ipmi_devintf ptdma i2c_piix4 k10temp ipmi_msghandler joydev auth_rpcgss fuse drm sunrpc xfs libcrc32c mlx5_ib ib_uverbs ib_core sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mlx5_core ahci mlxfw libahci ixgbe psample libata mdio tls ccp dca pci_hyperv_intf sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod xpmem(OE)
[Wed Dec 4 16:14:13 2024] CPU: 108 PID: 0 Comm: swapper/108 Kdump: loaded Tainted: G OE ------- --- 5.14.0-427.31.1.el9_4.x86_64 #1
[Wed Dec 4 16:14:13 2024] Hardware name: Lenovo ThinkSystem SR665/7D2VCTOLWW, BIOS D8E132H-3.11 09/05/2023
[Wed Dec 4 16:14:13 2024] RIP: 0010:dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] Code: ff ff ff 4c 89 e7 c6 05 ad d2 6c 01 01 e8 93 37 fa ff 45 89 f8 44 89 f1 4c 89 e6 48 89 c2 48 c7 c7 28 aa 3f 89 e8 ab 62 6a ff <0f> 0b e9 2e ff ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
[Wed Dec 4 16:14:13 2024] RSP: 0018:ffffb2214e42cea0 EFLAGS: 00010286
[Wed Dec 4 16:14:13 2024] RAX: 0000000000000000 RBX: ffff900216b00488 RCX: 0000000000000000
[Wed Dec 4 16:14:13 2024] RDX: ffff9040cf52d780 RSI: ffff9040cf520840 RDI: 0000000000000300
[Wed Dec 4 16:14:13 2024] RBP: ffff900217801680 R08: 80000000ffff89c6 R09: 0000000000ffff0a
[Wed Dec 4 16:14:13 2024] R10: 0000000000000004 R11: 000000000000004c R12: ffff900216b00000
[Wed Dec 4 16:14:13 2024] R13: ffff900216b003dc R14: 000000000000000f R15: 0000000000003e80
[Wed Dec 4 16:14:13 2024] FS: 0000000000000000(0000) GS:ffff9040cf500000(0000) knlGS:0000000000000000
[Wed Dec 4 16:14:13 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Dec 4 16:14:13 2024] CR2: 00007f8cd72984e0 CR3: 00000046f6010004 CR4: 0000000000770ee0
[Wed Dec 4 16:14:13 2024] PKRU: 55555554
[Wed Dec 4 16:14:13 2024] Call Trace:
[Wed Dec 4 16:14:13 2024] <IRQ>
[Wed Dec 4 16:14:13 2024] ? show_trace_log_lvl+0x1c4/0x2df
[Wed Dec 4 16:14:13 2024] ? show_trace_log_lvl+0x1c4/0x2df
[Wed Dec 4 16:14:13 2024] ? call_timer_fn+0x24/0x130
[Wed Dec 4 16:14:13 2024] ? dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] ? __warn+0x81/0x110
[Wed Dec 4 16:14:13 2024] ? dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] ? report_bug+0x10a/0x140
[Wed Dec 4 16:14:13 2024] ? handle_bug+0x3c/0x70
[Wed Dec 4 16:14:13 2024] ? exc_invalid_op+0x14/0x70
[Wed Dec 4 16:14:13 2024] ? asm_exc_invalid_op+0x16/0x20
[Wed Dec 4 16:14:13 2024] ? dev_watchdog+0x215/0x220
[Wed Dec 4 16:14:13 2024] ? __pfx_dev_watchdog+0x10/0x10
[Wed Dec 4 16:14:13 2024] ? __pfx_dev_watchdog+0x10/0x10
[Wed Dec 4 16:14:13 2024] call_timer_fn+0x24/0x130
[Wed Dec 4 16:14:13 2024] __run_timers.part.0+0x1ee/0x280
[Wed Dec 4 16:14:13 2024] ? __pfx_tick_sched_timer+0x10/0x10
[Wed Dec 4 16:14:13 2024] ? __hrtimer_run_queues+0x139/0x2c0
[Wed Dec 4 16:14:13 2024] ? ktime_get+0x35/0xa0
[Wed Dec 4 16:14:13 2024] run_timer_softirq+0x26/0x50
[Wed Dec 4 16:14:13 2024] __do_softirq+0xc7/0x2ac
[Wed Dec 4 16:14:13 2024] __irq_exit_rcu+0xa1/0xc0
[Wed Dec 4 16:14:13 2024] sysvec_apic_timer_interrupt+0x72/0x90
[Wed Dec 4 16:14:13 2024] </IRQ>
[Wed Dec 4 16:14:13 2024] <TASK>
[Wed Dec 4 16:14:13 2024] asm_sysvec_apic_timer_interrupt+0x16/0x20
[Wed Dec 4 16:14:13 2024] RIP: 0010:cpuidle_enter_state+0xca/0x430
[Wed Dec 4 16:14:13 2024] Code: 4f ff 65 8b 3d 53 cd 5a 77 e8 62 4a 4f ff 49 89 c5 0f 1f 44 00 00 31 ff e8 e3 64 4e ff 45 84 ff 0f 85 2a 01 00 00 fb 45 85 f6 <0f> 88 2c 01 00 00 49 63 d6 4c 2b 2c 24 48 8d 04 52 48 8d 04 82 49
[Wed Dec 4 16:14:13 2024] RSP: 0018:ffffb221409a7e80 EFLAGS: 00000202
[Wed Dec 4 16:14:13 2024] RAX: ffff9040cf532ec0 RBX: 0000000000000002 RCX: 000000000000001f
[Wed Dec 4 16:14:13 2024] RDX: 0000000000000000 RSI: 00000000401ec933 RDI: 0000000000000000
[Wed Dec 4 16:14:13 2024] RBP: ffff9001fb26ac00 R08: 00000203ce74fd71 R09: 0000000000000001
[Wed Dec 4 16:14:13 2024] R10: 00000000003d0827 R11: 0000000000198712 R12: ffffffff89ec6500
[Wed Dec 4 16:14:13 2024] R13: 00000203ce74fd71 R14: 0000000000000002 R15: 0000000000000000
[Wed Dec 4 16:14:13 2024] cpuidle_enter+0x29/0x40
[Wed Dec 4 16:14:13 2024] cpuidle_idle_call+0xfa/0x160
[Wed Dec 4 16:14:13 2024] do_idle+0x7b/0xe0
[Wed Dec 4 16:14:13 2024] cpu_startup_entry+0x19/0x20
[Wed Dec 4 16:14:13 2024] start_secondary+0x13f/0x170
[Wed Dec 4 16:14:13 2024] secondary_startup_64_no_verify+0xe4/0xeb
[Wed Dec 4 16:14:13 2024] </TASK>
[Wed Dec 4 16:14:13 2024] ---[ end trace 8b05639dda3dea67 ]---
My load is about 60 Mpps, and CPU load is around 30%.

Hi madmax240484,

Thank you for posting your query on the NVIDIA Community. Based on the description of the issue, it requires extensive data collection and debugging, along with an understanding of the replication steps.

Unfortunately, a community ticket cannot be used to perform a debug, and a valid support ticket will be needed to continue the investigation.

If there is an active entitlement/support contract in place, please do not hesitate to open a support ticket by emailing enterprisesupport@nvidia.com

For contracts, please reach out to Networking-Contracts@nvidia.com

Thanks,
Namrata.