NETDEV WATCHDOG: eth0 (mlx5_core): transmit queue 0 timed out
kernel:[47075.368840] watchdog: BUG: soft lockup - CPU#64 stuck for 22s! [ksoftirqd/64:333]

I have a problem with 100G NICs. In the evening, when traffic peaks, the Mellanox NIC generates 100 IRQ on one core. At those times I see network degradation. Please help me resolve this problem!
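To show how unevenly the NIC interrupts land on the cores, the per-CPU counters in /proc/interrupts can be summed for the mlx5 rows. Below is a minimal sketch, not a definitive tool. It assumes the NIC's interrupt rows contain the substring "mlx5" in their label. The helper name `irq_counts_per_cpu` and the sample text, including the IRQ numbers and PCI address, are made up for illustration and are not taken from this machine.

```python
from collections import defaultdict

def irq_counts_per_cpu(interrupts_text, pattern="mlx5"):
    """Sum interrupt counters per CPU for rows whose text contains `pattern`.

    `interrupts_text` is expected to be in /proc/interrupts format:
    a header row of CPU names, then one row per IRQ.
    """
    lines = interrupts_text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = defaultdict(int)
    for line in lines[1:]:
        if pattern not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ number with a trailing colon,
        # followed by one counter per CPU.
        counts = fields[1:1 + len(cpus)]
        for cpu, count in zip(cpus, counts):
            if count.isdigit():
                totals[cpu] += int(count)
    return dict(totals)

# Made-up sample in /proc/interrupts format (illustration only).
SAMPLE = """\
           CPU0       CPU1       CPU2
 98:    1200000          0          5   PCI-MSI 1048576-edge      mlx5_comp0@pci:0000:41:00.0
 99:     900000          3          0   PCI-MSI 1048577-edge      mlx5_comp1@pci:0000:41:00.0
100:         10         20         30   PCI-MSI 524288-edge      nvme0q1
"""

totals = irq_counts_per_cpu(SAMPLE)
# Print CPUs ordered by how many matching interrupts they handled.
for cpu, n in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(cpu, n)
```

On the real host the same function can be fed `open("/proc/interrupts").read()`. The counters are cumulative since boot, so taking two samples a few seconds apart during the evening peak and subtracting them shows the actual per-core interrupt rate.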

Jun 12 22:12:22 138224 kernel: [45980.924388] ------------[ cut here ]------------

Jun 12 22:12:22 138224 kernel: [45980.924390] NETDEV WATCHDOG: eth0 (mlx5_core): transmit queue 0 timed out

Jun 12 22:12:22 138224 kernel: [45980.924445] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:466 dev_watchdog+0x20d/0x220

Jun 12 22:12:22 138224 kernel: [45980.924447] Modules linked in: binfmt_misc msr mst_pciconf(OE) amd64_edac_mod edac_mce_amd ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr aufs(OE) ast joydev ttm drm_kms_helper drm evdev sg i2c_algo_bit ccp rng_core sp5100_tco ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq acpi_cpufreq button tcp_bbr sch_fq bonding lp parport loop ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear hid_generic usbhid hid raid1 md_mod sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ahci libahci libata nvme xhci_pci xhci_hcd nvme_core scsi_mod mlx5_core(OE) usbcore mlxfw(OE) mlx_compat(OE) devlink i2c_piix4 usb_common

Jun 12 22:12:22 138224 kernel: [45980.924485] [last unloaded: mst_pci]

Jun 12 22:12:22 138224 kernel: [45980.924488] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G OE 4.19.0-16-amd64 #1 Debian 4.19.181-1

Jun 12 22:12:22 138224 kernel: [45980.924489] Hardware name: Supermicro AS -2124BT-HNTR/H12DST-B, BIOS 1.1 01/10/2020

Jun 12 22:12:22 138224 kernel: [45980.924491] RIP: 0010:dev_watchdog+0x20d/0x220

Jun 12 22:12:22 138224 kernel: [45980.924493] Code: 00 49 63 4e e0 eb 92 4c 89 e7 c6 05 8f 09 b0 00 01 e8 97 bd fc ff 89 d9 4c 89 e6 48 c7 c7 a0 01 6e b8 48 89 c2 e8 5c 89 10 00 <0f> 0b eb c0 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44

Jun 12 22:12:22 138224 kernel: [45980.924494] RSP: 0018:ffff89b30dc83e90 EFLAGS: 00010286

Jun 12 22:12:22 138224 kernel: [45980.924495] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006

Jun 12 22:12:22 138224 kernel: [45980.924495] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff89b30dc966b0

Jun 12 22:12:22 138224 kernel: [45980.924496] RBP: ffff89b2c5b8045c R08: 0000000000000943 R09: 0000000000000004

Jun 12 22:12:22 138224 kernel: [45980.924497] R10: 0000000000000000 R11: 0000000000000001 R12: ffff89b2c5b80000

Jun 12 22:12:22 138224 kernel: [45980.924498] R13: 0000000000000002 R14: ffff89b2c5b80480 R15: 0000000000000208

Jun 12 22:12:22 138224 kernel: [45980.924499] FS: 0000000000000000(0000) GS:ffff89b30dc80000(0000) knlGS:0000000000000000

Jun 12 22:12:22 138224 kernel: [45980.924500] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Jun 12 22:12:22 138224 kernel: [45980.924500] CR2: 000055af702e6708 CR3: 0000011b1800a000 CR4: 0000000000340ee0

Jun 12 22:12:22 138224 kernel: [45980.924501] Call Trace:

Jun 12 22:12:22 138224 kernel: [45980.924504]

Jun 12 22:12:22 138224 kernel: [45980.924508] ? pfifo_fast_enqueue+0x110/0x110

Jun 12 22:12:22 138224 kernel: [45980.924513] call_timer_fn+0x2b/0x130

Jun 12 22:12:22 138224 kernel: [45980.924515] run_timer_softirq+0x1c7/0x3e0

Jun 12 22:12:22 138224 kernel: [45980.924517] ? ktime_get+0x3a/0xa0

Jun 12 22:12:22 138224 kernel: [45980.924520] __do_softirq+0xde/0x2d8

Jun 12 22:12:22 138224 kernel: [45980.924526] irq_exit+0xba/0xc0

Jun 12 22:12:22 138224 kernel: [45980.924527] smp_apic_timer_interrupt+0x74/0x140

Jun 12 22:12:22 138224 kernel: [45980.924530] apic_timer_interrupt+0xf/0x20

Jun 12 22:12:22 138224 kernel: [45980.924531]

Jun 12 22:12:22 138224 kernel: [45980.924533] RIP: 0010:native_safe_halt+0xe/0x10

Jun 12 22:12:22 138224 kernel: [45980.924534] Code: ff ff 7f c3 65 48 8b 04 25 40 5c 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 c4 eb 80 90 e9 07 00 00 00 0f 00 2d f6 2f 4d 00 fb f4 90 e9 07 00 00 00 0f 00 2d e6 2f 4d 00 f4 c3 90 90 0f 1f 44 00

Jun 12 22:12:22 138224 kernel: [45980.924534] RSP: 0018:ffff9e9558a2bea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13

Jun 12 22:12:22 138224 kernel: [45980.924535] RAX: ffffffffb7f34aa0 RBX: 0000000000000002 RCX: ffffffffb884f290

Jun 12 22:12:22 138224 kernel: [45980.924536] RDX: 00000000a5bbe6da RSI: ffffffffb884aef8 RDI: 000029d1ff87e600

Jun 12 22:12:22 138224 kernel: [45980.924537] RBP: 0000000000000002 R08: 0000000000000002 R09: 0000000000021a00

Jun 12 22:12:22 138224 kernel: [45980.924537] R10: 00005e4ba43b061e R11: 0000000000000000 R12: 0000000000000000

Jun 12 22:12:22 138224 kernel: [45980.924537] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Jun 12 22:12:22 138224 kernel: [45980.924539] ? __sched_text_end+0x7/0x7

Jun 12 22:12:22 138224 kernel: [45980.924540] default_idle+0x1c/0x140

Jun 12 22:12:22 138224 kernel: [45980.924545] do_idle+0x1e3/0x270

Jun 12 22:12:22 138224 kernel: [45980.924547] cpu_startup_entry+0x6f/0x80

Jun 12 22:12:22 138224 kernel: [45980.924550] start_secondary+0x1a4/0x200

Jun 12 22:12:22 138224 kernel: [45980.924554] secondary_startup_64+0xa4/0xb0

Jun 12 22:12:22 138224 kernel: [45980.924556] ---[ end trace 731a34cdffaba186 ]---

Jun 12 22:12:20 138224 kernel: [45979.448511] NMI backtrace for cpu 64

Jun 12 22:12:20 138224 kernel: [45979.448514] CPU: 64 PID: 333 Comm: ksoftirqd/64 Tainted: G OE 4.19.0-16-amd64 #1 Debian 4.19.181-1

Jun 12 22:12:20 138224 kernel: [45979.448515] Hardware name: Supermicro AS -2124BT-HNTR/H12DST-B, BIOS 1.1 01/10/2020

Jun 12 22:12:20 138224 kernel: [45979.448516] Call Trace:

Jun 12 22:12:20 138224 kernel: [45979.448518]

Jun 12 22:12:20 138224 kernel: [45979.448525] dump_stack+0x66/0x81

Jun 12 22:12:20 138224 kernel: [45979.448527] nmi_cpu_backtrace.cold.4+0x13/0x50

Jun 12 22:12:20 138224 kernel: [45979.448531] ? lapic_can_unplug_cpu+0x80/0x80

Jun 12 22:12:20 138224 kernel: [45979.448534] nmi_trigger_cpumask_backtrace+0xf9/0x100

Jun 12 22:12:20 138224 kernel: [45979.448536] rcu_dump_cpu_stacks+0x9b/0xcb

Jun 12 22:12:20 138224 kernel: [45979.448537] rcu_check_callbacks.cold.81+0x1db/0x335

Jun 12 22:12:20 138224 kernel: [45979.448540] ? tick_sched_do_timer+0x60/0x60

Jun 12 22:12:20 138224 kernel: [45979.448542] update_process_times+0x28/0x60

Jun 12 22:12:20 138224 kernel: [45979.448543] tick_sched_handle+0x22/0x60

Jun 12 22:12:20 138224 kernel: [45979.448544] tick_sched_timer+0x37/0x70

Jun 12 22:12:20 138224 kernel: [45979.448546] __hrtimer_run_queues+0x100/0x280

Jun 12 22:12:20 138224 kernel: [45979.448547] hrtimer_interrupt+0x100/0x210

Jun 12 22:12:20 138224 kernel: [45979.448549] ? __perf_event_read+0xf5/0x230

Jun 12 22:12:20 138224 kernel: [45979.448551] smp_apic_timer_interrupt+0x6a/0x140

Jun 12 22:12:20 138224 kernel: [45979.448553] apic_timer_interrupt+0xf/0x20

Jun 12 22:12:20 138224 kernel: [45979.448554]

Jun 12 22:12:20 138224 kernel: [45979.448556] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20

Jun 12 22:12:20 138224 kernel: [45979.448558] Code: d8 48 3d 90 d0 03 00 76 cc 80 4d 00 08 eb 98 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 07

Jun 12 22:12:20 138224 kernel: [45979.448558] RSP: 0018:ffff9e955980bcb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13

Jun 12 22:12:20 138224 kernel: [45979.448560] RAX: 0000000011c00040 RBX: ffff89a9420de178 RCX: ffff89af31168c78

Jun 12 22:12:20 138224 kernel: [45979.448560] RDX: ffff89a9420de198 RSI: 0000000000000246 RDI: 0000000000000246

Jun 12 22:12:20 138224 kernel: [45979.448561] RBP: 0000000100ae4230 R08: ffff8ab2be800000 R09: 000000000000a4f6

Jun 12 22:12:20 138224 kernel: [45979.448561] R10: 000000000000527b R11: 0000000000000000 R12: ffff8ab2be81a7c0

Jun 12 22:12:20 138224 kernel: [45979.448562] R13: 0000000000000000 R14: ffff8ab2be81a7c0 R15: 000000001c000040

Jun 12 22:12:20 138224 kernel: [45979.448564] mod_timer+0x177/0x400

Jun 12 22:12:20 138224 kernel: [45979.448567] sk_reset_timer+0x14/0x30

Jun 12 22:12:20 138224 kernel: [45979.448570] tcp_retransmit_timer+0x530/0xa40

Jun 12 22:12:20 138224 kernel: [45979.448572] tcp_write_timer_handler+0xb1/0x210

Jun 12 22:12:20 138224 kernel: [45979.448573] tcp_write_timer+0x71/0x90

Jun 12 22:12:20 138224 kernel: [45979.448574] ? tcp_write_timer_handler+0x210/0x210

Jun 12 22:12:20 138224 kernel: [45979.448575] call_timer_fn+0x2b/0x130

Jun 12 22:12:20 138224 kernel: [45979.448576] run_timer_softirq+0x1c7/0x3e0

Jun 12 22:12:20 138224 kernel: [45979.448577] __do_softirq+0xde/0x2d8

Jun 12 22:12:20 138224 kernel: [45979.448581] ? sort_range+0x20/0x20

Jun 12 22:12:20 138224 kernel: [45979.448584] run_ksoftirqd+0x26/0x40

Jun 12 22:12:20 138224 kernel: [45979.448585] smpboot_thread_fn+0xc5/0x160

Jun 12 22:12:20 138224 kernel: [45979.448588] kthread+0x112/0x130

Jun 12 22:12:20 138224 kernel: [45979.448589] ? kthread_bind+0x30/0x30

Jun 12 22:12:20 138224 kernel: [45979.448590] ret_from_fork+0x22/0x40