Hello,
I am seeing reproducible kernel panics in __ib_process_cq() when ib_iser and ib_isert are used simultaneously on a ConnectX-5, both on kernel 5.15 with MLNX_OFED and on kernel 6.1 with the in-tree drivers.
System setup:
Architecture: x86_64, Debian-based
Hardware: TAROX R2242i G6, Intel S2600WFT, ConnectX-5 (MCX516A-CCA_Ax)
Mellanox Firmware: 16.35.4030
Kernel versions tested:
5.15.178 (with OFED)
6.1.128 (without OFED – using in-tree drivers)
OFED versions tested (only with kernel 5.15):
5.8-6.0.4.2-LTS
25.01-0.6.0
The issue occurs with every OFED version tested on 5.15.
The same crash pattern also appears on kernel 6.1 without OFED (in-tree RDMA stack).
Symptoms:
The panic always happens inside __ib_process_cq() (from ib_core.ko).
It most often occurs during system shutdown or reboot, which suggests it is triggered during session or connection teardown.
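To check that every collected log really points at the same function, a small sketch along these lines can pull the symbol, offset, size and module out of RIP and call-trace lines. The regular expression and the helper name find_symbol_hits are my own and only modelled on the oops lines in this report, so treat it as an illustration rather than a general dmesg parser.

```python
import re

# Matches kernel symbol references such as
#   __ib_process_cq+0x87/0x160 [ib_core]
#   ib_poll_handler+0x2c/0xc0 [ib_core <build id>]
# The pattern is an assumption derived from the lines in this report.
SYM_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)(?:\s+[0-9a-f]+)?\])?"
)


def find_symbol_hits(log_text, symbol):
    """Return (offset, size, module) for every reference to `symbol`."""
    hits = []
    for line in log_text.splitlines():
        for m in SYM_RE.finditer(line):
            if m.group("symbol") == symbol:
                hits.append((int(m.group("offset"), 16),
                             int(m.group("size"), 16),
                             m.group("module")))
    return hits


# Three lines copied from the traces in this report.
sample = """\
[15600.188421] RIP: 0010:__ib_process_cq+0x87/0x160 [ib_core]
[15600.188571] ib_poll_handler+0x2c/0xc0 [ib_core b0210261ffea76f411252b6af10bbc3e1e7b90c1]
[ 664.888834] RIP: 0010:__ib_process_cq+0x87/0x190 [ib_core]
"""

for offset, size, module in find_symbol_hits(sample, "__ib_process_cq"):
    print(f"__ib_process_cq+{offset:#x}/{size:#x} in {module}")
```

Run against the full logs, this makes it easy to compare the faulting offset between the OFED and in-tree builds of ib_core.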
Call traces:
5.15.178 with OFED 5.8-6.0.4.2-LTS:
[ 2740.653334] BUG: unable to handle page fault for address: 0000000000100480
[ 2740.653348] #PF: supervisor instruction fetch in kernel mode
[ 2740.653350] #PF: error_code(0x0010) - not-present page
[ 2740.653352] PGD e54c933067 P4D e54c933067 PUD d95a381067 PMD 0
[ 2740.653355] Oops: 0010 [#1] SMP NOPTI
[ 2740.653358] CPU: 4 PID: 18745 Comm: targetcli Kdump: loaded Tainted: P O 5.15.178 #7 b1582a1eeae429a050e4a733f8c80b7a8fa79a93
[ 2740.653363] Hardware name: TAROX ParX R2242i G6 Server/S2600WFT, BIOS SE5C620.86B.02.01.0012.070720200218 07/07/2020
[ 2740.653364] RIP: 0010:0x100480
[ 2740.653367] Code: Unable to access opcode bytes at RIP 0x100456.
[ 2740.653368] RSP: 0000:ffff895c88063e20 EFLAGS: 00010286
[ 2740.653370] RAX: ffff88811aac6000 RBX: ffff88811aac6048 RCX: 0000000000000001
[ 2740.653372] RDX: 0000000000100480 RSI: ffff88811aac6000 RDI: ffff888185465800
[ 2740.653373] RBP: ffff888185465800 R08: 0000000000000029 R09: 0000000000001fff
[ 2740.653374] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 2740.653375] R13: 0000000000000000 R14: ffff88811aac6000 R15: 0000000000000010
[ 2740.653377] FS: 00007f4d21e60700(0000) GS:ffff896ca1400000(0000) knlGS:0000000000000000
[ 2740.653378] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2740.653380] CR2: 0000000000100480 CR3: 000000e1ae3f0001 CR4: 00000000007706e0
[ 2740.653381] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2740.653382] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2740.653383] PKRU: 55555554
[ 2740.653383] Call Trace:
[ 2740.653386]
[ 2740.653387] ? __die_body.cold+0x1a/0x1f
[ 2740.653396] ? page_fault_oops+0x7a/0x1f0
[ 2740.653403] ? exc_page_fault+0x7b/0x150
[ 2740.653407] ? asm_exc_page_fault+0x22/0x30
[ 2740.653414] ? __ib_process_cq+0x97/0x1a0 [ib_core f68e58b9c1a917ab96de9e1f308775d827bf696d]
[ 2740.653450] ? ib_poll_handler+0x2c/0xc0 [ib_core f68e58b9c1a917ab96de9e1f308775d827bf696d]
[ 2740.653473] ? irq_poll_softirq+0xa4/0x120
[ 2740.653477] ? handle_softirqs+0xe4/0x270
[ 2740.653481] ? irq_exit_rcu+0x93/0xc0
[ 2740.653483] ? common_interrupt+0x44/0xa0
[ 2740.653485] ? asm_common_interrupt+0x22/0x40
[ 2740.653489]
[ 2740.653489] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O) ib_isert(O) target_core_file zfs(PO) qat_api(O) spl(O) intel_qat(O) uio iptable_filter mst_pciconf(O) target_core_iblock target_core_pscsi iscsi_target_mod target_core_mod nvmet_rdma nvmet_tcp nvmet nvme_rdma nvme_tcp nvme_fabrics bonding ib_iser(O) rdma_cm(O) iw_cm(O) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse mlx5_ib(O) ib_uverbs(O) ib_umad(O) ib_ipoib(O) ib_cm(O) ib_core(O) mlx4_ib(O) mlx4_en mlx4_core bnxt_en(O) mlx5_core(O) isst_if_common x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel aesni_intel crypto_simd cryptd rapl i40e(O) intel_cstate mlxdevm(O) mlxfw(O) ixgbe(O) pci_hyperv_intf mlx_compat(O) ptp switchtec pps_core acpi_pad button nls_iso8859_1 nls_cp437 nvme nvme_core sg ipmi_si ipmi_devintf ipmi_msghandler vfat fat aufs scsi_transport_fc [last unloaded: ipmi_watchdog]
[ 2740.653548] CR2: 0000000000100480
5.15.178 with OFED 25.01-0.6.0:
[ 37.476091] mlx5_core 0000:af:00.0: enabling device (0140 -> 0142)
[ 37.476356] mlx5_core 0000:af:00.0: firmware version: 16.35.4030
[ 37.476388] mlx5_core 0000:af:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 37.676600] Broadcom NetXtreme-C/E/S driver bnxt_en v1.10.3-232.0.155.5+
[ 37.676927] bnxt_en 0000:18:00.0 (unnamed net_device) (uninitialized): Device requests max timeout of 100 seconds, may trigger hung task watchdog
[ 37.701197] bnxt_en 0000:18:00.0 eth4: Broadcom BCM57414 NetXtreme-E 10Gb/25Gb Ethernet found at mem 387fffe10000, node addr b0:26:28:62:79:50
[ 37.701204] bnxt_en 0000:18:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 37.701513] bnxt_en 0000:18:00.1 (unnamed net_device) (uninitialized): Device requests max timeout of 100 seconds, may trigger hung task watchdog
[ 37.729103] bnxt_en 0000:18:00.1 eth5: Broadcom BCM57414 NetXtreme-E 10Gb/25Gb Ethernet found at mem 387fffe00000, node addr b0:26:28:62:79:51
[ 37.729112] bnxt_en 0000:18:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 37.890480] mlx5_core 0000:af:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 37.890667] mlx5_core 0000:af:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 37.895964] mlx5_core 0000:af:00.0: Port module event: module 0, Cable plugged
[ 37.896217] mlx5_core 0000:af:00.0: mlx5_pcie_event:304:(pid 9): PCIe slot advertised sufficient power (27W).
[ 37.908283] mlx5_core 0000:af:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[ 38.115315] mlx5_core 0000:af:00.1: enabling device (0140 -> 0142)
[ 38.115579] mlx5_core 0000:af:00.1: firmware version: 16.35.4030
[ 38.115611] mlx5_core 0000:af:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 38.552085] mlx5_core 0000:af:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 38.552287] mlx5_core 0000:af:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 38.557730] mlx5_core 0000:af:00.1: Port module event: module 1, Cable unplugged
[ 38.557983] mlx5_core 0000:af:00.1: mlx5_pcie_event:304:(pid 11589): PCIe slot advertised sufficient power (27W).
[ 38.568017] mlx5_core 0000:af:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
…
…
[15600.188169] iser: iser_err_comp: control failure: transport retry counter exceeded (12) vend_err 0x81
[15600.188411] general protection fault, probably for non-canonical address 0x6a0f032f02100374: 0000 [#1] SMP NOPTI
[15600.188415] CPU: 14 PID: 6130 Comm: ledctl Kdump: loaded Tainted: P O 5.15.178 #7 b1582a1eeae429a050e4a733f8c80b7a8fa79a93
[15600.188420] Hardware name: TAROX ParX R2242i G6 Server/S2600WFT, BIOS SE5C620.86B.02.01.0016.032120230338 03/21/2023
[15600.188421] RIP: 0010:__ib_process_cq+0x87/0x160 [ib_core]
[15600.188459] Code: 37 a8 04 00 85 c0 7f 7f 45 85 e4 7e 64 48 8b 04 24 49 63 d4 48 8d 14 d2 49 89 c6 48 8d 1c d0 eb 17 48 8b 12 4c 89 f6 48 89 ef <ff> d2 0f 1f 00 49 83 c6 48 4c 39 f3 74 1b 49 8b 16 48 85 d2 75 e1
[15600.188461] RSP: 0018:ffff88a82fd85ee8 EFLAGS: 00010286
[15600.188464] RAX: ffff88819030f000 RBX: ffff88819030f048 RCX: 0000000000000001
[15600.188465] RDX: 6a0f032f02100374 RSI: ffff88819030f000 RDI: ffff8881af4d5400
[15600.188466] RBP: ffff8881af4d5400 R08: 0000000001f905b8 R09: 0000000000002000
[15600.188467] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[15600.188468] R13: 0000000000000000 R14: ffff88819030f000 R15: 0000000000000010
[15600.188470] FS: 00007fe90a015700(0000) GS:ffff88a82fd80000(0000) knlGS:0000000000000000
[15600.188471] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15600.188473] CR2: 000000002fbfc000 CR3: 0000001c6d7b0002 CR4: 00000000007706e0
[15600.188474] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15600.188475] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[15600.188476] PKRU: 55555554
[15600.188477] Call Trace:
[15600.188480] <IRQ>
[15600.188482] ? __die_body.cold+0x1a/0x1f
[15600.188490] ? die_addr+0x38/0x60
[15600.188495] ? exc_general_protection+0x1bc/0x410
[15600.188502] ? asm_exc_general_protection+0x22/0x30
[15600.188507] ? __ib_process_cq+0x87/0x160 [ib_core b0210261ffea76f411252b6af10bbc3e1e7b90c1]
[15600.188528] ? __ib_process_cq+0x89/0x160 [ib_core b0210261ffea76f411252b6af10bbc3e1e7b90c1]
[15600.188550] ib_poll_handler+0x2c/0xc0 [ib_core b0210261ffea76f411252b6af10bbc3e1e7b90c1]
[15600.188571] irq_poll_softirq+0xa4/0x120
[15600.188577] handle_softirqs+0xe4/0x270
[15600.188583] irq_exit_rcu+0x93/0xc0
[15600.188586] common_interrupt+0x82/0xa0
[15600.188588] </IRQ>
[15600.188589] <TASK>
[15600.188589] asm_common_interrupt+0x22/0x40
[15600.188592] RIP: 0010:generic_permission+0x7d/0x290
Clean 6.1.128 without OFED:
[ 664.888430] iser: iser_err_comp: command failure: transport retry counter exceeded (12) vend_err 0x81
[ 664.888788] iser: iser_err_comp: command failure: transport retry counter exceeded (12) vend_err 0x81
[ 664.888823] general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI
[ 664.888828] CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.128 #1 ab8ea0ccaca481c37db87ad91c7d66aee175cbb6
[ 664.888833] Hardware name: TAROX ParX R2242i G6 Server/S2600WFT, BIOS SE5C620.86B.02.01.0016.032120230338 03/21/2023
[ 664.888834] RIP: 0010:__ib_process_cq+0x87/0x190 [ib_core]
[ 664.888873] Code: 47 5e 04 00 85 c0 7f 7f 45 85 e4 7e 64 48 8b 04 24 49 63 d4 48 8d 14 d2 49 89 c6 48 8d 1c d0 eb 17 48 8b 12 4c 89 f6 48 89 ef <ff> d2 0f 1f 00 49 83 c6 48 4c 39 f3 74 1b 49 8b 16 48 85 d2 75 e1
[ 664.888875] RSP: 0018:ffff88a82fcc5ee8 EFLAGS: 00010286
[ 664.888878] RAX: ffff8894e68de000 RBX: ffff8894e68de048 RCX: 0000000000000001
[ 664.888879] RDX: dead000000000122 RSI: ffff8894e68de000 RDI: ffff8894ec11dc00
[ 664.888881] RBP: ffff8894ec11dc00 R08: 00000000001f68b6 R09: 0000000000002000
[ 664.888882] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 664.888883] R13: 0000000000000000 R14: ffff8894e68de000 R15: 0000000000000010
[ 664.888885] FS: 0000000000000000(0000) GS:ffff88a82fcc0000(0000) knlGS:0000000000000000
[ 664.888887] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 664.888888] CR2: 00007f6f3f4dce90 CR3: 0000000005c0a005 CR4: 00000000007706e0
[ 664.888890] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 664.888891] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 664.888893] PKRU: 55555554
[ 664.888893] Call Trace:
[ 664.888897] <IRQ>
[ 664.888899] ? __die_body.cold+0x1a/0x1f
[ 664.888906] ? die_addr+0x38/0x60
[ 664.888912] ? exc_general_protection+0x1b8/0x370
[ 664.888917] ? asm_exc_general_protection+0x22/0x30
[ 664.888924] ? __ib_process_cq+0x87/0x190 [ib_core ee7f0c75fd1d6417948c0612056162bccbef758a]
[ 664.888941] ? __ib_process_cq+0x89/0x190 [ib_core ee7f0c75fd1d6417948c0612056162bccbef758a]
[ 664.888961] ib_poll_handler+0x2c/0xd0 [ib_core ee7f0c75fd1d6417948c0612056162bccbef758a]
[ 664.888980] irq_poll_softirq+0xa2/0x120
[ 664.888986] handle_softirqs+0xe1/0x290
[ 664.888991] __irq_exit_rcu+0x8d/0xc0
[ 664.888993] common_interrupt+0x82/0xa0
[ 664.888996] </IRQ>
[ 664.888997] <TASK>
[ 664.888998] asm_common_interrupt+0x22/0x40
[ 664.889002] RIP: 0010:cpuidle_enter_state+0xf0/0x420
[ 664.889006] Code: 00 00 31 ff e8 51 8c 6b ff 45 84 ff 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 1f 03 00 00 31 ff e8 e6 e6 70 ff fb 0f 1f 44 00 00 <45> 85 f6 0f 88 27 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[ 664.889008] RSP: 0018:ffff889486303e98 EFLAGS: 00000246
[ 664.889010] RAX: ffff88a82fced200 RBX: ffffe8ffffcc3fd8 RCX: 0000000000000000
[ 664.889011] RDX: 0000000000000007 RSI: 0000000000000002 RDI: 0000000000000000
[ 664.889013] RBP: 0000000000000003 R08: ffffb366110bfe75 R09: 0000000021c36ae8
[ 664.889014] R10: 0000000000000018 R11: 0000000000001d5b R12: ffffffff830ac560
[ 664.889015] R13: 0000009ace76219a R14: 0000000000000003 R15: 0000000000000000
[ 664.889018] ? cpuidle_enter_state+0xcf/0x420
[ 664.889021] cpuidle_enter+0x29/0x40
[ 664.889022] do_idle+0x1e8/0x260
[ 664.889025] cpu_startup_entry+0x26/0x30
[ 664.889027] start_secondary+0x103/0x110
[ 664.889032] secondary_startup_64_no_verify+0xce/0xdb
[ 664.889038] </TASK>
[ 664.889038] Modules linked in: ib_isert iscsi_scst(O) scst_vdisk(O) scst(O) dlm target_core_file zfs(O) spl(O) iptable_filter mst_pciconf(O) target_core_iblock target_core_pscsi iscsi_target_mod target_core_mod bonding ib_iser rdma_cm iw_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse mlx5_ib ib_umad ib_ipoib ib_cm mlx4_ib ib_uverbs ib_core mlx4_en mlx4_core isst_if_common x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel polyval_clmulni polyval_generic aesni_intel crypto_simd cryptd mlx5_core bnx2x mlxfw pci_hyperv_intf switchtec mdio i40e(O) rapl intel_cstate ptp pps_core xhci_pci xhci_pci_renesas acpi_pad button nls_iso8859_1 nls_cp437 nvme nvme_core nvme_common sg ipmi_si ipmi_devintf ipmi_msghandler vfat fat aufs scsi_transport_fc [last unloaded: ipmi_watchdog]
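In all three traces RDX holds the address that faulted. By my reading of the Code: bytes (48 8b 12 loads through RDX, ff d2 is an indirect call through RDX), this is the function pointer that __ib_process_cq() was about to call, which in the upstream source is wc->wr_cqe->done. The sketch below classifies the three RDX values. It assumes 48-bit virtual addresses (4-level paging) and the default x86_64 CONFIG_ILLEGAL_POINTER_VALUE of 0xdead000000000000; neither is confirmed from my kernel config here, and the helper names are made up for illustration.

```python
# Assumptions (not taken from the logs): 48-bit virtual addresses and the
# default x86_64 pointer-poison delta from include/linux/poison.h.
POISON_POINTER_DELTA = 0xDEAD000000000000
LIST_POISON1 = 0x100 + POISON_POINTER_DELTA
LIST_POISON2 = 0x122 + POISON_POINTER_DELTA


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) are all equal (sign-extended address)."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def classify(addr):
    """Give a rough label for a faulting pointer value."""
    if addr == LIST_POISON1:
        return "LIST_POISON1"
    if addr == LIST_POISON2:
        return "LIST_POISON2"
    if not is_canonical(addr):
        return "non-canonical"
    if addr < (1 << 47):
        return "canonical, user-space half"
    return "canonical, kernel half"


# RDX values from the three traces in this report.
rdx_values = {
    "5.15.178 + OFED 5.8": 0x0000000000100480,
    "5.15.178 + OFED 25.01": 0x6A0F032F02100374,
    "6.1.128 in-tree": 0xDEAD000000000122,
}

for name, value in rdx_values.items():
    print(f"{name}: {value:#018x} -> {classify(value)}")
```

Under those assumptions the 6.1 value matches LIST_POISON2, which would be consistent with the completion pointing at an object that had already been removed from a list and possibly freed. That is only my guess; the other two values look like arbitrary stale data.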
The problem reproduces roughly once every few restarts.
Is this a known issue in the ib_core event handling or CQ polling logic?
Are there any fixes or recommended patches available upstream or in more recent OFED versions?
Please let me know if any additional logs are needed.
Thanks
Artur Piechocki