Kernel Oops in rdma_disconnect [rdma_cm] / MLNX_OFED drivers / ConnectX-4 Lx EN

Hello

We are experiencing a kernel oops on our NAS/datastore machines when using iSER/RDMA over Ethernet to serve VMware ESXi 6.7-14320388 hosts.

All NICs are ConnectX-4 Lx EN adapter cards (25GbE, dual-port) with FW version 14.25.1020.

NAS OS: Debian 10.6 with kernel 4.19.146-1

MLNX_OFED driver version 4.9-0.1.7 (we also tried the latest, v5.1-2.3.7.1).

To reproduce this BUG:

Given two datastore machines:

nas1 and nas2

and two VMware compute hosts:

esxi01 and esxi02,

where nas1 and nas2 are both connected via iSER to esxi01 and esxi02 over an Ethernet network (with 2 IPv4 addresses and 2 VLANs each),

with a running virtual machine vm1 (meaning there is I/O activity on disk) on esxi02 with its datastore on nas2,

all machines equipped with ConnectX-4 Lx,

and managed using vCenter,

we can reproduce the kernel oops on the NAS machine most of the time with the following sequence:

  • the existing vm1 is migrated with vMotion (both compute host and datastore) to esxi01 and nas1

  • vm1 is migrated back to esxi02, but its datastore stays on nas1 (only the compute host is changed)

  • esxi01 is rebooted

  • the kernel oops occurs on nas1 when esxi01 comes back online and attempts to reconnect to the nas1 datastore

  • because of the kernel oops on nas1, esxi01 WILL NOT RECONNECT to nas1 until nas1 is rebooted.

We also experience this when the roles of nas1 and nas2 are swapped.

Any ideas? How can we debug this, what should we check, and what might cause it?

==== dmesg output below ====

[193601.189492] iSCSI Login timeout on Network Portal 172.16.22.248:3260

[193601.189542] isert: isert_get_login_rx: isert_conn 00000000e09afeb9 interrupted before got login req

[193601.189589] iSCSI Login negotiation failed.

[193601.189623] BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8

[193601.189663] PGD 0 P4D 0

[193601.189681] Oops: 0000 [#1] SMP NOPTI

[193601.189704] CPU: 0 PID: 1630 Comm: iscsi_np Tainted: G OE 4.19.0-11-amd64 #1 Debian 4.19.146-1

[193601.189753] Hardware name: Supermicro AS -1113S-WN10RT/H11SSW-NT, BIOS 2.1 02/21/2020

[193601.189802] RIP: 0010:rdma_disconnect+0xa/0x90 [rdma_cm]

[193601.189831] Code: c0 0f 94 c2 39 83 08 01 00 00 0f 94 c0 38 c2 75 cb 31 c0 eb e1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53 48 89 fb <48> 8b bf a8 02 00 00 48 85 ff 74 65 48 8b 0b 0f b6 83 c0 01 00 00

[193601.189923] RSP: 0018:ffffb78607837e38 EFLAGS: 00010297

[193601.189952] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002

[193601.189988] RDX: ffff8bfdfd488000 RSI: da6751925f4100e8 RDI: 0000000000000000

[193601.190024] RBP: 0000000000000000 R08: 0000000000025520 R09: ffffffffc0817976

[193601.190061] R10: ffffe4813dd3f800 R11: 0000000000000001 R12: ffff8bfe00217000

[193601.190097] R13: ffff8bfdab114000 R14: ffff8bfe00217350 R15: ffff8bfe002173c0

[193601.190134] FS: 0000000000000000(0000) GS:ffff8bfe0ea00000(0000) knlGS:0000000000000000

[193601.190175] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[193601.190205] CR2: 00000000000002a8 CR3: 0000001fc0b6a000 CR4: 0000000000340ef0

[193601.190242] Call Trace:

[193601.190265] isert_conn_terminate+0x2f/0x50 [ib_isert]

[193601.190296] isert_wait_conn+0x51/0x2b0 [ib_isert]

[193601.190336] iscsi_target_login_sess_out+0xa2/0x150 [iscsi_target_mod]

[193601.190382] iscsi_target_login_thread+0x9a5/0xe30 [iscsi_target_mod]

[193601.190426] ? iscsi_target_login_sess_out+0x150/0x150 [iscsi_target_mod]

[193601.190464] kthread+0x112/0x130

[193601.190484] ? kthread_bind+0x30/0x30

[193601.190507] ret_from_fork+0x22/0x40

[193601.190530] Modules linked in: nfnetlink_queue nfnetlink_log nfnetlink bluetooth drbg ansi_cprng ecdh_generic rfkill target_core_user uio target_core_pscsi target_core_file target_core_iblock binfmt_misc 8021q garp stp mrp llc ipmi_ssif rdma_ucm(OE) ib_ipoib(OE) ib_umad(OE) esp6_offload esp6 esp4_offload esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul efi_pstore ghash_clmulni_intel nls_ascii pcspkr efivars nls_cp437 vfat fat ast ttm drm_kms_helper drm i2c_algo_bit joydev evdev ccp rng_core sp5100_tco button ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq acpi_cpufreq tcp_htcp ib_isert(OE) iscsi_target_mod target_core_mod ib_iser(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_core(OE) libiscsi scsi_transport_iscsi

[193601.190899] configfs scsi_mod knem(OE) efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb raid10 raid1 raid0 multipath linear hid_generic usbhid hid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic md_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper xhci_pci xhci_hcd bnxt_en nvme usbcore mlx5_core(OE) nvme_core mlxfw(OE) mdev(OE) devlink mlx_compat(OE) i2c_piix4 usb_common

[193601.191122] CR2: 00000000000002a8

[193601.191142] ---[ end trace deb87b97d623f05b ]---

[193601.297811] RIP: 0010:rdma_disconnect+0xa/0x90 [rdma_cm]

[193601.299287] Code: c0 0f 94 c2 39 83 08 01 00 00 0f 94 c0 38 c2 75 cb 31 c0 eb e1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53 48 89 fb <48> 8b bf a8 02 00 00 48 85 ff 74 65 48 8b 0b 0f b6 83 c0 01 00 00

[193601.302328] RSP: 0018:ffffb78607837e38 EFLAGS: 00010297

[193601.303850] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002

[193601.305375] RDX: ffff8bfdfd488000 RSI: da6751925f4100e8 RDI: 0000000000000000

[193601.306921] RBP: 0000000000000000 R08: 0000000000025520 R09: ffffffffc0817976

[193601.308429] R10: ffffe4813dd3f800 R11: 0000000000000001 R12: ffff8bfe00217000

[193601.309917] R13: ffff8bfdab114000 R14: ffff8bfe00217350 R15: ffff8bfe002173c0

[193601.311391] FS: 0000000000000000(0000) GS:ffff8bfe0ea00000(0000) knlGS:0000000000000000

[193601.312871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[193601.314355] CR2: 00000000000002a8 CR3: 0000001fc0b6a000 CR4: 0000000000340ef0
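
As a starting point for reading the trace above, here is a minimal Python sketch (not part of the original report) that checks whether the oops is consistent with rdma_disconnect being entered with a NULL first argument. It rests on two assumptions: the first argument arrives in RDI, as in the x86-64 System V calling convention used by the kernel, and the bytes at the faulting RIP (the ones starting at the `<48>` marker) encode `mov 0x2a8(%rdi),%rdi`, i.e. `48 8b bf` followed by a little-endian 32-bit displacement. The function names are made up for illustration, and SAMPLE is a shortened copy of lines from the dmesg output.

```python
import re

# Shortened copy of the relevant dmesg lines (timestamps removed).
SAMPLE = """\
RIP: 0010:rdma_disconnect+0xa/0x90 [rdma_cm]
Code: 0f 1f 44 00 00 55 53 48 89 fb <48> 8b bf a8 02 00 00 48 85 ff 74 65
RDX: ffff8bfdfd488000 RSI: da6751925f4100e8 RDI: 0000000000000000
CR2: 00000000000002a8 CR3: 0000001fc0b6a000 CR4: 0000000000340ef0
"""


def parse_oops(log: str) -> dict:
    """Pull CR2, RDI and the instruction bytes at the faulting RIP out of an oops."""
    code = re.search(r"Code: (.*)", log)
    cr2 = re.search(r"CR2: ([0-9a-f]{16})", log)
    rdi = re.search(r"RDI: ([0-9a-f]{16})", log)
    if not (code and cr2 and rdi):
        raise ValueError("not a recognisable x86-64 oops")
    tokens = code.group(1).split()
    # The kernel marks the byte at the faulting RIP with angle brackets.
    marked = [i for i, tok in enumerate(tokens) if tok.startswith("<")]
    if not marked:
        raise ValueError("no <..> marker in the Code: line")
    insn = bytes(int(tok.strip("<>"), 16) for tok in tokens[marked[0]:])
    return {"cr2": int(cr2.group(1), 16), "rdi": int(rdi.group(1), 16), "insn": insn}


def decode_mov_rdi_disp32(insn: bytes):
    """Return the displacement if insn starts with `mov disp32(%rdi),%rdi`, else None."""
    # 48 = REX.W, 8b = MOV r64, r/m64, bf = ModRM (mod=10, reg=rdi, rm=rdi)
    if len(insn) >= 7 and insn[:3] == bytes([0x48, 0x8B, 0xBF]):
        return int.from_bytes(insn[3:7], "little")
    return None


info = parse_oops(SAMPLE)
disp = decode_mov_rdi_disp32(info["insn"])
# A NULL pointer in RDI plus the displacement should equal the faulting address in CR2.
null_arg = disp is not None and info["rdi"] == 0 and info["rdi"] + disp == info["cr2"]
print(hex(disp), null_arg)
```

If the check holds, the fault address is simply a field offset from a NULL pointer, which would point at the caller in the call trace (isert_conn_terminate) passing a connection ID that is not set at that point in the failed login path. This is a reading of the register dump only, not a confirmed root cause.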

Hello Piotr,

Thank you for posting your inquiry on the NVIDIA Networking Community.

Based on the information provided, the f/w you are running is not in sync with the driver version you are running. See the following link for the f/w versions supported/tested with this driver version → https://docs.mellanox.com/display/OFEDv490170/General+Support+in+MLNX_OFED#GeneralSupportinMLNX_OFED-SupportedNICsFirmwareVersions

If you are still experiencing issues after upgrading the f/w to the supported/tested or latest GA f/w version, please do not hesitate to open an NVIDIA Networking Support ticket (a valid support contract is needed) by sending an email to support@mellanox.com
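
To compare the installed f/w against the support matrix on several hosts, a small sketch along the following lines may help. It assumes the usual `firmware-version:` line printed by `ethtool -i <interface>`. The sample text, the PSID placeholder, and the REQUIRED value are illustrative only; the real minimum version has to be taken from the page linked above.

```python
import re

# Illustrative `ethtool -i` output; the PSID in parentheses is a placeholder.
SAMPLE = """\
driver: mlx5_core
version: 4.9-0.1.7
firmware-version: 14.25.1020 (MT_0000000000)
"""

# Placeholder only: replace with the version listed in the support matrix.
REQUIRED = (14, 99, 0)


def firmware_version(ethtool_i_output: str) -> tuple:
    """Return the firmware version from `ethtool -i` output as a tuple of ints."""
    m = re.search(r"^firmware-version:\s*(\d+)\.(\d+)\.(\d+)",
                  ethtool_i_output, re.MULTILINE)
    if m is None:
        raise ValueError("no firmware-version line found")
    return tuple(int(part) for part in m.groups())


def is_at_least(installed: tuple, required: tuple) -> bool:
    """Tuples compare element by element, so 14.25.1020 < 14.99.0."""
    return installed >= required


installed = firmware_version(SAMPLE)
print(installed, "OK" if is_at_least(installed, REQUIRED) else "needs upgrade")
```

Comparing tuples of integers rather than strings avoids mistakes such as "14.9" sorting after "14.25".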

Thank you and regards,

~NVIDIA Networking Technical Support