Hi, I'm emulating a machine vision camera that transfers images over RoCE using a ConnectX-4.
For every image, the sender creates memory regions, and after the transfer has finished, the memory regions are destroyed.
After a while, at about 55000 frames (three transfers per frame), I get the following kernel error.
When this error appears, the ConnectX-4 can no longer be used. The system needs to be restarted.
Is there a limit (in the hardware, firmware, or driver) on creating and destroying memory regions?
Are there any ideas on how to avoid running into this failure state?
Thanks in advance!
Host PC: Ubuntu 22.04.3 LTS
Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Nov 02 11:03:28 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:28 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:28 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:28 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:28 AP0180 kernel: ------------[ cut here ]------------
Nov 02 11:03:28 AP0180 kernel: irq 145 handler irq_int_handler+0x0/0x30 [mlx5_core] enabled interrupts
Nov 02 11:03:28 AP0180 kernel: WARNING: CPU: 6 PID: 0 at kernel/irq/handle.c:161 __handle_irq_event_percpu+0x174/0x1b0
Nov 02 11:03:28 AP0180 kernel: Modules linked in: cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic binfmt_misc snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel rpcrdma snd_intel_dspcfg i915 snd_intel_sdw_acpi sunrpc snd_hda_codec intel_rapl_msr intel_rapl_common snd_hda_core rdma_ucm intel_tcc_cooling x86_pkg_temp_thermal snd_hwdep ib_iser intel_powerclamp snd_pcm coretemp libiscsi drm_buddy ttm snd_seq_midi scsi_transport_iscsi kvm_intel ib_umad snd_seq_midi_event rdma_cm nls_iso8859_1 mei_hdcp mei_pxp ib_ipoib drm_display_helper snd_rawmidi kvm iw_cm ib_cm cec irqbypass rc_core dell_wmi dell_smm_hwmon snd_seq crct10dif_pclmul polyval_clmulni drm_kms_helper
Nov 02 11:03:28 AP0180 kernel: snd_seq_device polyval_generic snd_timer ghash_clmulni_intel joydev dell_smbios i2c_algo_bit sha512_ssse3 aesni_intel cmdlinepart crypto_simd snd cryptd spi_nor dcdbas syscopyarea ftdi_sio mei_me sysfillrect rapl uio_netx dell_wmi_sysman ledtrig_audio dell_wmi_aio intel_cstate input_leds usbserial sysimgblt dell_wmi_descriptor intel_wmi_thunderbolt sparse_keymap firmware_attributes_class wmi_bmof mtd uio soundcore mei ee1004 intel_pch_thermal mac_hid acpi_pad sch_fq_codel msr parport_pc ppdev lp parport drm efi_pstore ip_tables x_tables autofs4 hid_logitech_hidpp mlx5_ib ib_uverbs ib_core hid_logitech_dj hid_generic usbhid hid uas usb_storage mlx5_core crc32_pclmul mlxfw i2c_i801 e1000e spi_intel_pci psample spi_intel i2c_smbus intel_lpss_pci tls ahci intel_lpss xhci_pci libahci idma64 pci_hyperv_intf xhci_pci_renesas video wmi pinctrl_cannonlake
Nov 02 11:03:28 AP0180 kernel: CPU: 6 PID: 0 Comm: swapper/6 Not tainted 6.2.0-36-generic #37~22.04.1-Ubuntu
Nov 02 11:03:28 AP0180 kernel: Hardware name: Dell Inc. Precision 3630 Tower/0NNNCT, BIOS 2.15.0 07/04/2022
Nov 02 11:03:28 AP0180 kernel: RIP: 0010:__handle_irq_event_percpu+0x174/0x1b0
Nov 02 11:03:28 AP0180 kernel: Code: 05 9b e9 41 02 3c 01 0f 87 08 15 ea 00 a8 01 75 1b 48 8b 13 44 89 f6 48 c7 c7 a8 66 d3 b6 c6 05 7b e9 41 02 01 e8 0c 4f f5 ff <0f> 0b fa 0f 1f 44 00 00 e9 e2 fe ff ff f0 48 0f ba 6b 40 01 0f 82
Nov 02 11:03:28 AP0180 kernel: RSP: 0018:ffff9a2bc02a0f38 EFLAGS: 00010246
Nov 02 11:03:28 AP0180 kernel: RAX: 0000000000000000 RBX: ffff88eac4918280 RCX: 0000000000000000
Nov 02 11:03:28 AP0180 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Nov 02 11:03:28 AP0180 kernel: RBP: ffff9a2bc02a0f68 R08: 0000000000000000 R09: 0000000000000000
Nov 02 11:03:28 AP0180 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Nov 02 11:03:28 AP0180 kernel: R13: 0000000000000000 R14: 0000000000000091 R15: ffff88ead2ba4e00
Nov 02 11:03:28 AP0180 kernel: FS: 0000000000000000(0000) GS:ffff88f21c380000(0000) knlGS:0000000000000000
Nov 02 11:03:28 AP0180 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 02 11:03:28 AP0180 kernel: CR2: 000000c00041d010 CR3: 0000000502810001 CR4: 00000000003706e0
Nov 02 11:03:28 AP0180 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 02 11:03:28 AP0180 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 02 11:03:28 AP0180 kernel: Call Trace:
Nov 02 11:03:28 AP0180 kernel: <IRQ>
Nov 02 11:03:28 AP0180 kernel: ? show_regs+0x72/0x90
Nov 02 11:03:28 AP0180 kernel: ? __handle_irq_event_percpu+0x174/0x1b0
Nov 02 11:03:28 AP0180 kernel: ? __warn+0x8d/0x160
Nov 02 11:03:28 AP0180 kernel: ? __handle_irq_event_percpu+0x174/0x1b0
Nov 02 11:03:28 AP0180 kernel: ? report_bug+0x1bb/0x1d0
Nov 02 11:03:28 AP0180 kernel: ? handle_bug+0x46/0x90
Nov 02 11:03:28 AP0180 kernel: ? exc_invalid_op+0x19/0x80
Nov 02 11:03:28 AP0180 kernel: ? asm_exc_invalid_op+0x1b/0x20
Nov 02 11:03:28 AP0180 kernel: ? __handle_irq_event_percpu+0x174/0x1b0
Nov 02 11:03:28 AP0180 kernel: ? __handle_irq_event_percpu+0x174/0x1b0
Nov 02 11:03:28 AP0180 kernel: handle_irq_event+0x39/0x80
Nov 02 11:03:28 AP0180 kernel: handle_edge_irq+0x8c/0x250
Nov 02 11:03:28 AP0180 kernel: __common_interrupt+0x4f/0x110
Nov 02 11:03:28 AP0180 kernel: common_interrupt+0x9f/0xb0
Nov 02 11:03:28 AP0180 kernel: </IRQ>
Nov 02 11:03:28 AP0180 kernel: <TASK>
Nov 02 11:03:28 AP0180 kernel: asm_common_interrupt+0x27/0x40
Nov 02 11:03:28 AP0180 kernel: RIP: 0010:cpuidle_enter_state+0xde/0x6f0
Nov 02 11:03:28 AP0180 kernel: Code: 4f f1 49 e8 94 1a 45 ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 92 f8 43 ff 80 7d d0 00 0f 85 e8 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 0f 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c4 04 00 00
Nov 02 11:03:28 AP0180 kernel: RSP: 0018:ffff9a2bc0137e28 EFLAGS: 00000246
Nov 02 11:03:28 AP0180 kernel: RAX: 0000000000000000 RBX: ffffba2bbfb80100 RCX: 0000000000000000
Nov 02 11:03:28 AP0180 kernel: RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000000
Nov 02 11:03:28 AP0180 kernel: RBP: ffff9a2bc0137e78 R08: 0000000000000000 R09: 0000000000000000
Nov 02 11:03:28 AP0180 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffb78c2f80
Nov 02 11:03:28 AP0180 kernel: R13: 0000000000000003 R14: 0000000000000003 R15: 000001dbaac76ab4
Nov 02 11:03:28 AP0180 kernel: ? cpuidle_enter_state+0xce/0x6f0
Nov 02 11:03:28 AP0180 kernel: cpuidle_enter+0x2e/0x50
Nov 02 11:03:28 AP0180 kernel: cpuidle_idle_call+0x14f/0x1e0
Nov 02 11:03:28 AP0180 kernel: do_idle+0x82/0x110
Nov 02 11:03:28 AP0180 kernel: cpu_startup_entry+0x20/0x30
Nov 02 11:03:28 AP0180 kernel: start_secondary+0x138/0x170
Nov 02 11:03:28 AP0180 kernel: secondary_startup_64_no_verify+0xe5/0xeb
Nov 02 11:03:28 AP0180 kernel: </TASK>
Nov 02 11:03:28 AP0180 kernel: ---[ end trace 0000000000000000 ]---
Nov 02 11:03:29 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:29 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:30 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:30 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:31 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:31 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:33 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:33 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:34 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:34 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:35 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 2624): async reg mr failed. status -121
Nov 02 11:03:35 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:36 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:36 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:37 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:37 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:38 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:38 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:38 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:38 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:38 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:38 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:38 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_cmd_out_err: 21 callbacks suppressed
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:39 AP0180 kernel: mlx5_core 0000:01:00.0: mlx5_cmd_out_err:779:(pid 2624): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4), err(-12)
Nov 02 11:03:40 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 2624): async reg mr failed. status -121
Nov 02 11:03:41 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 0): async reg mr failed. status -121
Nov 02 11:03:42 AP0180 kernel: infiniband rocep1s0: create_mkey_warn:137:(pid 423): async reg mr failed. status -121
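
To rule out a leak on the application side, the create/destroy pattern described above can be checked with a simple ledger. The sketch below is only a simulation of that pattern, not the actual sender code: `MrLedger`, `simulate`, and the `leak_every` parameter are hypothetical names made up for illustration, and the comments assume the registrations go through libibverbs (`ibv_reg_mr` / `ibv_dereg_mr`), which may not match the real application.

```python
from dataclasses import dataclass, field


@dataclass
class MrLedger:
    """Tracks memory-region handles that were registered but not yet released."""
    live: set = field(default_factory=set)
    total_registered: int = 0
    peak_live: int = 0

    def register(self, handle: int) -> None:
        # In a real sender: call right after a successful registration
        # (assumed here to be ibv_reg_mr returning a non-NULL pointer).
        self.live.add(handle)
        self.total_registered += 1
        self.peak_live = max(self.peak_live, len(self.live))

    def deregister(self, handle: int) -> None:
        # In a real sender: call only after the release reported success
        # (assumed here to be ibv_dereg_mr returning 0).
        self.live.remove(handle)


def simulate(frames: int, transfers_per_frame: int, leak_every: int = 0) -> MrLedger:
    """Replay the per-frame pattern; leak_every=N skips one release every N frames."""
    ledger = MrLedger()
    next_handle = 0
    for frame in range(frames):
        handles = []
        for _ in range(transfers_per_frame):
            ledger.register(next_handle)
            handles.append(next_handle)
            next_handle += 1
        for i, handle in enumerate(handles):
            leaked = leak_every and frame % leak_every == 0 and i == 0
            if not leaked:
                ledger.deregister(handle)
    return ledger


# Numbers from the report: about 55000 frames, three transfers per frame.
balanced = simulate(frames=55_000, transfers_per_frame=3)
leaky = simulate(frames=55_000, transfers_per_frame=3, leak_every=1)

print("balanced:", balanced.total_registered, "registered,", len(balanced.live), "still live")
print("leaky:   ", leaky.total_registered, "registered,", len(leaky.live), "still live")
```

With balanced calls the number of live regions stays small no matter how many frames are sent, so a steadily growing count in the real sender would point to releases that are skipped or that fail. If the count stays flat and the error still appears, the cause is more likely below the application.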