MLNX_OFED_LINUX-5.7-1: mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed

Hi,

We are seeing errors when running a large-scale MPI application. The application hangs, and the dmesg log shows:
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:29:32 2022] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[Sun Oct 30 14:29:37 2022] smp: csd: Detected non-responsive CSD lock (#1) on CPU#64, waiting 5000000011 ns for CPU#13 do_kernel_range_flush+0x0/0x52(0xff136f217f92fb40).
[Sun Oct 30 14:29:37 2022] rcu: 13-…!: (1 GPs behind) idle=252/0/0x1 softirq=40883/40892 fqs=1016
[Sun Oct 30 14:29:37 2022] (detected by 13, t=5011 jiffies, g=124653, q=4182926)
[Sun Oct 30 14:29:37 2022] smp: csd: CSD lock (#1) unresponsive.
[Sun Oct 30 14:29:37 2022] NMI backtrace for cpu 13
[Sun Oct 30 14:29:37 2022] CPU: 13 PID: 0 Comm: swapper/13 Kdump: loaded Tainted: G OE 5.4.17-2136.310.7.el7uek.x86_64 #2
[Sun Oct 30 14:29:37 2022] Hardware name: Oracle Corporation ORACLE SERVER X9-2c/TLA,MB TRAY,X9-2c, BIOS 66040600 07/23/2021
[Sun Oct 30 14:29:37 2022] Call Trace:
[Sun Oct 30 14:29:37 2022]
[Sun Oct 30 14:29:37 2022] smp: csd: Re-sending CSD lock (#1) IPI from CPU#64 to CPU#13
[Sun Oct 30 14:29:37 2022] dump_stack+0x6d/0x8d
[Sun Oct 30 14:29:37 2022] nmi_cpu_backtrace+0x9f/0xa1
[Sun Oct 30 14:29:41 2022] ? lapic_can_unplug_cpu+0xb0/0xa9
[Sun Oct 30 14:29:41 2022] nmi_trigger_cpumask_backtrace+0x80/0x13d
[Sun Oct 30 14:29:41 2022] arch_trigger_cpumask_backtrace+0x19/0x1f
[Sun Oct 30 14:29:41 2022] rcu_dump_cpu_stacks+0x9a/0xce
[Sun Oct 30 14:29:41 2022] rcu_sched_clock_irq+0x815/0x83a
[Sun Oct 30 14:29:41 2022] ? tick_sched_do_timer+0x70/0x6b
[Sun Oct 30 14:29:41 2022] update_process_times+0x28/0x4c
[Sun Oct 30 14:29:41 2022] tick_sched_handle+0x2c/0x62
[Sun Oct 30 14:29:41 2022] tick_sched_timer+0x3c/0x72
[Sun Oct 30 14:29:41 2022] __hrtimer_run_queues+0x106/0x272
[Sun Oct 30 14:29:41 2022] hrtimer_interrupt+0x116/0x244
[Sun Oct 30 14:29:41 2022] smp_apic_timer_interrupt+0x6f/0x13f
[Sun Oct 30 14:29:41 2022] apic_timer_interrupt+0xf/0x14
[Sun Oct 30 14:29:41 2022]
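
For anyone triaging a similar log, here is a rough Python sketch that tallies the "Create QP type N failed" messages per device and PID. It assumes the number printed by the driver is the kernel's ib_qp_type value, in which case type 2 would be RC (reliable connection). The names count_qp_failures, QP_TYPE_NAMES and LINE_RE are made up for this illustration, and the sample text is copied from the log above.

```python
import re
from collections import Counter

# Assumption: the logged number is the kernel's enum ib_qp_type value.
QP_TYPE_NAMES = {0: "SMI", 1: "GSI", 2: "RC", 3: "UC", 4: "UD"}

# Matches lines such as:
# [...] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
LINE_RE = re.compile(
    r"infiniband (?P<dev>\S+): create_qp:\d+:\(pid (?P<pid>\d+)\): "
    r"Create QP type (?P<type>\d+) failed"
)


def count_qp_failures(dmesg_text):
    """Return a Counter keyed by (device, pid, qp_type_name)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a QP creation failure line
        qp_type = int(m.group("type"))
        name = QP_TYPE_NAMES.get(qp_type, f"type {qp_type}")
        counts[(m.group("dev"), int(m.group("pid")), name)] += 1
    return counts


sample = """\
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
"""

for key, n in sorted(count_qp_failures(sample).items()):
    print(key, n)
```

Feeding it the full output of dmesg -T instead of the sample shows which processes and devices are hitting the failure, and how often.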

The site is configured with:

  1. OFI (libfabric) version: 1.13.2
  2. MLNX_OFED: 5.7-1.0.2.0

The output of ibv_devinfo -v is given below:

hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 16.29.1436
node_guid: 043f:7203:00e2:f322
sys_image_guid: 043f:7203:00e2:f322
vendor_id: 0x02c9
vendor_part_id: 4121
hw_ver: 0x0
board_id: ORC0000000003
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 131072
max_qp_wr: 32768
device_cap_flags: 0xed721c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
RAW_IP_CSUM
MANAGED_FLOW_STEERING
Unknown flags: 0xC8400000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 2097152
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 78125kHZ
raw packet caps:
C-VLAN stripping offload
Scatter FCS offload
IP csum offload
Delay drop
device_cap_flags_ex: 0x30000055ED721C36
RAW_SCATTER_FCS
PCI_WRITE_END_PADDING
Unknown flags: 0x3000004100000000
tso_caps:
max_tso: 262144
supported_qp:
SUPPORT_RAW_PACKET
rss_caps:
max_rwq_indirection_tables: 16777216
max_rwq_indirection_table_size: 256
rx_hash_function: 0x1
rx_hash_fields_mask: 0x800000FF
supported_qp:
SUPPORT_RAW_PACKET
max_wq_type_rq: 8388608
packet_pacing_caps:
qp_rate_limit_min: 1kbps
qp_rate_limit_max: 100000000kbps
supported_qp:
SUPPORT_RAW_PACKET
tag matching not supported

    cq moderation caps:
            max_cq_count:   65535
            max_cq_period:  4095 us

    maximum available device memory:        131072Bytes

    num_comp_vectors:               63
            port:   1
                    state:                  PORT_ACTIVE (4)
                    max_mtu:                4096 (5)
                    active_mtu:             4096 (5)
                    sm_lid:                 0
                    port_lid:               0
                    port_lmc:               0x00
                    link_layer:             Ethernet
                    max_msg_sz:             0x40000000
                    port_cap_flags:         0x04010000
                    port_cap_flags2:        0x0000
                    max_vl_num:             invalid value (0)
                    bad_pkey_cntr:          0x0
                    qkey_viol_cntr:         0x0
                    sm_sl:                  0
                    pkey_tbl_len:           1
                    gid_tbl_len:            256
                    subnet_timeout:         0
                    init_type_reply:        0
                    active_width:           4X (2)
                    active_speed:           25.0 Gbps (32)
                    phys_state:             LINK_UP (5)
                    GID[  0]:               fe80:0000:0000:0000:063f:72ff:fee2:f322, RoCE v1
                    GID[  1]:               fe80::63f:72ff:fee2:f322, RoCE v2
                    GID[  2]:               0000:0000:0000:0000:0000:ffff:c0a8:a801, RoCE v1
                    GID[  3]:               ::ffff:192.168.168.1, RoCE v2

Can anyone help with this?

Hi,

Please note that several QP bugs in older firmware versions were fixed in newer versions.
I therefore recommend contacting Oracle Support (since it is an OEM card) so they can provide you with
firmware 16.32.XXXX or newer.

Thanks,
Samer
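
To check several nodes against the recommended firmware level, something like the following Python sketch could be used. It pulls fw_ver out of ibv_devinfo output and compares the major.minor part with 16.32. The helper names parse_fw_ver and needs_update are hypothetical, and the 16.32 threshold is simply the version named in the reply above, not a verified list of fixed releases.

```python
import re


def parse_fw_ver(devinfo_text):
    """Extract fw_ver from ibv_devinfo output as a tuple of ints, or None."""
    m = re.search(
        r"^\s*fw_ver:\s*(\d+)\.(\d+)\.(\d+)\s*$", devinfo_text, re.MULTILINE
    )
    return tuple(int(part) for part in m.groups()) if m else None


def needs_update(fw_ver, minimum=(16, 32)):
    """True if the (major, minor) part of fw_ver is below the minimum."""
    return fw_ver[:2] < minimum


# Sample lines copied from the ibv_devinfo output posted in this thread.
sample = """\
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 16.29.1436
"""

fw = parse_fw_ver(sample)
print(fw, needs_update(fw))
```

For the device in this thread (fw_ver 16.29.1436) the check reports that an update is needed.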