Hi,
We are experiencing errors when trying to run a large-scale MPI application. The application hangs, and in the dmesg log we can find:
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19773): Create QP type 2 failed
[Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
[Sun Oct 30 14:29:32 2022] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[Sun Oct 30 14:29:37 2022] smp: csd: Detected non-responsive CSD lock (#1) on CPU#64, waiting 5000000011 ns for CPU#13 do_kernel_range_flush+0x0/0x52(0xff136f217f92fb40).
[Sun Oct 30 14:29:37 2022] rcu: 13-…!: (1 GPs behind) idle=252/0/0x1 softirq=40883/40892 fqs=1016
[Sun Oct 30 14:29:37 2022] (detected by 13, t=5011 jiffies, g=124653, q=4182926)
[Sun Oct 30 14:29:37 2022] smp: csd: CSD lock (#1) unresponsive.
[Sun Oct 30 14:29:37 2022] NMI backtrace for cpu 13
[Sun Oct 30 14:29:37 2022] CPU: 13 PID: 0 Comm: swapper/13 Kdump: loaded Tainted: G OE 5.4.17-2136.310.7.el7uek.x86_64 #2
[Sun Oct 30 14:29:37 2022] Hardware name: Oracle Corporation ORACLE SERVER X9-2c/TLA,MB TRAY,X9-2c, BIOS 66040600 07/23/2021
[Sun Oct 30 14:29:37 2022] Call Trace:
[Sun Oct 30 14:29:37 2022]
[Sun Oct 30 14:29:37 2022] smp: csd: Re-sending CSD lock (#1) IPI from CPU#64 to CPU#13
[Sun Oct 30 14:29:37 2022] dump_stack+0x6d/0x8d
[Sun Oct 30 14:29:37 2022] nmi_cpu_backtrace+0x9f/0xa1
[Sun Oct 30 14:29:41 2022] ? lapic_can_unplug_cpu+0xb0/0xa9
[Sun Oct 30 14:29:41 2022] nmi_trigger_cpumask_backtrace+0x80/0x13d
[Sun Oct 30 14:29:41 2022] arch_trigger_cpumask_backtrace+0x19/0x1f
[Sun Oct 30 14:29:41 2022] rcu_dump_cpu_stacks+0x9a/0xce
[Sun Oct 30 14:29:41 2022] rcu_sched_clock_irq+0x815/0x83a
[Sun Oct 30 14:29:41 2022] ? tick_sched_do_timer+0x70/0x6b
[Sun Oct 30 14:29:41 2022] update_process_times+0x28/0x4c
[Sun Oct 30 14:29:41 2022] tick_sched_handle+0x2c/0x62
[Sun Oct 30 14:29:41 2022] tick_sched_timer+0x3c/0x72
[Sun Oct 30 14:29:41 2022] __hrtimer_run_queues+0x106/0x272
[Sun Oct 30 14:29:41 2022] hrtimer_interrupt+0x116/0x244
[Sun Oct 30 14:29:41 2022] smp_apic_timer_interrupt+0x6f/0x13f
[Sun Oct 30 14:29:41 2022] apic_timer_interrupt+0xf/0x14
[Sun Oct 30 14:29:41 2022]
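In case it helps with triage, here is a minimal sketch for tallying the "Create QP type N failed" messages per device and pid from saved dmesg output. The function name count_qp_failures is hypothetical, and the regular expression assumes the messages have exactly the format shown in the log above.

```python
import re
from collections import Counter

# Matches lines such as:
# [Sun Oct 30 14:28:32 2022] infiniband mlx5_2: create_qp:3206:(pid 19774): Create QP type 2 failed
# The format is assumed from the log above; other kernel versions may differ.
QP_FAIL = re.compile(
    r"infiniband (?P<dev>\S+): create_qp:\d+:\(pid (?P<pid>\d+)\): "
    r"Create QP type (?P<qp_type>\d+) failed"
)


def count_qp_failures(dmesg_text):
    """Return a Counter keyed by (device, pid, qp_type) for failed QP creations."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = QP_FAIL.search(line)
        if match:
            key = (
                match.group("dev"),
                int(match.group("pid")),
                int(match.group("qp_type")),
            )
            counts[key] += 1
    return counts
```

It can be fed the text of `dmesg -T` saved to a file, for example `count_qp_failures(open("dmesg.txt").read())`, where dmesg.txt is a placeholder file name.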
The site is configured with:
- OFI (libfabric) version: 1.13.2
- MLNX_OFED: 5.7-1.0.2.0
The output of ibv_devinfo -v is given below:
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 16.29.1436
node_guid: 043f:7203:00e2:f322
sys_image_guid: 043f:7203:00e2:f322
vendor_id: 0x02c9
vendor_part_id: 4121
hw_ver: 0x0
board_id: ORC0000000003
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 131072
max_qp_wr: 32768
device_cap_flags: 0xed721c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
RAW_IP_CSUM
MANAGED_FLOW_STEERING
Unknown flags: 0xC8400000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 2097152
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 78125kHZ
raw packet caps:
C-VLAN stripping offload
Scatter FCS offload
IP csum offload
Delay drop
device_cap_flags_ex: 0x30000055ED721C36
RAW_SCATTER_FCS
PCI_WRITE_END_PADDING
Unknown flags: 0x3000004100000000
tso_caps:
max_tso: 262144
supported_qp:
SUPPORT_RAW_PACKET
rss_caps:
max_rwq_indirection_tables: 16777216
max_rwq_indirection_table_size: 256
rx_hash_function: 0x1
rx_hash_fields_mask: 0x800000FF
supported_qp:
SUPPORT_RAW_PACKET
max_wq_type_rq: 8388608
packet_pacing_caps:
qp_rate_limit_min: 1kbps
qp_rate_limit_max: 100000000kbps
supported_qp:
SUPPORT_RAW_PACKET
tag matching not supported
cq moderation caps:
max_cq_count: 65535
max_cq_period: 4095 us
maximum available device memory: 131072Bytes
num_comp_vectors: 63
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
max_msg_sz: 0x40000000
port_cap_flags: 0x04010000
port_cap_flags2: 0x0000
max_vl_num: invalid value (0)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 1
gid_tbl_len: 256
subnet_timeout: 0
init_type_reply: 0
active_width: 4X (2)
active_speed: 25.0 Gbps (32)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:063f:72ff:fee2:f322, RoCE v1
GID[ 1]: fe80::63f:72ff:fee2:f322, RoCE v2
GID[ 2]: 0000:0000:0000:0000:0000:ffff:c0a8:a801, RoCE v1
GID[ 3]: ::ffff:192.168.168.1, RoCE v2
Can anyone help with this?