Hi,
I’m new to MLNX_OFED.
I’m running a test on a single node with 4050 clients and one server. Only 4026 clients set up successfully; the remaining 24 all report one of these errors: ibv_create_cq: Invalid argument (22) or ibv_create_qp: Cannot allocate memory (12).
I am using libfabric with the RXM endpoint (RC type). Why can’t I create the queues? The current QP and CQ counts are both well below the device limits, and the machine has plenty of memory: only 150g is used and more than 350g is available.
I also ran ib_send_lat after the clients were up, and it told me to reduce the QP size by decreasing the tx size or the inline size. My tx size is 1, so I set FI_VERBS_INLINE_SIZE to 8, but that did not help: the inline size was still 236 when I ran ib_send_lat again.
I also tried strace and found that ioctl(fd, RDMA_VERBS_IOCTL, …) returns Cannot allocate memory (12) during the ibv_create_qp call. Which limit in the driver could I be hitting?
Thanks in advance for your precious time!
“ulimit -a”:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2060178
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) unlimited
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
“free -g” after all my clients were up:
              total        used        free      shared  buff/cache   available
Mem:            503          92         409           0           0         408
Swap:             0           0           0
“rdma res” after all my clients were up:
0: mlx5_0: pd 2 cq 3 qp 1 cm_id 0 mr 0 ctx 4027 srq 2
1: mlx5_1: pd 2 cq 3 qp 1 cm_id 0 mr 0 ctx 4027 srq 2
2: mlx5_2: pd 8056 cq 4030 qp 8053 cm_id 12079 mr 16159 ctx 4027 srq 2
3: mlx5_3: pd 2 cq 3 qp 1 cm_id 0 mr 0 ctx 4027 srq 2
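To back up my statement that the counts are below the limits, here is a rough Python sketch that compares the mlx5_2 line from “rdma res” with the maximums from the “ibv_devinfo” output further down. The helper names (parse_rdma_res, usage_report) are my own for illustration and are not part of any rdma tool; the numbers are copied from this post.

```python
# Sample `rdma res` line copied from the output in this post.
RDMA_RES_LINE = "2: mlx5_2: pd 8056 cq 4030 qp 8053 cm_id 12079 mr 16159 ctx 4027 srq 2"

# Limits copied from `ibv_devinfo -d mlx5_2 -v` in this post
# (max_pd, max_cq, max_qp, max_mr, max_srq).
DEVICE_LIMITS = {
    "pd": 16777216,
    "cq": 16777216,
    "qp": 131072,
    "mr": 16777216,
    "srq": 8388608,
}


def parse_rdma_res(line):
    """Return (device_name, {resource: count}) for one `rdma res` summary line."""
    _, dev, rest = (part.strip() for part in line.split(":", 2))
    tokens = rest.split()
    # Tokens alternate between resource name and count: "pd 8056 cq 4030 ..."
    counts = {name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])}
    return dev, counts


def usage_report(counts, limits):
    """Map each resource that has a known limit to (used, limit, fraction used)."""
    return {
        name: (counts[name], limit, counts[name] / limit)
        for name, limit in limits.items()
        if name in counts
    }


if __name__ == "__main__":
    dev, counts = parse_rdma_res(RDMA_RES_LINE)
    for name, (used, limit, frac) in usage_report(counts, DEVICE_LIMITS).items():
        print(f"{dev} {name}: {used}/{limit} ({frac:.4%})")
```

By this comparison the QP count is the closest to its maximum and is still only about 6% of max_qp, so the advertised per-device maximums do not seem to be what I am hitting.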
“rdma_client” after all my clients were up:
client: rdma_client -s 192.168.168.1
rdma_client: start
rdma_create_ep: Cannot allocate memory
rdma_client: end -1
“ibstatus mlx5_2”:
Infiniband device 'mlx5_2' port 1 status:
default gid: fe80:0000:0000:0000:bace:f6ff:fe0b:3e94
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: Ethernet
“mlxfwmanager -d mlx5_2”:
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX5
Part Number: 7359059_MCX556A-EDAS_C14_OCI_Ax_Bx
Description: ConnectX-5 Ex VPI adapter card; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0 x16; tall bracket; ROHS R6
PSID: ORC0000000003
PCI Device Name: mlx5_2
Base MAC: b8cef60b3e94
Versions:         Current        Available
   FW             16.29.1436     N/A
   UEFI           14.22.0016     N/A
Status: No matching image found
“ibv_devinfo -d mlx5_2 -v”:
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 16.29.1436
node_guid: 043f:7203:00e2:f322
sys_image_guid: 043f:7203:00e2:f322
vendor_id: 0x02c9
vendor_part_id: 4121
hw_ver: 0x0
board_id: ORC0000000003
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 131072
max_qp_wr: 32768
device_cap_flags: 0xed721c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
RAW_IP_CSUM
MANAGED_FLOW_STEERING
Unknown flags: 0xC8400000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 2097152
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
uc_odp_caps: NO SUPPORT
ud_odp_caps: SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 78125kHZ
raw packet caps:
C-VLAN stripping offload
Scatter FCS offload
IP csum offload
Delay drop
device_cap_flags_ex: 0x30000055ED721C36
RAW_SCATTER_FCS
PCI_WRITE_END_PADDING
Unknown flags: 0x3000004100000000
tso_caps:
max_tso: 262144
supported_qp: SUPPORT_RAW_PACKET
rss_caps:
max_rwq_indirection_tables: 16777216
max_rwq_indirection_table_size: 256
rx_hash_function: 0x1
rx_hash_fields_mask: 0x800000FF
supported_qp:
SUPPORT_RAW_PACKET
max_wq_type_rq: 8388608
packet_pacing_caps:
qp_rate_limit_min: 1kbps
qp_rate_limit_max: 100000000kbps
supported_qp:
SUPPORT_RAW_PACKET
tag matching not supported
cq moderation caps:
max_cq_count: 65535
max_cq_period: 4095 us
maximum available device memory: 131072Bytes
num_comp_vectors: 63
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
max_msg_sz: 0x40000000
port_cap_flags: 0x04010000
port_cap_flags2: 0x0000
max_vl_num: invalid value (0)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 1
gid_tbl_len: 256
subnet_timeout: 0
init_type_reply: 0
active_width: 4X (2)
active_speed: 25.0 Gbps (32)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:063f:72ff:fee2:f322, RoCE v1
GID[ 1]: fe80::63f:72ff:fee2:f322, RoCE v2
GID[ 2]: 0000:0000:0000:0000:0000:ffff:c0a8:a801, RoCE v1
GID[ 3]: ::ffff:192.168.168.1, RoCE v2