In our project we launch numerous clients and servers over verbs;ofi_rxm using OFI/libfabric.
Once we exceed a certain number of clients, our attempts to connect to the servers start failing, and the server's dmesg contains errors of this form:
[507263.354558] infiniband mlx5_0: create_qp:2947:(pid 101966): Create QP type 2 failed
Is there a way to determine which resource mlx5_0 is running out of? Are there any settings we could tweak, or additional debug info we could retrieve, to figure out the reason for this problem?
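
To help with triage, here is a minimal sketch that tallies these dmesg failures per device, PID and QP type. The regular expression is based only on the single log line quoted above, so other kernel or driver versions may format the message differently. The function name tally_qp_failures and the QP_TYPE_NAMES table are ours, for illustration only. The mapping of type 2 to RC follows the kernel's ib_qp_type enum as we understand it.

```python
import re
from collections import Counter

# Assumed mapping from the kernel's ib_qp_type enum (illustrative subset).
QP_TYPE_NAMES = {0: "SMI", 1: "GSI", 2: "RC", 3: "UC", 4: "UD"}

# Pattern derived from the one dmesg line in this report; treat it as an assumption.
LINE_RE = re.compile(
    r"infiniband (?P<dev>\S+): create_qp:\d+:\(pid (?P<pid>\d+)\): "
    r"Create QP type (?P<qpt>\d+) failed"
)


def tally_qp_failures(lines):
    """Count 'Create QP ... failed' messages per (device, pid, qp type)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a QP creation failure line
        qpt = int(match.group("qpt"))
        key = (
            match.group("dev"),
            int(match.group("pid")),
            QP_TYPE_NAMES.get(qpt, str(qpt)),  # fall back to the raw number
        )
        counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[507263.354558] infiniband mlx5_0: create_qp:2947:(pid 101966): "
        "Create QP type 2 failed",
    ]
    for (dev, pid, qp_type), n in tally_qp_failures(sample).items():
        print(dev, pid, qp_type, n)
```

Feeding it the full dmesg output instead of the sample list would show whether the failures are concentrated in a single server process or spread across several.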
OFI version: v1.12.0
Provider used: verbs;ofi_rxm
MOFED version: 5.1.2
System: Frontera@TACC