sminfo and ibhosts commands fail on the host nodes

We have a set of GPU servers in our cluster where commands like sminfo, ibhosts, and ibnetdiscover fail, while commands like ibdev2netdev, ibdiagnet, ibstat, and ibstatus work well on the same servers. All nodes in our cluster run the same hardware / firmware / software stack. We see these MAD RPC errors even after a reboot.

ibhosts

ibwarn: [16922] _do_madrpc: send failed; Invalid argument
ibwarn: [16922] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)
/var/tmp/rdma-core/rdma-core-2304mlnx44/libibnetdisc/ibnetdisc.c:811; Failed to resolve self
/usr/sbin/ibnetdiscover: iberror: failed: discover failed

sminfo

ibwarn: [16970] _do_madrpc: send failed; Operation not permitted
ibwarn: [16970] mad_rpc: _do_madrpc failed; dport (Lid 51)
sminfo: iberror: failed: query
root@ATL-GPU-Node-001:/dev/infiniband#

Tracing sminfo pointed to a write failure:

write(3</dev/infiniband/umad20>, "\0\0\0\0\0\0\0\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0"..., 320) = -1 EPERM (Operation not permitted)

However, this character device file has the same permissions as on the other nodes (where ibhosts runs well).
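Since the comparison above was about file permissions, it may help to capture more than the mode bits when comparing a failing node with a working one. The following is a minimal sketch in Python, not a diagnosis: it prints the file type, mode, owner, group, and major:minor numbers for each /dev/infiniband/umad* node so the output can be diffed across hosts. The helper name describe_node is hypothetical, and identical output on both nodes would only rule out the device node itself as the difference.

```python
import glob
import os
import stat


def describe_node(path):
    """Return a one-line summary of a file's type, mode, owner and device numbers."""
    st = os.stat(path)
    # Character devices (such as umad nodes) are reported as "chr".
    kind = "chr" if stat.S_ISCHR(st.st_mode) else "other"
    mode = stat.S_IMODE(st.st_mode)  # permission bits only
    return (
        f"{path} type={kind} mode={mode:04o} "
        f"uid={st.st_uid} gid={st.st_gid} "
        f"rdev={os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"
    )


if __name__ == "__main__":
    # Assumption: the umad device nodes live under /dev/infiniband, as in the trace above.
    for path in sorted(glob.glob("/dev/infiniband/umad*")):
        print(describe_node(path))
```

Running this on one failing and one working node and diffing the two outputs shows whether anything about the device nodes differs beyond what ls -l displays at a glance.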

Please let me know how to resolve this issue.

Hi

Based on your output, it seems opensm is not working.
Please check whether opensm is running and whether the cable to the opensm node is connected properly.
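One possible way to check link state from the host side is to read the per-port fields that ibstat reports, since ibstat works on the affected nodes. The sketch below is in Python and rests on assumptions: it expects the usual ibstat text layout (a "CA '<name>'" line, then "Port N:" blocks with "State", "Physical state", "SM lid", and "Link layer" fields), and the rule for flagging a port is an illustrative choice, not an official health check. The function names parse_ibstat and suspicious_ports are hypothetical.

```python
import re
import subprocess


def parse_ibstat(text):
    """Parse ibstat text output into a list of per-port dictionaries."""
    ports = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            ca = m.group(1)
            port = None  # fields until the next "Port N:" belong to the CA
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = {"ca": ca, "port": int(m.group(1))}
            ports.append(port)
            continue
        if port is not None and ": " in line:
            key, value = line.split(": ", 1)
            port[key] = value
    return ports


def suspicious_ports(ports):
    """Return InfiniBand ports that are not Active/LinkUp or report SM lid 0."""
    bad = []
    for p in ports:
        if p.get("Link layer") != "InfiniBand":
            continue  # assumption: only InfiniBand ports are of interest here
        if (p.get("State") != "Active"
                or p.get("Physical state") != "LinkUp"
                or p.get("SM lid") == "0"):
            bad.append(p)
    return bad


if __name__ == "__main__":
    out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
    for p in suspicious_ports(parse_ibstat(out)):
        print(f"{p['ca']} port {p['port']}: state={p.get('State')} "
              f"phys={p.get('Physical state')} sm_lid={p.get('SM lid')}")
```

If this prints nothing, every InfiniBand port is Active and LinkUp with a nonzero SM lid, which would suggest looking beyond cabling for the cause.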

/HyungKwang

Hi HyungKwang,

Thanks for the reply.

I compared the opensm configuration on the nodes where ibhosts and sminfo fail against the nodes where they succeed, and the opensm configuration is the same on both sets of nodes.

Even though the sminfo command fails, ibdiagnet shows the Subnet Manager details (master and standby).

Can you please advise which validation steps or commands can help confirm a cabling problem or another opensm-related issue?

Punith

Hi

Please go to the SM node, connect it by cable to another IB device, and then run '# sminfo'.
If sminfo returns a failure, you need to reinstall OFED and start from scratch.
Without information about your lab environment, it's hard to figure out exactly what your problem is.

Please open a support case to get Technical Support on it.

/HyungKwang