We have a set of GPU servers in our cluster on which commands such as sminfo, ibhosts, and ibnetdiscover fail, while commands such as ibdev2netdev, ibdiagnet, ibstat, and ibstatus work fine on the same servers. All nodes in our cluster run the same hardware / firmware / software stack. We see these MAD RPC errors even after a reboot.
ibhosts
ibwarn: [16922] _do_madrpc: send failed; Invalid argument
ibwarn: [16922] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)
/var/tmp/rdma-core/rdma-core-2304mlnx44/libibnetdisc/ibnetdisc.c:811; Failed to resolve self
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
sminfo
ibwarn: [16970] _do_madrpc: send failed; Operation not permitted
ibwarn: [16970] mad_rpc: _do_madrpc failed; dport (Lid 51)
sminfo: iberror: failed: query
root@ATL-GPU-Node-001:/dev/infiniband#
Tracing sminfo pointed to a write failure:
write(3</dev/infiniband/umad20>, "\0\0\0\0\0\0\0\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0"..., 320) = -1 EPERM (Operation not permitted)
However, this character device file has the same permissions as on the other nodes (where ibhosts runs fine).
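For anyone who wants to repeat the permission comparison between a failing node and a working one, here is a minimal sketch of how it could be done. It is only an illustration: the helper name describe_devices is hypothetical, and the default path /dev/infiniband is assumed from the trace above. It lists every entry in the directory with its mode string, owner, group, and device numbers, so the output from two nodes can be diffed.

```python
import os
import stat
import sys


def describe_devices(directory="/dev/infiniband"):
    """Return one record per directory entry: name, mode string, owner, device numbers.

    Hypothetical helper for illustration; the default path is an assumption
    taken from the trace output, not from any tool's documentation.
    """
    records = []
    for name in sorted(os.listdir(directory)):
        st = os.stat(os.path.join(directory, name))
        records.append({
            "name": name,
            "mode": stat.filemode(st.st_mode),  # e.g. a leading 'c' marks a character device
            "uid": st.st_uid,
            "gid": st.st_gid,
            "major": os.major(st.st_rdev),      # 0 for entries that are not device files
            "minor": os.minor(st.st_rdev),
        })
    return records


if __name__ == "__main__":
    # Optional first argument: directory to inspect.
    target = sys.argv[1] if len(sys.argv) > 1 else "/dev/infiniband"
    for rec in describe_devices(target):
        print("{name} {mode} uid={uid} gid={gid} dev={major}:{minor}".format(**rec))
```

Running it on both nodes and diffing the two outputs would only confirm or rule out a difference in file mode, ownership, or device numbers; it says nothing about checks made inside the kernel when the write is issued.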
How can we resolve this issue?