Sminfo, ibhosts commands fail on the host nodes

beepunithkumar · April 22, 2024, 2:53pm

We have a set of GPU servers in our cluster on which commands like sminfo, ibhosts, and ibnetdiscover fail, while commands like ibdev2netdev, ibdiagnet, ibstat, and ibstatus work fine on the same servers. All nodes in our cluster run the same hardware / firmware / software stack. We see these MAD RPC errors even after a reboot.

ibhosts

ibwarn: [16922] _do_madrpc: send failed; Invalid argument
ibwarn: [16922] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)
/var/tmp/rdma-core/rdma-core-2304mlnx44/libibnetdisc/ibnetdisc.c:811; Failed to resolve self
/usr/sbin/ibnetdiscover: iberror: failed: discover failed

sminfo

ibwarn: [16970] _do_madrpc: send failed; Operation not permitted
ibwarn: [16970] mad_rpc: _do_madrpc failed; dport (Lid 51)
sminfo: iberror: failed: query
root@ATL-GPU-Node-001:/dev/infiniband#
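The two commands above fail with different errors from the same `_do_madrpc` send path. When comparing many nodes, it can help to pull the error strings out of the captured output mechanically. The following is a minimal sketch that assumes the `ibwarn: [pid] function: message` line format shown above; the function names `parse_ibwarn` and `send_errors` are hypothetical helpers, not part of infiniband-diags.

```python
import re

# Matches lines such as:
#   ibwarn: [16970] _do_madrpc: send failed; Operation not permitted
# The format is taken from the output quoted in this thread (an assumption
# about other versions of the tools).
IBWARN_RE = re.compile(r"^ibwarn: \[(?P<pid>\d+)\] (?P<func>\w+): (?P<msg>.*)$")

SEND_FAILED_PREFIX = "send failed; "


def parse_ibwarn(text):
    """Return one dict (pid, func, msg) per ibwarn line found in text."""
    records = []
    for line in text.splitlines():
        match = IBWARN_RE.match(line.strip())
        if match:
            records.append({
                "pid": int(match.group("pid")),
                "func": match.group("func"),
                "msg": match.group("msg"),
            })
    return records


def send_errors(text):
    """Return the error strings reported by _do_madrpc send failures."""
    return [
        rec["msg"][len(SEND_FAILED_PREFIX):]
        for rec in parse_ibwarn(text)
        if rec["func"] == "_do_madrpc" and rec["msg"].startswith(SEND_FAILED_PREFIX)
    ]


if __name__ == "__main__":
    # Lines copied from the ibhosts and sminfo output above.
    sample = (
        "ibwarn: [16922] _do_madrpc: send failed; Invalid argument\n"
        "ibwarn: [16922] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)\n"
        "ibwarn: [16970] _do_madrpc: send failed; Operation not permitted\n"
        "ibwarn: [16970] mad_rpc: _do_madrpc failed; dport (Lid 51)\n"
    )
    print(send_errors(sample))
```

Run against the output above, this separates the "Invalid argument" failure (ibhosts, directed-route path) from the "Operation not permitted" failure (sminfo, LID-routed to Lid 51), which may be worth tracking as two symptoms rather than one.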

Tracing sminfo pointed to a write failure:
write(3</dev/infiniband/umad20>, "\0\0\0\0\0\0\0\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0"..., 320) = -1 EPERM (Operation not permitted)

However, this character device file has the same permissions as on the other nodes (which run ibhosts fine).

Please let me know how to resolve this issue.
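Since the trace shows the `write` on the already-opened `/dev/infiniband/umad20` failing, file mode alone may not tell the whole story. One way to compare a failing node with a working one is to record, for every umad node, its mode and owner plus the HCA and port it maps to. The sketch below assumes the `ib_umad` driver exposes `ibdev` and `port` attributes under `/sys/class/infiniband_mad/umadN`; the function name `describe_umad_devices` is hypothetical.

```python
import os
import stat


def _read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def describe_umad_devices(dev_dir="/dev/infiniband",
                          sysfs_dir="/sys/class/infiniband_mad"):
    """Return one dict per umad node: name, mode, uid, gid, ibdev, port.

    The sysfs layout (umadN/ibdev, umadN/port) is an assumption about the
    running kernel; missing attributes are reported as None.
    """
    rows = []
    if not os.path.isdir(dev_dir):
        return rows
    names = [n for n in os.listdir(dev_dir) if n.startswith("umad")]
    # Sort umad3 before umad20 (shorter names first, then lexicographic).
    for name in sorted(names, key=lambda n: (len(n), n)):
        st = os.stat(os.path.join(dev_dir, name))
        rows.append({
            "name": name,
            "mode": stat.filemode(st.st_mode),
            "uid": st.st_uid,
            "gid": st.st_gid,
            "ibdev": _read_attr(os.path.join(sysfs_dir, name, "ibdev")),
            "port": _read_attr(os.path.join(sysfs_dir, name, "port")),
        })
    return rows


if __name__ == "__main__":
    for row in describe_umad_devices():
        print(row)
```

Saving this output on a failing node and on a working node and diffing the two would show whether, for example, umad20 maps to a different device or port than expected.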

hyungkwangc · April 24, 2024, 9:57am

Hi

Based on your outputs, it seems opensm is not working.
Please check that opensm is running and that the cable to the opensm node is connected properly.

/HyungKwang

beepunithkumar · April 24, 2024, 1:10pm

Hi HyungKwang,

Thanks for the reply.

I compared the opensm configuration on the nodes where ibhosts and sminfo fail against the nodes where they succeed, and the opensm config is the same on both.

Even though sminfo command fails, ibdiagnet shows the Subnet Manager details (master and standby).

Can you please suggest what validation steps or commands would help confirm a cabling issue or some other opensm-related issue?

Punith
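One check that does not depend on sending MADs from userspace is to read the port attributes the kernel exposes in sysfs. The sketch below assumes the usual `/sys/class/infiniband/<device>/ports/<n>/` layout with `state`, `phys_state`, `link_layer`, `lid`, and `sm_lid` files; the function names `read_port_info` and `suspicious_ports` are hypothetical. It flags InfiniBand ports that are not ACTIVE or that report an SM LID of 0, which would point toward a link or subnet manager problem rather than a local send restriction.

```python
import os


def _read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def read_port_info(sysfs_root="/sys/class/infiniband"):
    """Collect per-port attributes for every RDMA device under sysfs_root."""
    ports = []
    if not os.path.isdir(sysfs_root):
        return ports
    for dev in sorted(os.listdir(sysfs_root)):
        ports_dir = os.path.join(sysfs_root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir), key=lambda p: (len(p), p)):
            base = os.path.join(ports_dir, port)
            info = {"device": dev, "port": port}
            for attr in ("state", "phys_state", "link_layer", "lid", "sm_lid"):
                info[attr] = _read_attr(os.path.join(base, attr))
            ports.append(info)
    return ports


def suspicious_ports(ports):
    """Return InfiniBand ports that are not ACTIVE or have no SM LID set.

    Assumes state looks like "4: ACTIVE" and sm_lid like "0x33".
    """
    flagged = []
    for p in ports:
        if p["link_layer"] != "InfiniBand":
            continue
        active = p["state"] is not None and p["state"].endswith("ACTIVE")
        sm_lid = int(p["sm_lid"], 16) if p["sm_lid"] else 0
        if not active or sm_lid == 0:
            flagged.append(p)
    return flagged


if __name__ == "__main__":
    for p in suspicious_ports(read_port_info()):
        print(p["device"], p["port"], p["state"], p["phys_state"], p["sm_lid"])
```

If this prints nothing on a failing node, every InfiniBand port is ACTIVE with an SM LID assigned, which would be consistent with ibdiagnet still seeing the master and standby SMs.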

hyungkwangc · April 25, 2024, 1:00am

Hi

Please go to the SM node, connect it by cable to another IB device, and then run 'sminfo' as root.
If sminfo returns a failure, you will need to reinstall OFED and start from scratch.
Without your lab environment info, it's hard to figure out exactly what your problem is.

Please open a CASE and try to get Technical Support on it.

/HyungKwang
