I have discovered a, let’s say, remarkable error in the script behind the mst ib add command. Trying to run mlx* commands on a switch resulted in errors like:
$ mlxconfig -d /dev/mst/SW_MT47396_r1i0sw0"_lid-0x0019,mthca0,1 --enable_verbosity q
ibwarn: [1687] _do_madrpc: recv failed: Connection timed out
ibwarn: [1687] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 25)
ibwarn: [1687] _do_madrpc: recv failed: Connection timed out
ibwarn: [1687] mad_rpc: _do_madrpc failed; dport (Lid 25)
ibwarn: [1687] _do_madrpc: recv failed: Connection timed out
ibwarn: [1687] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 25)
-E- ibvsmad : cr access read to Lid 25 failed
FATAL - crspace read (0xf0014) failed: Invalid argument
-E- Failed to identify the device
Querying the switch in the example above revealed that it actually has LID 19. Apparently, whoever wrote this script believes that converting to hex means generating a string with the decimal number prefixed with ‘0x’. The mst device name therefore contains lid-0x0019, which the mlx* tools parse as hexadecimal 0x19, i.e. LID 25. As a result, accessing the switch with LID 19 via its mst device actually sends requests to the (inactive) LID 25. I corrected the script to convert to hex properly; patch attached.
mst_ib_add.py.patch (1.69 KB)
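To illustrate the difference, here is a minimal Python sketch. It assumes the script builds the lid-0x... part of the device name with a printf-style format string; the function names lid_to_hex_buggy and lid_to_hex are hypothetical and are not taken from mst_ib_add.py or the attached patch.

```python
def lid_to_hex_buggy(lid: int) -> str:
    # Hypothetical reconstruction of the bug: the decimal digits of the LID
    # are zero-padded and simply prefixed with "0x".
    return "0x%04d" % lid


def lid_to_hex(lid: int) -> str:
    # Correct conversion: format the LID as a zero-padded hexadecimal number.
    return "0x%04x" % lid


if __name__ == "__main__":
    lid = 19
    wrong = lid_to_hex_buggy(lid)
    right = lid_to_hex(lid)
    # Parsing the string back as hex shows which LID the tools will contact.
    print(wrong, "->", int(wrong, 16))
    print(right, "->", int(right, 16))
```

For LID 19 the buggy variant yields 0x0019, which parses back to 25, while the correct variant yields 0x0013, which round-trips to 19.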