I am seeking help with an issue I am experiencing on a cluster that uses ConnectX-5 devices. I want to assign 64 PKeys to each server, but I’m running into some problems. Here’s what I have done so far:
I created a partitions.conf file, with each line formatted like =0x80**,ipoib:ALL=full
After applying the partitions.conf, OpenSM complains that the switch has a PKey capacity of only 8.
I then created 64 IPoIB interfaces corresponding to the 64 PKeys on two servers.
The first 32 PKeys work well, but the remaining 32 do not. “Ping” shows network unreachable, and OpenSM complains with log_trap_info: Received Generic Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:3 TID:xxx
I modified the partitions.conf file, changing each line to =0x80**,ipoib:ALL_CAS=full
OpenSM no longer complains about the 8 PKey capacity limit.
However, the network connectivity issue persists. The first 32 PKeys still work well, but the rest do not. “Ping” shows network unreachable, and OpenSM still complains with the same error message as before.
I would appreciate any guidance or suggestions to resolve this issue and successfully assign 64 PKeys for each server in the cluster. Thank you in advance for your assistance!
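For reference, the setup described in the steps above could be generated with a short script along the following lines. This is only a sketch under stated assumptions: the partition names (pkeyNN), the PKey range 0x8001 to 0x8040, the trailing semicolon on each line, and the parent interface name ib0 are illustrative and not copied from the actual files. The second function only prints the ip link commands; it does not run them.

```shell
#!/bin/sh
# Print one partitions.conf line per PKey.
# Assumed (hypothetical) naming: pkey01..pkey40 mapped to 0x8001..0x8040.
gen_partitions() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        # e.g. pkey01=0x8001,ipoib:ALL_CAS=full;
        printf 'pkey%02x=0x80%02x,ipoib:ALL_CAS=full;\n' "$i" "$i"
        i=$((i + 1))
    done
}

# Print (do not execute) the iproute2 commands that would create one
# IPoIB child interface per PKey on the given parent (ib0 is an assumption).
gen_ipoib_cmds() {
    parent=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        pkey=$(printf '0x80%02x' "$i")
        echo "ip link add link $parent name $parent.${pkey#0x} type ipoib pkey $pkey"
        i=$((i + 1))
    done
}

gen_partitions 64
gen_ipoib_cmds ib0 64
```

Redirecting the first function's output to a file and reviewing the printed ip link commands before running them keeps the 64 entries consistent between OpenSM and the hosts.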
I tried running smpquery pkeytables and smpquery pkeytable, but neither worked; the command only printed its usage text:
$ sudo smpquery pkeytables
Usage: smpquery [options] <op> <dest dr_path|lid|guid> [op params]
Supported ops (and aliases, case insensitive):
NodeInfo (NI) <addr>
NodeDesc (ND) <addr>
PortInfo (PI) <addr> [<portnum>]
PortInfoExtended (PIE) <addr> [<portnum>]
SwitchInfo (SI) <addr>
PKeyTable (PKeys) <addr> [<portnum>]
SL2VLTable (SL2VL) <addr> [<portnum>]
VLArbitration (VLArb) <addr> [<portnum>]
GUIDInfo (GI) <addr>
MlnxExtPortInfo (MEPI) <addr> [<portnum>]
Options:
--combined, -c use Combined route address argument
--node-name-map <file> node name map file
--extended, -x use extended speeds
--config, -z <config> use config file, default: /etc/infiniband-diags/ibdiag.conf
--Ca, -C <ca> Ca name to use
--Port, -P <port> Ca port number to use
--Direct, -D use Direct address argument
--Lid, -L use LID address argument
--Guid, -G use GUID address argument
--timeout, -t <ms> timeout in ms
--sm_port, -s <lid> SM port lid
--show_keys, -K display security keys in output
--m_key, -y <key> M_Key to use in request
--errors, -e show send and receive errors
--verbose, -v increase verbosity level
--debug, -d raise debug level
--help, -h help message
--version, -V show version
Examples:
smpquery portinfo 3 1 # portinfo by lid, with port modifier
smpquery -G switchinfo 0x2C9000100D051 1 # switchinfo by guid
smpquery -D nodeinfo 0 # nodeinfo by direct route
smpquery -c nodeinfo 6 0,12 # nodeinfo by combined route
I think it failed for two reasons: (1) the op name is PKeyTable (alias PKeys), not pkeytables, and (2) a destination argument <dest dr_path|lid|guid> is required.
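Going by the usage text above, a corrected invocation would presumably name the op as pkeys (or pkeytable) and pass a destination. The sketch below assumes the switch is at LID 3 (the LID reported in the OpenSM trap) and that port 1 is the port of interest; both values are assumptions. It prints the command by default and only executes it when RUN=1 is set, since smpquery sends its query from a host with an InfiniBand adapter rather than running on the switch itself.

```shell
#!/bin/sh
# Assumed defaults: LID 3 is taken from the trap message, port 1 is a guess.
SWITCH_LID=${SWITCH_LID:-3}
PORT=${PORT:-1}

# Build the query using the op alias and argument order from the usage text:
#   PKeyTable (PKeys) <addr> [<portnum>]
pkey_query_cmd() {
    echo "smpquery pkeys $1 $2"
}

cmd=$(pkey_query_cmd "$SWITCH_LID" "$PORT")
if [ "${RUN:-0}" = "1" ]; then
    # Executes on the local host; root privileges may be required.
    $cmd
else
    echo "would run: $cmd"
fi
```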
I don’t follow your point. Do you mean that the smpquery command needs to be run on the switch? The QM8790 is an externally managed switch, so it seems that I cannot log in to it and execute commands.