OpenMPI over PKEY

I’m trying to get Open MPI to communicate over a specific InfiniBand partition.
I have a default pkey and a separate pkey partition.
I have two compute nodes, cn0001 and cn0002.
Both are limited members of the default pkey and full members of pkeyA (0xa0).
They can ping and ssh each other over pkeyA.
pkeyA is on the child interface ib1.00a0. The default partition is on ib1.

When I try to run Open MPI over that pkey network, it seems to connect from cn0001-pkey to cn0002-pkey, but when Open MPI reaches back to cn0001 it seems to take a different route and hangs. I have tried restricting traffic to ib1.00a0 alone with --mca btl_tcp_if_include and with --mca btl_tcp_if_exclude, but the job still hangs either way.
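
For reference, here is a minimal sketch of how such a restricted launch line could be assembled. It is an illustration, not the exact command from this thread. The helper name build_mpirun_command and the program ./a.out are hypothetical. The oob_tcp_if_include parameter is an assumption: it is the Open MPI 4.x name for pinning the runtime's out-of-band channel, and newer releases may spell it differently. Only the include form is used, because btl_tcp_if_include and btl_tcp_if_exclude are mutually exclusive.

```python
import shlex


def build_mpirun_command(iface, hosts, program, np=2):
    """Return an mpirun argument list that pins TCP traffic to one interface."""
    return [
        "mpirun",
        "-np", str(np),
        "--host", ",".join(hosts),
        # Restrict the TCP BTL (MPI payload traffic) to the pkey child interface.
        "--mca", "btl_tcp_if_include", iface,
        # Assumption: the out-of-band (runtime) channel may also need pinning.
        # oob_tcp_if_include is the Open MPI 4.x name for that parameter.
        "--mca", "oob_tcp_if_include", iface,
        program,
    ]


# Host and interface names are taken from the post; ./a.out is a placeholder.
cmd = build_mpirun_command("ib1.00a0", ["cn0001-pkey", "cn0002-pkey"], "./a.out")
print(shlex.join(cmd))
```

If the job is not actually using the TCP BTL (for example, if a native InfiniBand transport is selected), these TCP parameters would have no effect on the MPI traffic itself.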

I know it’s trying to reach back over the default pkey because Open MPI works if I set the nodes’ membership in the default pkey to full instead of limited. Is there some command-line option I’m missing?
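
As background on why membership matters here: in InfiniBand, the top bit of the 16-bit P_Key marks full membership, and two limited members of the same partition cannot talk to each other. The sketch below encodes that rule. The function names are hypothetical, and the values 0x7fff and 0xffff are the conventional limited and full forms of the default pkey, not values read from this cluster.

```python
FULL_MEMBER_BIT = 0x8000  # top bit of the 16-bit P_Key marks full membership
PARTITION_MASK = 0x7FFF   # low 15 bits identify the partition


def is_full_member(pkey):
    """True if the P_Key value carries the full-membership bit."""
    return bool(pkey & FULL_MEMBER_BIT)


def can_communicate(pkey_a, pkey_b):
    """Two ports can talk if they share a partition and at least one is full."""
    same_partition = (pkey_a & PARTITION_MASK) == (pkey_b & PARTITION_MASK)
    return same_partition and (is_full_member(pkey_a) or is_full_member(pkey_b))


# Both nodes limited in the default partition: no node-to-node traffic there.
print(can_communicate(0x7FFF, 0x7FFF))
# Both nodes full in pkeyA (0xa0 with the membership bit set is 0x80a0).
print(can_communicate(0x80A0, 0x80A0))
```

Under this rule, any traffic that ends up on the default partition between two limited nodes would be dropped, which is consistent with the hang going away once membership is set to full.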

Hi jcantrell1,

This is outside my area of expertise, and since this forum is focused on the NVHPC compilers, I’m not sure anyone else here will have an answer for you.

I’ve reached out to some folks who might be able to help, but I’m not hopeful they’ll know either.

Since this seems to be an Open MPI-specific question, you might try sending a note to the Open MPI users’ mailing list. See the “Getting help” page in the Open MPI main documentation for details.

-Mat

I think I figured this out. It worked over the default pkey because the subnet manager had a lower MTU configured for the default partition than for pkeyA. For some reason it works with MTU=2048 on that IB partition, but not with MTU=4096. I guess that’s a problem for another day.
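
In case it helps anyone hitting the same symptom, here is a rough sketch of how the interface MTU on each node could be compared against the partition MTU. It assumes IPoIB datagram mode, where the interface MTU is the InfiniBand MTU minus the 4-byte IPoIB header (2044 for 2048, 4092 for 4096). In connected mode the interface MTU can be much larger and this check would not apply. The function names are hypothetical. The path /sys/class/net/<interface>/mtu is the standard Linux sysfs location.

```python
from pathlib import Path

IPOIB_HEADER_BYTES = 4  # IPoIB encapsulation header (datagram mode assumed)


def expected_ipoib_mtu(ib_mtu):
    """Interface MTU expected in datagram mode for a given partition MTU."""
    return ib_mtu - IPOIB_HEADER_BYTES


def read_interface_mtu(iface, sysfs_root="/sys/class/net"):
    """Read the current MTU of a network interface from sysfs."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def mtu_matches_partition(iface, ib_mtu, sysfs_root="/sys/class/net"):
    """True if the interface MTU equals the value implied by the partition MTU."""
    return read_interface_mtu(iface, sysfs_root) == expected_ipoib_mtu(ib_mtu)
```

Running mtu_matches_partition("ib1.00a0", 4096) on each node would show whether the child interface actually picked up the larger MTU. A mismatch between nodes, or between a node and the fabric, is one possible reason large messages stall while ping and ssh still work.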