OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

I’ve installed MLNX_OFED_LINUX-5.5-1.0.3.2, including the user-space programs and libraries, on two nodes of my cluster, in order to validate the performance before updating the whole cluster to Mellanox OFED. The cluster runs PBS 18.1.4. If I allocate nodes for interactive use and run tests like ib_read_bw or ucx_perftest manually, these tests work fine and the bandwidth is as expected.

However, when I try to use the OpenMPI delivered along with the MLNX_OFED distribution, by running /usr/mpi/gcc/openmpi-4.1.2rc2/bin/mpirun from a PBS job submission script set up to allocate two nodes and run one MPI rank on each node, PBS sets the OMPI_COMM_WORLD_LOCAL_SIZE variable to 2 for each of my ranks, instead of 1. The $PBS_NODEFILE content turns out OK: it indeed lists both requested nodes, and the ranks do get run on different nodes, but this variable is just wrong (and OMPI_COMM_WORLD_LOCAL_RANK is set to 0 and 1 for the two ranks, respectively, instead of 0 for both).

So, is there any known incompatibility between the mentioned versions of PBS and MLNX_OFED, or any other hint on how to fix this issue?
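For reference, here is a minimal sketch of the kind of job script I’m using (the resource selection and walltime are illustrative; the mpirun path is the one from the MLNX_OFED install):

```bash
#!/bin/bash
#PBS -l select=2:ncpus=1:mpiprocs=1
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR

# Show what PBS allocated
cat $PBS_NODEFILE

# Launch one rank per node and print what OpenMPI thinks the local layout is
/usr/mpi/gcc/openmpi-4.1.2rc2/bin/mpirun --hostfile $PBS_NODEFILE -np 2 \
    sh -c 'echo "$(hostname): LOCAL_SIZE=$OMPI_COMM_WORLD_LOCAL_SIZE LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK"'
```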

Thanks.

OK, I’ve figured out in the meantime that the OpenMPI delivered along with MLNX_OFED is not built with tm support, so the MPI ranks in my case actually all get assigned to a single node, and OMPI_COMM_WORLD_LOCAL_SIZE and OMPI_COMM_WORLD_LOCAL_RANK are in fact correct for that placement. Namely, the nodes are read from the PBS-generated file that lists one node per line, but since OpenMPI is built without tm support, mpirun doesn’t query PBS for the actual allocation; it assumes the first node listed in the given hostfile has as many slots as it has cores, so both ranks end up on that node.
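This is easy to see with a quick check (hostnames and output here are illustrative):

```bash
$ cat $PBS_NODEFILE
node01
node02

# Without tm support, mpirun treats node01 as having one slot per core
# and fills it up first, so both ranks land on the same host:
$ mpirun --hostfile $PBS_NODEFILE -np 2 hostname
node01
node01
```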

So I guess my question becomes: is this OpenMPI version simply not meant to be used along with Torque/PBS, or are there some workarounds?
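(Update, for anyone hitting the same issue: one workaround that should avoid rebuilding, though I haven’t validated it myself, is to rewrite the PBS nodefile with an explicit slot count per node before handing it to mpirun; my_app here is just a placeholder:

```bash
# Collapse duplicate entries and pin each node to a single slot
sort -u $PBS_NODEFILE | sed 's/$/ slots=1/' > hostfile.$PBS_JOBID
mpirun --hostfile hostfile.$PBS_JOBID -np 2 ./my_app
```

Alternatively, --map-by node should make mpirun place ranks round-robin across nodes instead of filling the first one.)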

In the meantime, I’ve built my own OpenMPI version, using the --with-tm flag to activate PBS support, and it works fine; I kept using the UCX delivered along with MLNX_OFED.
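For reference, the configure line was along these lines (the install prefix and the PBS and UCX install locations are specific to my setup):

```bash
./configure --prefix=/opt/openmpi-4.1.2 \
    --with-tm=/opt/pbs \
    --with-ucx=/usr
make -j && make install
```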

It doesn’t seem Mellanox is listening much to this community forum, but nevertheless I’d like to conclude with an appeal to have the OpenMPI included in MLNX_OFED built with the --with-tm flag in the future. If I understand it correctly, there is no harm in building with Torque/PBS support, and most Linux distributions build with this flag anyway, so it would be good to have it activated for the MLNX_OFED build of OpenMPI too.

Hello,

The OpenMPI we ship is not compiled with the --with-tm flag for Torque/PBS support, because we build only against open, publicly available sources and not against proprietary dependencies. You can, however, rebuild OpenMPI with PBS support yourself, as you did; the sources are part of our HPC-X package.

Best Regards,

Viki
