Request support/help for PBS with OpenMPI

Hello,

I have been having difficulties running my MPI+OpenACC code on multiple GPU nodes on NASA’s Pleiades machine. It seems that the openmpi library included with PGI does not contain support for the PBS scheduler.

Is there a way to have this included?

If not, is there a way to have the source code of openmpi included, along with the configuration line used by PGI, so we can easily modify it to include PBS through the "--with-tm=" flag?

Failing that, could you give me the configure options used for the included openmpi so I can download openmpi and compile it from source myself?

I have tried:
./configure --with-tm=/PBS --with-cuda=/path/to/pgi/cuda/10.0 --with-wrapper-cflags="-D__LP64__ -ta:tesla" --prefix=/path/to/where/i/want/openmpi

with openmpi 4.0.1 but it fails with device initialization errors.

Am I missing something in the config options? What options does PGI use?

Thanks,

  • Ron

Hi Ron,

Here is how we configure our builds here at PGI for Open MPI. Note that we ship Open MPI 3.1.3 right now, but most of the config options should be the same. Within the openmpi-3.1.3 directory:

mkdir build

cd build

../configure --enable-shared --enable-static --without-tm --enable-mpi-cxx --disable-wrapper-runpath --with-cuda=/path/to/cuda --prefix=/path/to/install/dir CC=pgcc CXX=pgc++ FC=pgfortran CPP=cpp CFLAGS="-O1" CXXFLAGS="-O1" FCFLAGS="-O1"

Obviously you will want to swap out the --without-tm flag for PBS support. We do not normally build our Open MPI packages with PBS support for production here, as this would introduce a runtime dependency on libtm, and customers who do not wish to use PBS would still be required to install it on their systems.
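
For reference, a PBS-enabled variant of the above would look something like the following. This is only a sketch: the TM prefix (/opt/pbs here) and the install paths are placeholders that you would replace with the actual locations on Pleiades.

mkdir build
cd build
../configure --enable-shared --enable-static --with-tm=/opt/pbs --enable-mpi-cxx --disable-wrapper-runpath --with-cuda=/path/to/cuda --prefix=/path/to/install/dir CC=pgcc CXX=pgc++ FC=pgfortran CPP=cpp CFLAGS="-O1" CXXFLAGS="-O1" FCFLAGS="-O1"
make -j install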

If you still run into issues with your Open MPI after trying the above, let us know here.

Regards,

+chris

Hi,

I tried using your configuration line and now I can run on multiple nodes!

However, it runs very slowly.

For example, running on a single 4xV100 node takes 832 seconds, and running on a single 8xV100 node takes 698 seconds (the problem is small, so it does not scale well).
However, when I try to run on 2 4xV100 nodes (8 total) it takes 1799 seconds.

I noticed that I am getting the messages:
7 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[r101i0n1:07676] Set MCA parameter “orte_base_help_aggregate” to 0 to see all help / error messages
[r101i0n1:07676] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init

Do you know if there is a separate configuration flag to point to the infiniband somehow?

Any ideas on what could be happening?

  • Ron

Hi Ron,

Sorry, I guess I should have mentioned that! We don’t explicitly pass flags to build Open MPI with OFED support, because Open MPI’s configure script automatically detects the OFED libraries in the regular system library directories on our build system.

If you are using OFED and have the files installed in their own directory that the system is not picking up by default, you can also explicitly pass --with-verbs=/path/to/verbs/dir to tell Open MPI where to find them.

If your cluster has UCX installed to manage the interconnect communication, you can pass --with-ucx=/path/to/ucx/dir instead of, or in addition to, the verbs flag.
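
For example, the relevant extra piece of the configure line would look something like this (a sketch only; the /path/to/... locations are placeholders for wherever your site installs OFED or UCX):

../configure ... --with-verbs=/path/to/ofed --with-cuda=/path/to/cuda --prefix=/path/to/install/dir
../configure ... --with-ucx=/path/to/ucx --with-cuda=/path/to/cuda --prefix=/path/to/install/dir

The first form lets Open MPI drive the InfiniBand verbs interface directly; the second hands the interconnect over to UCX.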

I am not sure which network type you are using, but this FAQ entry lists all the possible network configuration flags that Open MPI supports:

https://www.open-mpi.org/faq/?category=building#build-p2p

Hope this helps.

+chris

Hi,

So it turns out the openmpi was in fact detecting the infiniband (I think) as the code does run on multiple nodes (just slow).

I have further tested the code on both PLEIADES at NASA HECC and COMET at SDSC.
COMET is set up to support GPUdirect RDMA but when running under openmpi and PGI,
it seems that this feature is not being supported/activated (at least according to the output of openmpi).
On PLEIADES, there is no support at all for GPUdirect RDMA.

To see if the slow run speeds on PLEIADES are due to the lack of GPUdirect, I ran the
same simulation with the same code on COMET using multiple nodes.

To recap, the simulation times (in seconds) on PLEIADES are as follows:

PLEIADES:
1 NODE, 1xV100: 2421.9
1 NODE, 4xV100: 832.1
1 NODE, 8xV100: 698.1
2 NODES, 4xV100 each: 1798.5

Ignoring the poor scaling of the run to 8 GPUs (the size of the problem is small)
we see that using 8 GPUs with 2 nodes (4 each) is over twice as slow as using 8 GPUs on one node.

Switching over to COMET, we find:

COMET:

1 NODE, 1xP100: 3227.3
1 NODE, 4xP100: 1170.6
2 NODES, 2xP100 each: 1170.7
2 NODES, 4xP100 each: 967.3
4 NODES, 2xP100 each: 923.4

Here we see that using the same number of GPUs on 1 node versus multiple nodes
yields almost the same run-times. In fact, running on 4 nodes (2 GPUs each)
yields a communication time of 464.3 seconds, while running on 2 nodes (4 GPUs each) has a communication time of 428.1 seconds.
Therefore, the overhead of communication between nodes is not that bad.


To try to understand what is going on, I checked the output of nvidia-smi topo -m.
On COMET, I get:

GPU0 GPU1 GPU2 GPU3 mlx4_0 CPU Affinity
GPU0 X PIX SYS SYS PHB 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26
GPU1 PIX X SYS SYS PHB 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26
GPU2 SYS SYS X PIX SYS 1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27
GPU3 SYS SYS PIX X SYS 1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27
mlx4_0 PHB PHB SYS SYS X

while on PLEIADES, I get:

GPU0 GPU1 GPU2 GPU3 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity
GPU0 X NV2 NV2 SYS NODE NODE SYS SYS 0-17
GPU1 NV2 X SYS NV1 PIX PIX SYS SYS 0-17
GPU2 NV2 SYS X NV2 SYS SYS NODE NODE 18-35
GPU3 SYS NV1 NV2 X SYS SYS PIX PIX 18-35
mlx5_0 NODE PIX SYS SYS X PIX SYS SYS
mlx5_1 NODE PIX SYS SYS PIX X SYS SYS
mlx5_2 SYS SYS NODE PIX SYS SYS X PIX
mlx5_3 SYS SYS NODE PIX SYS SYS PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks



On COMET, everything seems reasonable in that each GPU sees its partner on the socket as PIX, and the
other 2 GPUs on the other socket as SYS.

However, on PLEIADES, the results seem strange. It looks like each GPU on the node has
NVLink access to 2 other GPUs and SYS access to the last GPU. Does this imply a 3-way NVLink is being used?

Does this information help in diagnosing the slow speeds when using multiple GPU nodes on PLEIADES?
From the topology, I would not expect using multiple nodes to be much slower than a single node
with the same number of GPUs, since COMET was also not using GPUdirect and shows much better performance over the
network.

I have tried numerous openmpi flags and bindings, but they do not seem to help the run-times.

Thanks,

Ron

Hi Ron,

This is a bit beyond either my or Chris's area of expertise, but Chris is going to reach out to other folks within NVIDIA to see if they might have ideas.

Also, you may want to see if NASA can get in contact with the NVIDIA Solution Architect (SA) assigned to their account (I am not sure who it is, though). SAs should be better able to provide insights into hardware and network issues.

-Mat

Hi Ron,

Just FYI - we are primarily a compiler development and support organization here, so some of your concerns regarding Open MPI performance may fall a bit outside of our area of experience. However, I have reached out to some other people within NVIDIA who can hopefully help us out with these concerns you have raised here.

One of our Open MPI engineers immediately responded back with two suggestions that I wanted to pass along:

  1. He is concerned that your PLEIADES cluster may not be using InfiniBand for the Open MPI transport. Here is what he said:

Reading the thread, this comment is worrying:

« So it turns out the openmpi was in fact detecting the infiniband (I think) as the code does run on multiple nodes (just slow). »

No, this is actually a good sign that InfiniBand was not detected and he’s running on TCP/IP.


This is how you check that the openib BTL is there (in Open MPI 1.10.7) and force it to be used:

$ ompi_info | grep openib
MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.7)
$ mpirun -mca btl openib,smcuda,self ...

And this is for UCX (in Open MPI 4.0.0):

$ ompi_info | grep ucx
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.0)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.0)
$ mpirun -mca pml ucx ...

  2. He also mentioned that the previous support of InfiniBand in Open MPI is deprecated as of Open MPI 4.x, and the Open MPI developers are recommending that everyone switch to using UCX instead, and then build Open MPI against UCX. UCX will manage the InfiniBand transport for Open MPI going forward, rather than having Open MPI manage it directly. Based on their recommendations, we are actually planning to update our bundled Open MPI to a 4.x release and change the configuration to use UCX with the 20.1 release, due in early 2020.

This link provides a good overview of how to build UCX and link Open MPI against it:

Note that you do not need to clone Open MPI source from their github repo as described in the OpenMPI and OpenSHMEM Installation section. It should work fine with Open MPI 4.0.1, for example. You will need to be sure you are using an up-to-date version of UCX, though.
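
As a quick sanity check (assuming the UCX binaries and your new Open MPI are on your PATH), you can confirm which UCX version you are building against and that the resulting Open MPI picked it up:

ucx_info -v          # prints the UCX version and the options it was configured with
ompi_info | grep ucx # should list the ucx PML/OSC components after the rebuild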

Hopefully this will give you some ideas to try, going forward.

Good luck,

+chris

Hi Ron,

Here is a step-by-step guide to configuring Open MPI with UCX that another NVIDIA engineer just passed along to me:

  1. Setup UCX
    =============


    a. Download and build gdrcopy (optional but recommended)
    git clone https://github.com/NVIDIA/gdrcopy.git   # (see the build/install instructions in the gdrcopy README)
    cd gdrcopy/
    sudo make PREFIX=/usr CUDA=/usr/local/cuda all install
    sudo ./insmod.sh


Copy library .so and header to /usr or wherever you decide GDRCOPY_HOME is

sudo cp libgdrapi.so.1 /usr/lib64/
sudo cp libgdrapi.so /usr/lib64/
sudo cp libgdrapi.so.1.4 /usr/lib64/
sudo cp gdrapi.h /usr/include




b. Download UCX
Either download the latest release from https://github.com/openucx/ucx/releases or clone the master branch:
git clone https://github.com/openucx/ucx.git

c. Build UCX with cuda-support
UCX_HOME=/usr/local/ucx
CUDA_HOME=/usr/local/cuda
GDRCOPY_HOME=/usr

./autogen.sh # if configure isn’t present
sudo apt-get install libnuma-dev # if libnuma-dev isn’t installed
./configure --prefix=$UCX_HOME --with-cuda=$CUDA_HOME --with-gdrcopy=$GDRCOPY_HOME --enable-mt
sudo make -j install


Export paths to access binaries and libraries later

export LD_LIBRARY_PATH=$GDRCOPY_HOME/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$UCX_HOME/lib:$LD_LIBRARY_PATH
export PATH=$UCX_HOME/bin:$PATH




2. Setup OpenMPI



a. Download OpenMPI
Either download the latest release (recommended):
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
tar xfz openmpi-4.0.1.tar.gz
cd openmpi-4.0.1


or clone master branch
git clone https://github.com/open-mpi/ompi.git


b. Build OpenMPI with cuda-support
OMPI_HOME=/usr/local/openmpi
./autogen.pl # if configure isn’t present
./configure --prefix=$OMPI_HOME --enable-mpirun-prefix-by-default --with-cuda=$CUDA_HOME --with-ucx=$UCX_HOME --with-ucx-libdir=$UCX_HOME/lib --enable-mca-no-build=btl-uct --with-pmix=internal
sudo make -j install


3. Run osu-micro-benchmarks (OMB)

a. Download OMB
Either download from http://mvapich.cse.ohio-state.edu/benchmarks/ (recommended) or clone:
git clone https://github.com/forresti/osu-micro-benchmarks.git


b. Build OMB
MPI_HOME=$OMPI_HOME
…/configure --enable-cuda --with-cuda-include=$CUDA_HOME/include --with-cuda-libpath=$CUDA_HOME/lib64 CC=$MPI_HOME/bin/mpicc CXX=$MPI_HOME/bin/mpicxx --prefix=$PWD
make -j install

c. Run OMB
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$UCX_HOME/lib:$LD_LIBRARY_PATH
cat hostfile
hsw210 slots=1 max-slots=1
hsw211 slots=1 max-slots=1
mpirun -np 2 --hostfile $PWD/hostfile --mca pml ucx -x UCX_MEMTYPE_CACHE=n -x UCX_TLS=rc,mm,cuda_copy,gdr_copy,cuda_ipc -x LD_LIBRARY_PATH $PWD/get_local_ompi_rank $PWD/mpi/pt2pt/osu_bw D D
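
One optional extra check before the benchmark run (my own suggestion, not part of the original steps): confirm that UCX was really built with CUDA support, for example:

ucx_info -v                 # shows the configure line UCX was built with (look for --with-cuda)
ucx_info -d | grep -i cuda  # should list cuda_copy / cuda_ipc (and gdr_copy) transports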

Hope you find this helpful.

+chris

Hi,

Here is an update on my issues on Pleiades running multi-node GPU runs:

  1. Using OpenMPI 4.0.1 or 4.0.2rc1 with PBS and UCX (latest stable or latest master) and PGI 18.10 causes my code to crash. Certain types of runs work, and my other GPU code works, but it crashes in one specific routine. That routine works fine in all my other multi-node GPU tests on other systems and in the tests below, so I do not think it is a code bug, but rather a UCX bug (a similar error message appears in their online bug reports).

  2. I was unable to compile OpenMPI 3.x using PGI 18.10 - I got compilation errors.

  3. Using OpenMPI 2.1.2 with PGI 18.10, compiled with PBS and verbs, WORKS on multi-node runs! Yay!
    The timing result on two 4xV100 nodes is similar to that on a single 8xV100 node (exactly the same computation time, and a tad slower MPI time, as expected).

  4. Using OpenMPI 4.0.2rc1 with PGI 18.10, compiled with PBS and verbs, also works (no crashes) but is VERY slow. It also spits out:

[r101i0n2:16086] 15 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[r101i0n2:16086] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[r101i0n2:16086] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init

I assume this has to do with OpenMPI not supporting verbs in versions 4.x.
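
If I understand the 4.x behavior correctly, the openib BTL now refuses to drive InfiniBand ports unless it is explicitly allowed, so something like the following might at least silence those warnings (I have not verified whether it helps the run-times):

mpirun --mca btl openib,smcuda,self --mca btl_openib_allow_ib true ...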


Soooo

Basically, for my code there seems to be a bug in UCX, which means I cannot use OpenMPI 4.x, but I can get my runs done using OpenMPI 2.x with verbs. (I assume PGI 19.x can compile OpenMPI 3.x, since it ships with it, so I will also assume 19.x will work with verbs on PBS.)

You said that PGI will start shipping OpenMPI 4.x with UCX in the next release, but please take this test into account, as that combination currently crashes my runs.

I can provide a reproducer that you can test with if you would like.

  • Ron

Hi Ron,

Thanks for the feedback.

I have seen some of the same issues with Open MPI + UCX, too, and have raised them with our internal contacts for Open MPI and Mellanox. I will check and see if they are aware of this particular issue. It seems like things are in a bit of flux right now, as development with Open MPI and UCX is rapidly evolving, and things do not always seem to be in sync between the two. Much of this is beyond our direct control within the PGI group, so all we can do is make sure the appropriate people are aware of the issues, and hope they will be addressed soon.

Regards,

+chris

FYI - I did a little more investigation here:

It seems Open MPI 4.0.1 has some bugs with regard to UCX support. I have not yet checked whether the recent Open MPI 4.0.2 release candidate has been fixed to work with UCX 1.6.0, or if you have to update to an Open MPI development snapshot from their git repo (which appears to be slated for release as Open MPI 4.1.0 eventually) to achieve compatibility with UCX 1.6.0.

The unfortunate thing is that documentation about which versions of UCX are compatible with which versions of Open MPI appears to be lacking, so a lot of this information only seems to be attainable through trial and error.

There do exist a few PGI compiler-related issues with Open MPI + UCX that have been opened in our bug tracker, and are currently under investigation. All this is to say that this is a very active area of development on multiple fronts right now - both on the Open MPI + UCX side and the PGI side - so stay tuned!

I am glad you found something to get you by in the meantime, and very sorry about the difficulties you have encountered. Will update once we have more progress to report.

Best regards,

+chris

Thanks for the update!

If you come across a working combo of versions (OpenMPI+UCX+PGI) that works for CUDA-aware MPI in OpenACC codes (especially on PBS) please let me know!

  • Ron

Hi there,

I have been happily using OpenMPI 3 for some time now, but I am once again trying out OpenMPI 4 because the systems I have access to now have GPU direct RDMA CUDA-aware MPI enabled and I really want to use that.

I am currently trying this on the Bridges2 system at PSC.

Their module is “openmpi/4.0.5-pgi20.11” which is using the OpenMPI 4 library located here:
…/20.11-mpi4/Linux_x86_64/20.11/comm_libs/openmpi4/openmpi-4.0.5/

I am getting the same issues I did before with the code seg faulting, even when running on 1 GPU.

Have the OpenMPI+UCX issues been fixed?
If so, is there a specific version combination that works?

I have sent a reproducer in the past but can send another one if needed.
Also, the POT3D code would be a good code to test as well.

What direction would you suggest I go?

  • Ron

Ron:

We have some changes in the works with regard to Open MPI in the NVIDIA HPC SDK that we hope to announce soon. I cannot say much more than that at the moment, though.

I am a bit familiar with POT3D, but if you have a particular configuration and/or test case involving POT3D that you would like for us to look at, please let me know. Particularly if you have something that exemplifies the issues you are seeing. I think this would be a good thing to get into our regression test suites.

Thanks!

+chris


Hi,

Well that sounds happily mysterious :)

As for POT3D, we just today released the code on a github repo at: https://github.com/predsci/POT3D
In that repo, there are the same tests as we submitted for the SPEC benchmark, as well as other example runs (see the testsuite and examples folders).

You can test any of those on a single GPU, multiple GPUs in a node, or multi-node, multi-GPU (be aware of the problem sizes, as some are too big to fit on 1 GPU).

I have found that OpenMPI v3 works for all cases (sometimes needing a hostfile on some systems), but OpenMPI v4 + UCX seg faults even on a single GPU.
The error is similar to the one in this thread that Mat has commented on:
https://stackoverflow.com/questions/64281545/how-to-enable-cuda-aware-openmpi

If the github version of POT3D ends up working across multi-GPU nodes, then I should be set.
The possible exception is that our main code (MAS) uses allocatable arrays within a derived type in the host_data region for the MPI calls, while POT3D only has straight-up allocatable arrays, so maybe there would be a difference there.
I plan to create a small reproducible test/mini-app for that part of the MAS code, but I do not have it in good enough form yet.

  • Ron

Hi,

It has been a while since this post and I wanted to give an update / question.

I am now trying to use the same MPI+OpenACC codes on the Delta-GPU system at NCSA (NVIDIA A100s) as part of their early access period.

Their system does NOT have InfiniBand (so no GPUDirect RDMA, correct?).
Instead, it has HPE's Slingshot network.

The NVIDIA HPC SDK default is version 22.2, with OpenMPI 4.1.2 and UCX 1.11.2.

Happily, our codes actually work on the system - even with multi-node runs!
So it looks like the seg faulting issues in this thread have been worked out!

Unhappily, the MPI communication times for MPI_Isend and MPI_Irecv are very, very long and kill our scaling performance.

For example, if I run the code on one of Delta’s 8-GPU nodes I get a comm time of 480 seconds (12% of the runtime), but when I run on 2 of the 4-GPU nodes, I get a comm time of 2660 seconds (43% of the runtime). The computation time for each run is the same.

I have been working with the NCSA staff to try to address this but to no avail.

I have tried multiple ways of launching the job (e.g. mpirun --bind-to core --map-by ppr:1:numa --report-bindings --mca btl self,vader,smcuda) but to no avail.
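
For reference, one of the binding approaches I have experimented with is a small wrapper script that maps each local rank to a GPU (a sketch only; the script name and the choice of CUDA_VISIBLE_DEVICES are my own, and this may conflict with any device selection the code does internally):

cat bind_gpu.sh
#!/bin/bash
# Map each MPI rank to one GPU on its node using OpenMPI's local-rank variable.
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"

mpirun -np 8 --map-by ppr:4:node ./bind_gpu.sh ./pot3d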

I also ran with OpenMPI 3 (which comes with the compiler), and that reduced the comm time to 1742 seconds (33%), but that is still too high compared to some old runs I did with OpenMPI 3 on NASA's Pleiades V100 nodes.

So I was wondering if you had any ideas about what to try?

This issue exists in the POT3D code within the SPEC HPC benchmark.
Has that benchmark been made to run efficiently on multi-node systems using HPE's Slingshot and OpenACC? [The CUDA-aware MPI is used by wrapping the MPI calls in OpenACC's host_data use_device().]

Thanks for any tips!

– Ron

Can you run the OSU Benchmarks on this cluster, using the latest OpenMPI/UCX/GDRcopy/CUDA?
This would show whether GDRcopy is working or not.

On this page
https://portal.xsede.org/ncsa-delta
it says that it has GDRcopy. The OSU Benchmarks should show whether it’s working or not.
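
Something along these lines should exercise device-to-device bandwidth between two nodes (a sketch only; the hostnames and the path to the osu_bw binary are placeholders, and on Slingshot the right transport layer may be OFI/libfabric rather than UCX):

mpirun -np 2 -H node1,node2 --mca pml ucx /path/to/osu_bw D D

The two "D D" arguments ask osu_bw to place both the send and receive buffers in device (GPU) memory.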

Hi,

NCSA messaged me that they were unsure if it works or not:

"Also, Delta's not an Infiniband based cluster. 'A low latency and high bandwidth HPE/Cray Slingshot interconnect between compute nodes' ... no ibverbs. I recall that early versions of gpu-direct required infiniband, so that may be worth noting here."

As for the OSU benchmarks - it looks like NCSA has already provided a module for it!

-------------------------- /sw/spack/delta-2022-03/modules/lmod/openmpi/4.1.2-a76heua/nvhpc/22.2 ---------------------------
osu-micro-benchmarks/5.7.1

Could you please post the command that I should use to test the multi-node GPU communication?

– Ron

Put me in touch with the NCSA admins and I'll take a look.
My NCSA/Blue Waters keyfob has expired; I'd like to ask them to re-activate my account.