Call to collective MPI subroutine with OpenACC host_data directive

shatrov.oleg.a · March 23, 2021, 8:03pm

Dear all,

I am developing a library that performs 2D and 3D FFTs using MPI. Recently I started testing this library with cuFFT as the device 1D FFT executor. The library uses MPI derived datatypes to send and receive aligned data.
I noticed that 99.99% of the run time on the GPU is spent in a single MPI_Alltoall call. When I profiled the execution, it showed more than 2 million calls to MemCpy (HtoD).

Profile of a single rank can be found here

Can somebody explain how this works and why this is happening?

I am using PGI 20.4.

Best regards,
Oleg

MatColgrove · March 24, 2021, 5:48pm

Hi Oleg,

If I understand correctly, you have an MPI program using OpenACC to manage your data, where you enclose an MPI_Alltoall call in a “host_data” region in order to use CUDA Aware MPI, but are seeing this large number of memcpys?

Are you using CUDA Unified Memory (i.e. -gpu=managed) by chance? If so, OpenMPI can’t tell that managed variables are on the device, so this could be triggering the issue. Otherwise, I’m not sure. I’ve used MPI_Alltoall with CUDA Aware MPI and did not see similar behavior. I’d need a reproducing example which shows the issue in order to investigate.
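One quick check, since the host_data approach relies on it, is whether the Open MPI found on PATH was built with CUDA support at all. A minimal sketch, assuming ompi_info is on PATH and reports the mpi_built_with_cuda_support parameter; the function name cuda_aware_status is made up for illustration:

```shell
#!/bin/bash
# Report whether the Open MPI found first on PATH was built with CUDA support.
# Prints "true", "false", or "unknown" (ompi_info missing or parameter absent).
cuda_aware_status() {
  if ! command -v ompi_info >/dev/null 2>&1; then
    echo "unknown"
    return 0
  fi
  # The matching parsable line ends in ...:mpi_built_with_cuda_support:value:true
  # (or :false), so only the suffix needs to be inspected.
  case "$(ompi_info --parsable --all 2>/dev/null \
          | grep 'mpi_built_with_cuda_support:value' | head -n 1)" in
    *:true)  echo "true" ;;
    *:false) echo "false" ;;
    *)       echo "unknown" ;;
  esac
}

echo "CUDA-aware Open MPI: $(cuda_aware_status)"
```

A result of "false" or "unknown" would point at the MPI installation rather than at the OpenACC code.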

-Mat

shatrov.oleg.a · March 24, 2021, 5:59pm

Hi Mat,

You did understand me correctly. My call to MPI_Alltoall looks like this:

  subroutine transpose(self, send, recv)
    class(transpose_t), intent(in)    :: self       !< Transposition class
    type(*),            intent(in)    :: send(..)   !< Send buffer
    type(*),            intent(inout) :: recv(..)   !< Recv buffer

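    ! Assert that both buffers are already present on the device
!$acc data present(send, recv)
    ! Pass the device addresses of send/recv to MPI (requires CUDA-aware MPI)
!$acc host_data use_device(send, recv)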
    if(self%is_even) then 
      call MPI_Alltoall(send, 1, self%send%dtypes(1), recv, 1, self%recv%dtypes(1), self%comm)
    else
      call MPI_Alltoallw(send, self%send%counts, self%send%displs, self%send%dtypes,          &
                         recv, self%recv%counts, self%recv%displs, self%recv%dtypes, self%comm)
    endif
!$acc end host_data
!$acc end data
  end subroutine transpose

I am not using managed memory.
However, I am using MPI derived datatypes: a combination of vector, hvector, contiguous and resized types. Could this be the issue?

Best regards,
Oleg.

MatColgrove · March 24, 2021, 6:44pm

Possibly, though I’ve never tried this before so I don’t know. If this is the case, then it would be an issue with the MPI implementation rather than anything to do with the compiler.

What MPI implementation are you using?

Assuming you’re using the OpenMPI we shipped with 20.4 (v3.1.5), you might try downloading NVHPC 21.2 and then build with OpenMPI 4.0 to see if they made improvements.

shatrov.oleg.a · March 24, 2021, 8:04pm

I already tried that. The results are the same.
I ran the ompi_info shipped with hpc_sdk 21.2. It looks like it is version 3.1.5, not 4.0:

#!/bin/bash

EXE=$(basename $0)

MY_PATH=$(readlink -f $0)
MY_DIR=$(dirname $MY_PATH)
OMPI_ROOT=$(readlink -f $MY_DIR/..)
export OPAL_PREFIX=$OMPI_ROOT

# for -Mscalapack
export MPILIBNAME=openmpi
export MPILIBVER=3.1.5
export MPIDIR=$OMPI_ROOT

$MY_DIR/.bin/$EXE "$@"

Also, when I run the executable, Open MPI prints a lot of warnings. I am not sure whether they are related to this problem:

--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'gpu03', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
 
Please see this FAQ entry for more details:
 
  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
 
NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
 
  Local host:            gpu03
  Device name:           i40iw0
  Device vendor ID:      0x8086
  Device vendor part ID: 14290
 
Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
 
NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
 
  Local host:           gpu03
  Local device:         i40iw0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------

and

[gpu03:131337] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[gpu03:131337] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[gpu03:131337] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
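The first two warnings above each name the MCA parameter that turns them off. A minimal sketch of setting those parameters through environment variables, relying on Open MPI's OMPI_MCA_ prefix for MCA parameters; ./my_app is a placeholder executable name. This only hides the messages and is not expected to change the MPI_Alltoall timing:

```shell
#!/bin/bash
# Open MPI reads MCA parameters from environment variables named OMPI_MCA_<param>.
# Both parameter names are taken from the warning text itself.
export OMPI_MCA_btl_openib_warn_default_gid_prefix=0
export OMPI_MCA_btl_openib_warn_no_device_params_found=0

# Equivalent form on the command line (./my_app is a placeholder):
#   mpirun --mca btl_openib_warn_default_gid_prefix 0 \
#          --mca btl_openib_warn_no_device_params_found 0 ./my_app
echo "openib warning parameters set for this shell"
```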

MatColgrove · March 25, 2021, 4:12pm

21.2 ships both versions, with 3.1.5 as the default. To use 4.0, set your PATH to point to “/base/path/to/nvhpc/Linux_x86_64/21.2/comm_libs/openmpi4/openmpi-4.0.5/bin/”, adjusting the base path as needed.
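As a sketch of that PATH change: NVHPC_BASE and OMPI4_ROOT below are illustrative variable names, and /base/path/to/nvhpc is the same placeholder as above, so adjust it for your installation.

```shell
#!/bin/bash
# NVHPC_BASE is a placeholder for the install prefix; adjust it for your site.
NVHPC_BASE="${NVHPC_BASE:-/base/path/to/nvhpc}"
OMPI4_ROOT="$NVHPC_BASE/Linux_x86_64/21.2/comm_libs/openmpi4/openmpi-4.0.5"

# Put the 4.0.5 launcher and compiler wrappers ahead of the default 3.1.5 ones.
export PATH="$OMPI4_ROOT/bin:$PATH"

# Confirm which launcher is picked up now, and which version it reports.
if command -v mpirun >/dev/null 2>&1; then
  command -v mpirun
  mpirun --version
else
  echo "mpirun not found; check NVHPC_BASE" >&2
fi
```

If the reported version is still 3.1.5, another Open MPI bin directory is probably being prepended to PATH later, for example by a module or a login script.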

As for the warning, I’m not sure, though it may be a configuration issue with the OFED driver or UCX. I did a web search and found this thread at OpenMPI where a user posted a similar issue: Is OpenMPI supporting RDMA? · Issue #5789 · open-mpi/ompi · GitHub

Note that the OpenMPI 4.0 that we ship does include UCX support, but 3.1.5 does not.

shatrov.oleg.a · March 26, 2021, 9:02am

Unfortunately, openmpi4 fails in MPI_Init:

#0  0x00007fffe195dd15 in __memcpy_ssse3_back ()
   from /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6
#1  0x00007fffe0f24097 in pmix3x_value_load ()
    at ../../../../../opal/mca/pmix/pmix3x/pmix3x.c:861
#2  0x00007fffe0f2a188 in pmix3x_put ()
    at ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c:581
#3  0x00007ffff742880e in mca_pml_ucx_send_worker_address_type ()
    at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:105
#4  0x00007ffff7427b50 in mca_pml_ucx_send_worker_address ()
    at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:149
#5  0x00007ffff7426109 in mca_pml_ucx_init ()
    at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:313
#6  0x00007ffff742b1f9 in mca_pml_ucx_component_init ()
    at ../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:97
#7  0x00007ffff7424968 in mca_pml_base_select ()
    at ../../../../ompi/mca/pml/base/pml_base_select.c:126
#8  0x00007ffff7474fbe in ompi_mpi_init ()
    at ../../ompi/runtime/ompi_mpi_init.c:646
#9  0x00007ffff730847c in PMPI_Init () at pinit.c:69
#10 0x00007ffff778ed39 in ompi_init_f () at pinit_f.c:84
#11 0x000000000040fb44 in test_c2c_2d () at oacc_test1.F90:33
#12 0x0000000000403543 in main ()
#13 0x0000000000000000 in ?? ()

I have not figured this out yet…

shatrov.oleg.a · March 26, 2021, 7:16pm

Dear Mat,

I managed to run the code with Open MPI 4.0.5. The executable had been linked to the wrong UCX library, not the one shipped with hpc_sdk. The execution time was even slower…
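For anyone hitting the same MPI_Init crash, one way to see which UCX libraries an executable resolves at run time is ldd. A minimal sketch; the function name linked_ucx_libs and the default path ./oacc_test1 are placeholders, not names confirmed in this thread:

```shell
#!/bin/bash
# List the UCX shared libraries (libucp, libuct, libucs, libucm) that the
# dynamic loader would resolve for an executable, with their full paths.
# Prints nothing if the executable does not link against UCX.
linked_ucx_libs() {
  ldd "$1" 2>/dev/null | grep -E 'libuc[pstm]\.so' || true
}

# ./oacc_test1 is a placeholder; pass your own binary as the first argument.
exe="${1:-./oacc_test1}"
if [ -x "$exe" ]; then
  linked_ucx_libs "$exe"
else
  echo "executable not found: $exe" >&2
fi
```

Paths that fall outside the hpc_sdk installation tree suggest a system UCX is being picked up, in which case LD_LIBRARY_PATH is the first thing to check.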

I created an issue in the open-mpi repository. Maybe I will find some help there.

Thanks

Best regards,
Oleg

MatColgrove · March 26, 2021, 7:24pm

Sounds good Oleg. I’ll be interested in what they say. If they say it’s a problem with our build of OpenMPI, let me know and I’ll ask the person that does our builds to investigate.
