Dear all,
I am developing a library that performs 3D and 2D FFTs using MPI. Recently I started testing this library with cuFFT as the device-side 1D FFT executor. My library uses MPI derived datatypes to send and receive aligned data.
I noticed that when I run the code on the GPU, 99.99% of the time is spent in a single MPI_Alltoall call. Profiling the execution showed more than 2 million calls to MemCpy (HtoD).
A profile of a single rank can be found here.
Can somebody explain how this works and why this is happening?
I am using PGI 20.4.
Best regards,
Oleg
Hi Oleg,
If I understand correctly, you have an MPI program that uses OpenACC to manage your data, with an MPI_Alltoall call enclosed in a “host_data” region in order to use CUDA-aware MPI, but you are seeing this large number of memcpys?
Are you using CUDA Unified Memory (i.e. -gpu=managed) by chance? If so, OpenMPI can’t tell that managed variables are on the device, so this could be triggering the issue. Otherwise, I’m not sure. I’ve used MPI_Alltoall with CUDA-aware MPI and did not see similar behavior. I’d need a reproducing example that shows the issue in order to investigate.
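One quick sanity check, assuming you’re running the OpenMPI we ship: verify that the build actually reports CUDA support. This is the standard OpenMPI check, shown here only as an illustration:

  ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
  # a CUDA-aware build should report a line ending in "value:true"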
-Mat
Hi Mat,
You understood me correctly. My call to MPI_Alltoall looks like this:
subroutine transpose(self, send, recv)
  class(transpose_t), intent(in)    :: self      !< Transposition class
  type(*),            intent(in)    :: send(..)  !< Send buffer
  type(*),            intent(inout) :: recv(..)  !< Recv buffer

  !$acc data present(send, recv)
  !$acc host_data use_device(send, recv)
  if(self%is_even) then
    ! Even split: a single derived datatype per rank (regular Alltoall)
    call MPI_Alltoall(send, 1, self%send%dtypes(1), recv, 1, self%recv%dtypes(1), self%comm)
  else
    ! Uneven split: per-rank counts, displacements and datatypes (Alltoallw)
    call MPI_Alltoallw(send, self%send%counts, self%send%displs, self%send%dtypes, &
                       recv, self%recv%counts, self%recv%displs, self%recv%dtypes, self%comm)
  endif
  !$acc end host_data
  !$acc end data
end subroutine transpose
I am not using managed memory.
But I am using MPI derived datatypes: a combination of vector, hvector, contiguous, and resized types. Can this be an issue?
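For context, my datatypes are assembled roughly along these lines. This is only a simplified sketch: the routine name and the particular sizes, strides, and counts are illustrative, not my actual decomposition:

  subroutine build_send_type(nx, ny, nz, nx_local, dtype)
    use mpi_f08
    implicit none
    integer, intent(in) :: nx, ny, nz        ! global grid dimensions (illustrative)
    integer, intent(in) :: nx_local          ! slab width owned by one destination rank
    type(MPI_Datatype), intent(out) :: dtype
    type(MPI_Datatype) :: contig, vec, hvec
    integer(MPI_ADDRESS_KIND) :: lb, extent, plane_bytes

    ! Contiguous chunk of one row that goes to a single destination rank
    call MPI_Type_contiguous(nx_local, MPI_DOUBLE_COMPLEX, contig)
    ! Repeat that chunk for every row of a plane, striding over the full row
    call MPI_Type_vector(ny, 1, nx/nx_local, contig, vec)
    ! Repeat the plane pattern over all nz planes using a byte stride
    plane_bytes = int(nx, MPI_ADDRESS_KIND) * ny * 16_MPI_ADDRESS_KIND   ! 16 bytes per double complex
    call MPI_Type_create_hvector(nz, 1, plane_bytes, vec, hvec)
    ! Shrink the extent so consecutive per-rank blocks pack tightly in MPI_Alltoall
    lb = 0_MPI_ADDRESS_KIND
    extent = int(nx_local, MPI_ADDRESS_KIND) * 16_MPI_ADDRESS_KIND
    call MPI_Type_create_resized(hvec, lb, extent, dtype)
    call MPI_Type_commit(dtype)
  end subroutine build_send_type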
Best regards,
Oleg.
Possibly? Though I’ve never tried this before, so I don’t know. If this is the case, then it would be an issue with the MPI implementation rather than anything to do with the compiler.
What MPI implementation are you using?
Assuming you’re using the OpenMPI we shipped with 20.4 (v3.1.5), you might try downloading NVHPC 21.2 and then building with the bundled OpenMPI 4.0 to see if they have made improvements.
I already tried that. The results are the same.
I ran the ompi_info shipped with HPC SDK 21.2. It looks like it is version 3.1.5, not 4.0:
#!/bin/bash
EXE=$(basename $0)
MY_PATH=$(readlink -f $0)
MY_DIR=$(dirname $MY_PATH)
OMPI_ROOT=$(readlink -f $MY_DIR/..)
export OPAL_PREFIX=$OMPI_ROOT
# for -Mscalapack
export MPILIBNAME=openmpi
export MPILIBVER=3.1.5
export MPIDIR=$OMPI_ROOT
$MY_DIR/.bin/$EXE "$@"
Also, when I run the executable, Open MPI prints a lot of warnings. I am not sure whether they are related to this problem…
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'gpu03', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: gpu03
Device name: i40iw0
Device vendor ID: 0x8086
Device vendor part ID: 14290
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: gpu03
Local device: i40iw0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
and
[gpu03:131337] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[gpu03:131337] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[gpu03:131337] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
21.2 ships with both, with the default being 3.1.5. To use 4.0, set your PATH to point to “/base/path/to/nvhpc/Linux_x86_64/21.2/comm_libs/openmpi4/openmpi-4.0.5/bin/”, adjusting the base path as needed.
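For example, something like this should pick up the 4.0.5 build (the base path below is only an illustration; adjust it to your install):

  export PATH=/base/path/to/nvhpc/Linux_x86_64/21.2/comm_libs/openmpi4/openmpi-4.0.5/bin:$PATH
  which mpirun                     # should now resolve into the openmpi4 tree
  ompi_info | grep "Open MPI:"     # should report 4.0.5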
As for the warnings, I’m not sure, though it may be a configuration issue with the OFED driver or UCX. I did a web search and found this issue on the open-mpi GitHub where a user posted a similar problem: Is OpenMPI supporting RDMA? · Issue #5789 · open-mpi/ompi · GitHub
Note that the OpenMPI 4.0 that we ship does include UCX support, but 3.1.5 does not.
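If you want to confirm which build you are using, ompi_info should only list the UCX components in the 4.0 build (again, just an illustration):

  ompi_info | grep -i ucx    # the 4.0.5 build should list e.g. "MCA pml: ucx ..."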
Unfortunately, openmpi4 fails at MPI_Init:
#0 0x00007fffe195dd15 in __memcpy_ssse3_back ()
from /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6
#1 0x00007fffe0f24097 in pmix3x_value_load ()
at ../../../../../opal/mca/pmix/pmix3x/pmix3x.c:861
#2 0x00007fffe0f2a188 in pmix3x_put ()
at ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c:581
#3 0x00007ffff742880e in mca_pml_ucx_send_worker_address_type ()
at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:105
#4 0x00007ffff7427b50 in mca_pml_ucx_send_worker_address ()
at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:149
#5 0x00007ffff7426109 in mca_pml_ucx_init ()
at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:313
#6 0x00007ffff742b1f9 in mca_pml_ucx_component_init ()
at ../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:97
#7 0x00007ffff7424968 in mca_pml_base_select ()
at ../../../../ompi/mca/pml/base/pml_base_select.c:126
#8 0x00007ffff7474fbe in ompi_mpi_init ()
at ../../ompi/runtime/ompi_mpi_init.c:646
#9 0x00007ffff730847c in PMPI_Init () at pinit.c:69
#10 0x00007ffff778ed39 in ompi_init_f () at pinit_f.c:84
#11 0x000000000040fb44 in test_c2c_2d () at oacc_test1.F90:33
#12 0x0000000000403543 in main ()
#13 0x0000000000000000 in ?? ()
I did not quite figure it out…
Dear Mat,
I managed to run the code with Open MPI 4.0.5. The executable was linked against the wrong UCX library, not the one shipped with the HPC SDK. The execution time was even slower…
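For reference, this is roughly how I checked which UCX was being picked up (the path below is illustrative; adjust it to your openmpi4 install):

  # mca_pml_ucx.so is the Open MPI component that pulls in UCX; see which libucp it resolves to
  ldd /base/path/to/nvhpc/Linux_x86_64/21.2/comm_libs/openmpi4/openmpi-4.0.5/lib/openmpi/mca_pml_ucx.so | grep libucp
  # in my case the fix was making sure LD_LIBRARY_PATH resolves libucp to the UCX bundled with the SDK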
I created an issue in the open-mpi repository. Maybe I will find some help there.
Thanks
Best regards,
Oleg
Sounds good, Oleg. I’ll be interested in what they say. If it turns out to be a problem with our build of OpenMPI, let me know and I’ll ask the person who does our builds to investigate.