Problems with MPI and OpenACC in NVHPC SDK 21.3: try-catch block prevents parallelization and annoying output is created

I’m encountering two problems using the OpenMPI and OpenACC shipped with the NVHPC SDK 21.3.

Here is a minimal example:

#include <iostream>
#include <openacc.h>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    try {
        int mpi_rank = -1;
        int mpi_size = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

        uint64_t size = 8;

        cuDoubleComplex* psi1 = new cuDoubleComplex[size]();
        cuDoubleComplex* psi2 = new cuDoubleComplex[size]();

        for (uint64_t i = 0; i < size; ++i)
            psi1[i].x = mpi_rank;

        #pragma acc parallel loop copyin(psi1[size]) copyout(psi2[size])
        for (uint64_t i = 0; i < size; ++i)
            psi2[i] = psi1[i];

        for (uint64_t i = 0; i < size; ++i)
            std::cout << "[" << mpi_rank << "]: " << psi2[i].x << ", " << psi2[i].y << std::endl;

        delete[] psi2;
        delete[] psi1;
        MPI_Finalize();
    } catch (...) {
        return -1;
    }
    return 0;
}

I compiled and ran it using

mpic++ --std=c++17 -acc -fast -mp -gopt -gpu=lineinfo -Minfo=accel -Mcuda -Mcudalib=cublas -Wall -Wextra -pedantic minex.cpp -o minex -llapack -lblas -fortranlibs
mpirun -n 2 -q minex

The problems are:

  1. If the try/catch block is present (not commented out), nvc++ does not parallelize the trivial acc loop:
main:
     22, Generating copyout(psi2[:size]) [if not already present]
         Generating copyin(psi1[:size]) [if not already present]
         Generating Tesla code
         25, #pragma acc loop seq
     25, Complex loop carried dependence of psi1->x,psi2->x,psi1->y prevents parallelization
         Loop carried dependence of psi2->x,psi1->y prevents parallelization
         Loop carried backward dependence of psi2->x,psi1->y prevents vectorization
     26, Accelerator restriction: induction variable live-out from loop: i
  2. Each MPI process and each thread creates annoying empty files in my directory of the form 0_r0_t1, 1_r1_t1, … Why is that, and how can I turn it off? Additionally, if I don’t run with -q, I see the following output from OpenMPI, which I don’t know what to do with:
--------------------------------------------------------------------------
[[23575,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: dxer

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
...
[dxer:916980] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[dxer:916980] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Version information from mpic++ --version:

nvc++ 21.3-0 LLVM 64-bit target on x86-64 Linux -tp skylake
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Thanks and best wishes,
Dennis

Hi Dennis,

Accelerator restriction: induction variable live-out from loop: i

My best guess is that this is a scoping issue where “i” is not getting treated as scoped within the try block. Hence the compiler assumes it could be accessed and modified in the catch handler, which makes it live-out of the loop and prevents parallelization. I’ve added a problem report (TPR #30296) and sent it to engineering for review.
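
In other words, it behaves as if the loop variable had been declared outside the try block. Here is a minimal sketch of that guess, using the same psi1/psi2/size from your example (illustration only, not necessarily what the front end literally does):

uint64_t i;   // effectively hoisted to function scope, per the guess above
try {
    // ... MPI setup and allocations as in the original ...
    #pragma acc parallel loop copyin(psi1[:size]) copyout(psi2[:size])
    for (i = 0; i < size; ++i)      // 'i' now looks live after the loop ...
        psi2[i] = psi1[i];
} catch (...) {
    // ... because it is still in scope here and could be read or modified,
    // so the compiler reports "induction variable live-out" and runs the
    // loop sequentially (#pragma acc loop seq).
}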

The workaround is to declare the loop variable in the try block rather than in the for statement. Something like:

% cat minex2.cpp
#include <iostream>
#include <openacc.h>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    try {
        int mpi_rank = -1;
        int mpi_size = 0;
#ifdef WORKS
        uint64_t ii;
#endif
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

        uint64_t size = 8;

        cuDoubleComplex* psi1 = new cuDoubleComplex[size]();
        cuDoubleComplex* psi2 = new cuDoubleComplex[size]();

        for (uint64_t i = 0; i < size; ++i)
            psi1[i].x = mpi_rank;

        #pragma acc parallel loop copyin(psi1[size]) copyout(psi2[size])
#ifdef WORKS
        for (ii = 0; ii < size; ++ii)
#else
        for (uint64_t ii = 0; ii < size; ++ii)
#endif
            psi2[ii] = psi1[ii];

        for (uint64_t i = 0; i < size; ++i)
            std::cout << "[" << mpi_rank << "]: " << psi2[i].x << ", " << psi2[i].y << std::endl;

        delete[] psi2;
        delete[] psi1;
        MPI_Finalize();
    } catch (...) {
        return -1;
    }
    return 0;
}
% mpicxx -acc -cuda minex2.cpp -Minfo=accel --c++17
main:
     24, Generating copyout(psi2[:size]) [if not already present]
         Generating copyin(psi1[:size]) [if not already present]
         Generating Tesla code
         30, #pragma acc loop seq
     30, Complex loop carried dependence of psi1->x,psi1->y,psi2->x prevents parallelization
         Loop carried dependence of psi2->x,psi1->y prevents parallelization
         Loop carried backward dependence of psi2->x,psi1->y prevents vectorization
     32, Accelerator restriction: induction variable live-out from loop: ii
% mpicxx -acc -cuda minex2.cpp -Minfo=accel --c++17 -DWORKS
main:
     28, Generating copyout(psi2[:size]) [if not already present]
         Generating copyin(psi1[:size]) [if not already present]
         Generating Tesla code
         28, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

For the MPI issue, I’m not sure, but I will ask the folks who manage our OpenMPI builds if they have any ideas.
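
As a side note, the openib warning itself can be silenced by setting the MCA parameter that the message points to (this only hides the warning; it doesn’t address the stray files), e.g.:

mpirun --mca btl_base_warn_component_unused 0 -n 2 ./minex

or, equivalently, once via the environment:

export OMPI_MCA_btl_base_warn_component_unused=0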

Thanks,
Mat


Thanks a lot for your quick reply, Mat. Your explanation sounds reasonable.

Looking forward to hearing from your MPI guys about the other issue.

Best,
Dennis

Any news from the MPI guys on how to suppress the 0_r0_t1, 1_r1_t1, … files?

Thanks,
Dennis

You will need to wait for 21.7. The bug is in the OpenACC runtime and there is no workaround.
