Is there a general template to write hybrid MPI and OpenACC code?

I recently decided to write a hybrid code using MPI and OpenACC, and I want to know whether there is a general template I can follow to convert serial code to MPI + OpenACC.
In this link:

Hybrid MPI and OpenMP

You can see that MPI and OpenMP can be used as a hybrid, and that thread has a general template.
Could you also please provide a simple C or C++ code written as a hybrid (MPI + OpenACC)?

Thank you in advance
-Hisham

Hi Hisham,

While a little old, Jiri’s training from GTC2015 might be a good place to start:

The only thing I’d change in Jiri’s code is the device assignment. He’s doing a direct MPI rank number to device number mapping, which is fine for a single node running the same number of ranks as available devices, but it doesn’t work across nodes or if you have more ranks than devices. I typically use the following boilerplate code for device assignment. It uses the local rank id and then round-robins the device assignment.

#ifdef _OPENACC
#include <openacc.h>
#endif
// ... later in the code, after MPI_Init is called ...
#ifdef _OPENACC
  acc_device_t my_device_type;
  int num_devices;
  int gpuId;
  MPI_Comm shmcomm;
  int local_rank;

  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &shmcomm);
  MPI_Comm_rank(shmcomm, &local_rank);
  my_device_type = acc_get_device_type();
  num_devices = acc_get_num_devices(my_device_type);
  gpuId = local_rank % num_devices;
  acc_set_device_num(gpuId, my_device_type);
#endif

Thank you Mat for your reply.
I read the links that you gave me, and I am still confused about how to make my code hybrid.
If I have this code:

#include <iostream>
#include <cstdio>
using namespace std;
int main()
{
    int n, sum = 0;
    scanf("%d", &n);

    #pragma acc parallel loop
    for (int i = 1; i <= n; ++i)
    {
        #pragma acc atomic update
        sum = sum + i;
    }
    printf("Sum = %d\n", sum);
    return 0;
}

How can I make it hybrid (with MPI) here?

Are you asking for a tutorial on how to use MPI? That’s too big a topic to cover here, but there are plenty of tutorials out there (like https://mpitutorial.com/)

For this code, are you thinking you want to partition the loop iterations across multiple MPI ranks and then do an MPI_Allreduce to obtain the final sum? If so, take a look at: mpitutorial/reduce_avg.c at gh-pages · mpitutorial/mpitutorial · GitHub

You’d then add the OpenACC device assignment code and put an OpenACC parallel loop around the “for” loop that’s doing the local sum.
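
If it helps, here’s a minimal sketch of that combination for your sum code. The chunking scheme, variable names, and the fixed problem size are just illustrative, and I’ve used a reduction clause instead of atomics for the local sum:

#include "mpi.h"
#include <cstdio>
#ifdef _OPENACC
#include <openacc.h>
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

#ifdef _OPENACC
    // Device assignment (same boilerplate as above): round-robin the
    // node-local ranks over the available devices
    MPI_Comm shmcomm;
    int local_rank;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);
    MPI_Comm_rank(shmcomm, &local_rank);
    acc_device_t dtype = acc_get_device_type();
    int ndev = acc_get_num_devices(dtype);
    acc_set_device_num(local_rank % ndev, dtype);
#endif

    // Fixed problem size so every rank agrees on it without extra I/O
    const int n = 1000000;

    // Partition the iterations 1..n into one contiguous chunk per rank
    int chunk = (n + nranks - 1) / nranks;
    int start = rank * chunk + 1;
    int end   = start + chunk - 1;
    if (end > n) end = n;

    long long local_sum = 0, total_sum = 0;

    // Each rank offloads only its own chunk; a reduction clause is
    // simpler and faster than per-iteration atomics for this pattern
    #pragma acc parallel loop reduction(+:local_sum)
    for (int i = start; i <= end; ++i) {
        local_sum += i;
    }

    // Combine the partial sums from all ranks
    MPI_Allreduce(&local_sum, &total_sum, 1, MPI_LONG_LONG, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0) printf("Sum = %lld\n", total_sum);
    MPI_Finalize();
    return 0;
}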

While OpenACC can certainly be used within MPI-enabled programs, they are two separate and distinct things. The only MPI-specific things with regard to OpenACC are how to set the device each rank uses and how to use the OpenACC “host_data” region to pass CUDA device pointers to the MPI call if you’re using CUDA Aware MPI.
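
As a bare-bones sketch of those two pieces together (the buffer name and size are illustrative, and it assumes a CUDA Aware MPI and at least two ranks):

#include "mpi.h"
#include <openacc.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // 1) Device selection: simplest form, rank modulo number of devices
    //    (use the local-rank version above when running across nodes)
    acc_device_t dtype = acc_get_device_type();
    acc_set_device_num(rank % acc_get_num_devices(dtype), dtype);

    const int N = 1024;
    double *buf = new double[N];
    #pragma acc enter data create(buf[:N])

    // 2) "host_data use_device" makes "buf" refer to the device address
    //    inside the region, so a CUDA Aware MPI reads/writes GPU memory
    //    directly with no explicit update directives
    #pragma acc host_data use_device(buf)
    {
        if (rank == 0)
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    #pragma acc exit data delete(buf)
    delete [] buf;
    MPI_Finalize();
    return 0;
}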

I agree with you on this point

That’s too big a topic to cover here, but there are plenty of tutorials out there

and I am working on understanding MPI well.
My goal is to target a cluster system with my code, but first I want to understand the hybridization between MPI and OpenACC. I saw the links that you provided, but I have an issue regarding the last point that you mentioned:

The only MPI-specific things with regard to OpenACC are how to set the device each rank uses and how to use the OpenACC “host_data” region to pass CUDA device pointers to the MPI call if you’re using CUDA Aware MPI.

I don’t know how to use the device assignment and host_data with MPI.
I have this simple code that sends a value from one process and prints it in another, but I don’t know how to add OpenACC and make it run as hybrid

#include "mpi.h"
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
	MPI_Init(NULL, NULL);
	int rank;
	MPI_Status status;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	int x,y;
	if (rank == 0)
	{
		for (int i = 1; i < 10; i++)
		{
			x = i;
			MPI_Send(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
		}
	}
	else
	{
		for (int i = 1; i < 10; i++)
		{
			MPI_Recv(&y, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
			printf(" the received data is: %d\n", y);
		}
	}
	MPI_Finalize();
	return 0;
}

thank you in advance
-Hisham

There’s no opportunity to use the GPU for this code so I modified it to populate an array rather than use scalars.

I’ve also included conditional compilation depending on whether your MPI supports CUDA Aware MPI (not all do). With CUDA Aware MPI, the data is transferred directly from device to device. Without it, you need to copy the data between the host and the device.

% cat testmpi.cpp
#include "mpi.h"
#include <iostream>
#ifdef _OPENACC
#include <openacc.h>
#endif

#ifndef NUM
#define NUM 1024
#endif

using namespace std;

int main(int argc, char *argv[])
{
        MPI_Init(NULL, NULL);
        int rank;
        int *x;
        MPI_Status status;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#ifdef _OPENACC
        acc_device_t my_device_type;
        int num_devices;
        int gpuId;
        MPI_Comm shmcomm;
        int local_rank;

        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);
        MPI_Comm_rank(shmcomm, &local_rank);
        my_device_type = acc_get_device_type();
        num_devices = acc_get_num_devices(my_device_type);
        gpuId = local_rank % num_devices;
        acc_set_device_num(gpuId, my_device_type);
        printf("Rank %d is using Device %d of %d\n",rank,gpuId,num_devices);
#endif

        x = new int[NUM];
#pragma acc enter data create(x[:NUM])

        if (rank == 0)
        {
             #pragma acc parallel loop present(x)
             for (int i = 0; i < NUM; ++i) {
                x[i] = i;
             }
#ifdef USE_CUDA_AWARE_MPI
#pragma acc host_data use_device(x)
{
#else
#pragma acc update self(x[:NUM])
#endif
             MPI_Send(x, NUM, MPI_INT, 1, 1, MPI_COMM_WORLD);
#ifdef USE_CUDA_AWARE_MPI
}
#endif
        }
        else
        {
#ifdef USE_CUDA_AWARE_MPI
#pragma acc host_data use_device(x)
{
#endif
             MPI_Recv(x, NUM, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
#ifdef USE_CUDA_AWARE_MPI
}
#else
#pragma acc update device(x[:NUM])
#endif
             #pragma acc parallel loop present(x)
             for (int i = 0; i < 10; ++i) {
                printf(" the received data is: %d\n", x[i]);
             }
        }
#pragma acc exit data delete(x)
        delete [] x;
        MPI_Finalize();
        return 0;
}
% mpicxx testmpi.cpp -acc -Minfo=accel
main:
     40, Generating enter data create(x[:1024])
     41, Generating present(x[:1])
         Generating Tesla code
         43, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     52, Generating update self(x[:1024])
     63, Generating update device(x[:1024])
         Generating present(x[:1])
         Generating Tesla code
         70, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
     75, Generating exit data delete(x[:1])
% mpirun -np 2 a.out
Rank 0 is using Device 0 of 4
Rank 1 is using Device 1 of 4
 the received data is: 0
 the received data is: 1
 the received data is: 2
 the received data is: 3
 the received data is: 4
 the received data is: 5
 the received data is: 6
 the received data is: 7
 the received data is: 8
 the received data is: 9
% mpicxx testmpi.cpp -acc -Minfo=accel -DUSE_CUDA_AWARE_MPI
main:
     40, Generating enter data create(x[:1024])
     41, Generating present(x[:1])
         Generating Tesla code
         43, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     65, Generating present(x[:1])
         Generating Tesla code
         70, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
     75, Generating exit data delete(x[:1])
% mpirun -np 2 a.out
Rank 1 is using Device 1 of 4
Rank 0 is using Device 0 of 4
 the received data is: 0
 the received data is: 1
 the received data is: 2
 the received data is: 3
 the received data is: 4
 the received data is: 5
 the received data is: 6
 the received data is: 7
 the received data is: 8
 the received data is: 9

Hi Mat & Hisham,
I want to ask: does this code work as hybrid (MPI + OpenACC)?
I ask because I have Windows 10 with MPI, but as you know the latest PGI compiler does not support Windows. Also, will this mechanism work with any code in the same way, I mean, only putting the OpenACC code inside an MPI rank?

#include <mpi.h>
#include <iostream>
#include <iomanip>
#include <cstdlib>
using namespace std;
int main(int argc, char *argv[])
{
    int numprocs, proc_id;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
    long nsteps;
    double pi, step, sum = 0.0, x;
    nsteps = 100;
    step = (1.0) / nsteps;
    if (proc_id == 0)
    {
        #pragma acc parallel loop
        for (long i = 0; i < nsteps; ++i)
        {
            x = (i + 0.5) * step;
            #pragma acc atomic update
            sum = sum + 1.0 / (1.0 + x * x);
        }
        pi = 4.0 * step * sum;
        cout << std::fixed;
        cout << "pi is " << std::setprecision(30) << pi << "\n";
    }
    MPI_Finalize();
}

Well, the code will “work” in that it will probably run, but given only one rank is computing the sum, the other ranks would print out wrong answers. So the code really isn’t correct.

I mean, only putting the OpenACC code inside an MPI rank?

In general, yes. MPI and OpenACC are separate and distinct parallel models, where MPI handles distributed parallelism and OpenACC handles either node-level or accelerator offload parallelism.
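
For example, to make that pi code truly hybrid you’d have every rank compute a partial sum over its own share of the steps and then combine them with an MPI reduction. A rough sketch (variable names are illustrative, the device assignment boilerplate is omitted, and a reduction clause is used instead of atomics):

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int numprocs, proc_id;
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);

    const long nsteps = 100;
    const double step = 1.0 / nsteps;
    double local_sum = 0.0, sum = 0.0;

    // Each rank integrates a strided subset of the steps and offloads
    // that subset with OpenACC
    #pragma acc parallel loop reduction(+:local_sum)
    for (long i = proc_id; i < nsteps; i += numprocs) {
        double x = (i + 0.5) * step;
        local_sum += 1.0 / (1.0 + x * x);
    }

    // Combine the partial sums on rank 0, which prints the result once
    MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (proc_id == 0)
        printf("pi is %.15f\n", 4.0 * step * sum);

    MPI_Finalize();
    return 0;
}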

I personally haven’t used MPI on Windows for probably 10 years, so I don’t know the current state, but I can’t think of any reason why the code would behave differently. I have no idea whether MSMPI supports CUDA Aware MPI.

Thank you Mat for your answer.
I already ran this code using MPI and it works fine, and the only rank I used was zero, but I am asking what happens if I put OpenACC directives under rank 0?

Then only rank zero will use OpenACC to offload the loop to the device. Of course, since all MPI processes use the same binary, the OpenACC code will still be there; it just won’t run on the other ranks due to the if statement.