Correct environment variables for MPI + OpenMP + OpenACC hybrid code

Hi,

I have a hybrid code with MPI + OpenMP + OpenACC using -ta=multicore.
When I run the code I encounter extremely low performance. The code does not contain nested OpenMP and OpenACC regions.
When I replace the OpenACC sections with OpenMP directives and remove -ta=multicore, I get the desired performance.
I think it is related to thread affinity, but I’m unable to figure it out. Do I need to set additional flags/environment variables?

Thank you for your help

I compile the code with the following flags:

FFLAGS  = -fast -mcmodel=medium -mp -tp=skylake -m64 -cpp -Mmkl -acc -Minfo=acc  -ta=multicore

I run this on an Intel Skylake CPU and set the following environment variables:

export ACC_NUM_CORES=20
export OMP_NUM_THREADS=40
mpirun  -n 1 -x UCX_MEMTYPE_CACHE=n ./prog

Hi Peter,

I’ll need to do a bit of research on this one. I’ve not tried mixing OpenMP and OpenACC targeting multicore myself, and I’m not entirely sure of the interaction between the two runtimes.

Also, I see you are using MKL, which could also be using OpenMP. Are you calling MKL from any OpenACC regions?
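As a quick check while you test, it may also be worth pinning MKL’s own threading so it doesn’t add to the oversubscription (just a suggestion on my part, not something your flags require):

export MKL_NUM_THREADS=1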

If at all possible, a reproducing example would be very welcome so that I’m able to replicate and then investigate the issue here.

Thanks,
Mat

Hi Mat,

Thank you for your answer. I experimented a little and found that if I pass --bind-to socket to mpirun, it works correctly.
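For example, adapting the launch line from my first post (the binding option may be named differently for other MPI launchers):

mpirun -n 1 --bind-to socket -x UCX_MEMTYPE_CACHE=n ./prog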

No, MKL is not called from OpenACC regions.

I have another question regarding ACC_NUM_CORES and OMP_NUM_THREADS. What is the relation between these two variables?
When users run a hybrid code, do they have to set ACC_NUM_CORES to the number of cores and OMP_NUM_THREADS to the number of threads? When ACC_NUM_CORES is not set, what default value is used?

Thank you for your help.

When users run a hybrid code, do they have to set ACC_NUM_CORES to the number of cores and OMP_NUM_THREADS to the number of threads?

Correct. They are independent.

When ACC_NUM_CORES is not set, what default value is used?

The runtime will default to using all the physical cores on the system, while OpenMP defaults to the total number of logical cores (including hyper-threads).

Here’s a simple example to illustrate. I’m running on a two-socket Skylake system with 20 physical cores per socket and 2 hyper-threads per core.

% cat acc_mp.c
#include <stdio.h>
#include <omp.h>
#define N 1000
int main() {
  int v[N];
  /* OpenACC region; with -ta=multicore this runs as threads on the host CPU */
#pragma acc parallel loop
  for(int i = 0; i < N; ++i) {
    v[i] = i;
    if(i == 0) {
      printf("ACC: #threads: %d\n", omp_get_num_threads());
    }
  }
  /* Plain OpenMP region for comparison */
#pragma omp parallel
  {
#pragma omp single
    {
      printf("OMP #threads: %d\n", omp_get_num_threads());
    }
  }
}
% pgcc -mp -acc -ta=multicore acc_mp.c
% echo $OMP_NUM_THREADS
OMP_NUM_THREADS: Undefined variable.
% echo $ACC_NUM_CORES
ACC_NUM_CORES: Undefined variable.
% a.out
ACC: #threads: 40
OMP #threads: 80
% setenv ACC_NUM_CORES 10
% setenv OMP_NUM_THREADS 20
% a.out
ACC: #threads: 10
OMP #threads: 20

Hope this helps,
Mat

Thanks for your answer. I will check it.