Correct environment variables for MPI + OpenMP + OpenACC hybrid code

Hi,

I have a hybrid code with MPI + OpenMP + OpenACC using -ta=multicore.
When I run the code I encounter extremely low performance. The code does not contain nested OpenMP and OpenACC regions.
When I replace the OpenACC sections with OpenMP directives and remove -ta=multicore, I get the desired performance.
I think it is related to thread affinity, but I’m unable to figure it out. Do I need to set additional flags/environment variables?

Thank you for your help

I compile the code with the following flags:

FFLAGS  = -fast -mcmodel=medium -mp -tp=skylake -m64 -cpp -Mmkl -acc -Minfo=acc  -ta=multicore

I run this on an Intel Skylake CPU and set the following environment variables:

export ACC_NUM_CORES=20
export OMP_NUM_THREADS=40
mpirun  -n 1 -x UCX_MEMTYPE_CACHE=n ./prog

Hi Peter,

I’ll need to do a bit of research on this one. I’ve not tried mixing OpenMP and OpenACC targeting multicore myself, and I’m not entirely sure of the interaction between the two runtimes.

Also, I see you are using MKL, which could also be using OpenMP. Are you calling MKL from any OpenACC regions?
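As a quick check while you test, it may also be worth pinning MKL’s own threading so it doesn’t add to the oversubscription (just a suggestion on my part, not something your flags require):

export MKL_NUM_THREADS=1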

If at all possible, a reproducing example would be very welcome so that I’m able to replicate and then investigate the issue here.

Thanks,
Mat

Hi Mat,

Thank you for your answer. I experimented a little and found that if I pass --bind-to socket to mpirun, it works correctly.
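For example, adapting the launch line from my first post (the binding option may be named differently for other MPI launchers):

mpirun -n 1 --bind-to socket -x UCX_MEMTYPE_CACHE=n ./prog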

No, MKL is not called from OpenACC regions.

I have another question regarding ACC_NUM_CORES and OMP_NUM_THREADS. What is the relation between these two variables?
When users run a hybrid code, do they have to set ACC_NUM_CORES to the number of cores and OMP_NUM_THREADS to the number of threads? When ACC_NUM_CORES is not set, what default value is used?

Thank you for your help.

When users run a hybrid code, do they have to set ACC_NUM_CORES to the number of cores and OMP_NUM_THREADS to the number of threads?

Correct. They are independent.

When ACC_NUM_CORES is not set, what default value is used?

The runtime will default to using all the physical cores on the system, while OpenMP defaults to the total number of logical cores (including hyper-threads).

Here’s a simple example to illustrate. I’m running on a two-socket Skylake system with 20 physical cores per socket and 2 hyper-threads per core.

% cat acc_mp.c
#include <stdio.h>
#include <omp.h>
#define N 1000
int main() {
  int v[N];
  /* OpenACC region; with -ta=multicore this runs as threads on the host CPU */
#pragma acc parallel loop
  for(int i = 0; i < N; ++i) {
    v[i] = i;
    if(i == 0) {
      printf("ACC: #threads: %d\n", omp_get_num_threads());
    }
  }
  /* Plain OpenMP region for comparison */
#pragma omp parallel
  {
#pragma omp single
    {
      printf("OMP #threads: %d\n", omp_get_num_threads());
    }
  }
}
% pgcc -mp -acc -ta=multicore acc_mp.c
% echo $OMP_NUM_THREADS
OMP_NUM_THREADS: Undefined variable.
% echo $ACC_NUM_CORES
ACC_NUM_CORES: Undefined variable.
% a.out
ACC: #threads: 40
OMP #threads: 80
% setenv ACC_NUM_CORES 10
% setenv OMP_NUM_THREADS 20
% a.out
ACC: #threads: 10
OMP #threads: 20

Hope this helps,
Mat

Thanks for your answer. I will check it.