We are working on adding support for OpenMP target offload to a code that currently supports OpenACC offload to GPUs.
We have found that some code which works as expected when compiled with OpenACC no longer executes correctly when compiled with -mp=gpu, even if we do not replace any OpenACC directives with OpenMP ones.
The example that triggers the problem uses the Eigen linear algebra library. It works with OpenACC; note that we are using a fork of Eigen with improved GPU support.
#include "Eigen/Core"
int main() {
#if defined(_OPENMP) && !defined(_OPENACC)
#pragma omp target teams distribute parallel for simd
#else
#pragma acc parallel loop
#endif
  for(int i = 0; i < 1; ++i) {
    Eigen::Matrix<double, 1, 1> F;
    F.norm();
  }
  return 0;
}
If we compile this code in four different ways, as shown in this script:
#!/bin/bash
CFLAGS="-Ieigen -DEIGEN_DONT_VECTORIZE=1"
LDFLAGS="-cuda"
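# Each configuration pairs compiler flags with the exit status we expect (0 = runs cleanly, 1 = fails at runtime).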
FLAGS_acc="-acc"
RESULT_acc=0
FLAGS_acc_omp_host="-acc -mp"
RESULT_acc_omp_host=0
FLAGS_acc_omp_dev="-acc -mp=gpu"
RESULT_acc_omp_dev=1
FLAGS_omp_dev="-mp=gpu"
RESULT_omp_dev=1
set -x
git clone --branch v3.5-alpha.1 git@github.com:BlueBrain/eigen.git
for config in acc acc_omp_host acc_omp_dev omp_dev
do
    flags_name="FLAGS_${config}"
    result_name="RESULT_${config}"
    expected_result=${!result_name}
    nvc++ ${CFLAGS} ${!flags_name} -c test.cpp -o test.${config}.o
    nvc++ ${LDFLAGS} ${!flags_name} -o ${config} test.${config}.o
    ./${config}
    result=$?
    if [[ ${result} != ${expected_result} ]]; then
        echo "Unexpected result: ${result}"
    fi
done
We see that the two configurations including -mp=gpu produce executables that fail at runtime:
Failing in Thread:1
call to cudaGetSymbolAddress returned error 13: Other
We were surprised to see -mp=gpu break working OpenACC code like this. Do you have any idea what could be going on?
The test system has NVHPC/21.9 and CUDA 11.4 installed, and contains V100 GPUs.
Short answer: add “-cuda” to your compilation as well as the link.
Longer answer:
Keep in mind that while OpenACC and OpenMP do similar things from a user perspective, the underlying implementations are very different. OpenMP creates outlined offload regions that are passed to the OpenMP runtime library, while OpenACC inlines regions and so has more up-front information about them. Also, our OpenACC support is very mature while our OpenMP offload support is very new.
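To illustrate what “outlining” means here, a conceptual, host-only sketch (the function and the launch comment below are illustrative stand-ins, not NVHPC's actual lowering or runtime ABI):

#include <cstddef>

// Roughly what an OpenMP compiler generates from a target region: the
// loop body is moved into a separate function that the runtime launches
// by symbol, so every global the region touches must have a resolvable
// device-side counterpart.
static void outlined_target_region(double a, double* x, double* y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    y[i] += a * x[i];   // body of the original pragma region
}

int main() {
  double x[4] = {1, 2, 3, 4}, y[4] = {};
  // In reality the OpenMP runtime would launch this on the device; we
  // call it directly just to keep the sketch self-contained.
  outlined_target_region(2.0, x, y, 4);
  return 0;
}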
I don’t have access to the alpha version of “BlueBrain/eigen.git” but was able to recreate the error using “libeigen/eigen.git” so hopefully recreated the correct thing.
As best I can tell, the problem is that some global device symbol (it looks like a “this” pointer) from the Eigen library is not being found. Adding “-cuda” enables CUDA support in nvc++ and hence, I assume, exposes the “__device__” attributes in the Eigen library so that the symbol can be resolved by the OpenMP runtime library.
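For context, Eigen only emits its CUDA attributes when it detects a CUDA compilation, which is what “-cuda” turns on in nvc++. Paraphrased (a simplified sketch, not Eigen's exact macro definition):

// Simplified sketch of the gating in Eigen's Macros.h:
#if defined(__CUDACC__)
  #define EIGEN_DEVICE_FUNC __host__ __device__
#else
  #define EIGEN_DEVICE_FUNC
#endif

// Eigen then marks its functions with this macro, e.g.
//   EIGEN_DEVICE_FUNC inline RealScalar norm() const;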
% nvc++ -Ieigen -DEIGEN_DONT_VECTORIZE=1 test.cpp -mp=gpu -V21.9; a.out
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found
% nvc++ -Ieigen -DEIGEN_DONT_VECTORIZE=1 test.cpp -mp=gpu -V21.9 -cuda ; a.out
%
OK. Adding -cuda produces the following compilation error:
+ nvc++ -Ieigen -DEIGEN_DONT_VECTORIZE=1 -mp=gpu -cuda -c test.cpp -o test.omp_dev.o
"eigen/Eigen/src/Core/products/GeneralBlockPanelKernel.h", line 121: error: static variables are not supported in device function "Eigen::internal::manage_caching_sizes"
static CacheSizes m_cacheSizes;
but from the error it’s clear what the issue is (in Eigen). Blindly removing the static variable makes compilation successful, and I am able to run the binary without the “cudaGetSymbolAddress returned” error message. Tomorrow we will try the actual application and see what we get.
Thanks, I was able to clone the repo using this link.
The problem here is that a global variable accessed directly within a device function needs a corresponding device global variable. Although m_cacheSizes is declared within the function, “static” gives it global storage so that it persists between calls.
To create the global device variable in OpenMP, you would enclose the variable in a “declare target” region. However, this can’t be done within the function itself, so I suggest moving the declaration of m_cacheSizes above the function. Something like the following sketch (the CacheSizes definition is abbreviated here; see Eigen's header for the real one):
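#include <cstddef>

// Abbreviated stand-in for Eigen's CacheSizes struct.
struct CacheSizes {
  std::ptrdiff_t m_l1, m_l2, m_l3;
};

// Moved out of manage_caching_sizes() to namespace scope and wrapped in
// a "declare target" region so that a device-side copy of the symbol
// exists for the OpenMP runtime to resolve.
#pragma omp declare target
static CacheSizes m_cacheSizes;
#pragma omp end declare target

inline void manage_caching_sizes(/* ...same parameters as before... */)
{
  // body unchanged: it keeps reading and writing m_cacheSizes exactly
  // as it did when the variable was a function-local static
}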
Perfect! That allowed me to get the above reproducer/test running. I quickly switched back to our actual application and tried to compile with the above changes (i.e. the -cuda flag and the change in the Eigen header), but got the following error at compile time:
I tried with -O1 / -O2 and got the same error. This error doesn’t appear if -cuda is not used.
We will need some time to produce a standalone reproducer. Do you have any suggestions in the meantime?
(Just to mention: we are working on the OpenACC-to-OpenMP migration as part of the NERSC GPU Hackathon, which starts on 2nd December. I assume we will be working with some NVIDIA colleagues during the hackathon, in case that helps with a faster feedback cycle or makes it possible to look at the issue together.)
Looks like a code generation issue with the backend compiler: some label is getting used without being declared. Unfortunately, we’ll need a reproducer in order to investigate. In the meantime, you might try removing “-g” or going to a higher optimization level like -O2 or -O3 to see if the problem gets optimized away.
I had a conflict so wasn’t able to mentor at the NERSC hackathon, but Brent from my team (NV HPC) will be there, though he tends to avoid C++.