We are working on adding support for OpenMP target offload to a code that currently supports OpenACC offload to GPUs.
We have found that some code which works as expected when compiled with OpenACC no longer executes correctly when compiled with -mp=gpu, even if we do not replace any OpenACC directives with OpenMP ones.
The example that triggers the problem uses the Eigen linear algebra library. It works with OpenACC; note that we are using a fork of Eigen with improved GPU support.
#include "Eigen/Core"
int main() {
#if defined(_OPENMP) && !defined(_OPENACC)
#pragma omp target teams distribute parallel for simd
#else
#pragma acc parallel loop
#endif
  for(int i = 0; i < 1; ++i) {
    Eigen::Matrix<double, 1, 1> F;
    F.norm();
  }
  return 0;
}
If we compile this code in four different ways, as shown in this script:
#!/bin/bash
CFLAGS="-Ieigen -DEIGEN_DONT_VECTORIZE=1"
LDFLAGS="-cuda"
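# Each configuration pairs compiler flags with the exit status we expect (0 = runs cleanly, 1 = fails at runtime).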
FLAGS_acc="-acc"
RESULT_acc=0
FLAGS_acc_omp_host="-acc -mp"
RESULT_acc_omp_host=0
FLAGS_acc_omp_dev="-acc -mp=gpu"
RESULT_acc_omp_dev=1
FLAGS_omp_dev="-mp=gpu"
RESULT_omp_dev=1
set -x
git clone --branch v3.5-alpha.1 git@github.com:BlueBrain/eigen.git
for config in acc acc_omp_host acc_omp_dev omp_dev
do
    flags_name="FLAGS_${config}"
    result_name="RESULT_${config}"
    expected_result=${!result_name}
    nvc++ ${CFLAGS} ${!flags_name} -c test.cpp -o test.${config}.o
    nvc++ ${LDFLAGS} ${!flags_name} -o ${config} test.${config}.o
    ./${config}
    result=$?
    if [[ ${result} != ${expected_result} ]]; then
        echo "Unexpected result: ${result}"
    fi
done
We see that the two configurations including -mp=gpu produce executables that fail at runtime:
Failing in Thread:1
call to cudaGetSymbolAddress returned error 13: Other
We were surprised to see -mp=gpu break working OpenACC code like this. Do you have any idea what could be going on?
The test system has NVHPC/21.9 and CUDA 11.4 installed, and contains V100 GPUs.
Short answer: add “-cuda” to your compilation as well as the link.
Longer answer:
Keep in mind that while OpenACC and OpenMP do similar things from a user perspective, the underlying implementations are very different. OpenMP creates outlined offload regions that are passed to the OpenMP runtime library, while OpenACC inlines regions and so has more up-front information about them. Also, our OpenACC support is very mature while our OpenMP offload support is very new.
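To illustrate what “outlining” means here, a conceptual, host-only sketch (the function and the launch comment below are illustrative stand-ins, not NVHPC's actual lowering or runtime ABI):

#include <cstddef>

// Roughly what an OpenMP compiler generates from a target region: the
// loop body is moved into a separate function that the runtime launches
// by symbol, so every global the region touches must have a resolvable
// device-side counterpart.
static void outlined_target_region(double a, double* x, double* y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    y[i] += a * x[i];   // body of the original pragma region
}

int main() {
  double x[4] = {1, 2, 3, 4}, y[4] = {};
  // In reality the OpenMP runtime would launch this on the device; we
  // call it directly just to keep the sketch self-contained.
  outlined_target_region(2.0, x, y, 4);
  return 0;
}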
I don’t have access to the alpha version of “BlueBrain/eigen.git” but was able to recreate the error using “libeigen/eigen.git” so hopefully recreated the correct thing.
As best I can tell, the problem is that some global device symbol (it looks like a “this” pointer) from the Eigen library is not being found. Adding “-cuda” enables CUDA support in nvc++ and hence, I assume, exposes the “__device__” attributes in the Eigen library so that the symbol can be resolved by the OpenMP runtime library.
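For context, Eigen only emits its CUDA attributes when it detects a CUDA compilation, which is what “-cuda” turns on in nvc++. Paraphrased (a simplified sketch, not Eigen's exact macro definition):

// Simplified sketch of the gating in Eigen's Macros.h:
#if defined(__CUDACC__)
  #define EIGEN_DEVICE_FUNC __host__ __device__
#else
  #define EIGEN_DEVICE_FUNC
#endif

// Eigen then marks its functions with this macro, e.g.
//   EIGEN_DEVICE_FUNC inline RealScalar norm() const;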
% nvc++ -Ieigen -DEIGEN_DONT_VECTORIZE=1 test.cpp -mp=gpu -V21.9; a.out
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found
% nvc++ -Ieigen -DEIGEN_DONT_VECTORIZE=1 test.cpp -mp=gpu -V21.9 -cuda ; a.out
%
OK. Adding -cuda produces the following compilation error:
+ nvc++ -Ieigen -DEIGEN_DONT_VECTORIZE=1 -mp=gpu -cuda -c test.cpp -o test.omp_dev.o
"eigen/Eigen/src/Core/products/GeneralBlockPanelKernel.h", line 121: error: static variables are not supported in device function "Eigen::internal::manage_caching_sizes"
static CacheSizes m_cacheSizes;
but from the error it’s clear what the issue is (in Eigen). Blindly removing the static variable makes compilation successful, and I am able to run the binary without the “cudaGetSymbolAddress returned” error message. Tomorrow we will try the actual application and see what we get.
Thanks, I was able to clone the repo using this link.
The problem here is that a global variable accessed directly within a device function needs a corresponding device global variable. Although m_cacheSizes is declared within the function, “static” gives it global storage so that it persists between calls.
To create the global device variable in OpenMP, you would enclose the variable in a “declare target” region. However, this can’t be done within the function itself, so I suggest moving the declaration of m_cacheSizes above the function. Something like the following sketch (the CacheSizes definition is abbreviated here; see Eigen's header for the real one):
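#include <cstddef>

// Abbreviated stand-in for Eigen's CacheSizes struct.
struct CacheSizes {
  std::ptrdiff_t m_l1, m_l2, m_l3;
};

// Moved out of manage_caching_sizes() to namespace scope and wrapped in
// a "declare target" region so that a device-side copy of the symbol
// exists for the OpenMP runtime to resolve.
#pragma omp declare target
static CacheSizes m_cacheSizes;
#pragma omp end declare target

inline void manage_caching_sizes(/* ...same parameters as before... */)
{
  // body unchanged: it keeps reading and writing m_cacheSizes exactly
  // as it did when the variable was a function-local static
}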
Perfect! That allowed me to get the above reproducer/test running. I quickly switched back to our actual application and tried to compile with the above changes (i.e. the -cuda flag and the change in the Eigen header), but got the following error at compile time:
I tried with -O1 / -O2 and got the same error. This error doesn’t appear if -cuda is not used.
We will need some time to produce a standalone reproducer. Do you have any suggestions in the meantime?
(Just to mention: we are working on the OpenACC-to-OpenMP migration as part of the NERSC GPU Hackathon, which starts on 2nd December. I assume we will be working with some NVIDIA colleagues during the hackathon, in case that helps with a faster feedback cycle or makes it possible to look at the issue together.)
Looks like a code generation issue with the backend compiler: some label is getting used without being declared. Unfortunately, we’ll need a reproducer in order to investigate. In the meantime, you might try removing “-g” or going to a higher optimization level like -O2 or -O3 to see if the problem gets optimized away.
I had a conflict so wasn’t able to mentor at the NERSC hackathon, but Brent from my team (NV HPC) will be there, though he tends to avoid C++.