Creating a shared library that utilises OpenMP offloading

andreas.gocht · December 28, 2021, 9:54pm

Hey there,

I am trying to create a shared library utilising OpenMP offloading and nvc++. Unfortunately, I ran into some trouble using the library.

Let’s consider the following files:

test_compute.cpp // does some computation with OpenMP offloading
test.cpp // has int main() and uses the computation

Creating the library works fine:

nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so

Compiling test.cpp and linking it against the library also works:

nvc++ -std=c++17 test.cpp -o test -L${PWD} -ltest_compute

However, when I execute the program with OMP_TARGET_OFFLOAD=MANDATORY, I get the following error:

$ LD_LIBRARY_PATH=${PWD}:$LD_LIBRARY_PATH OMP_TARGET_OFFLOAD=MANDATORY ./test
Fatal error: Could not run target region on device 0, execution terminated.
Aborted

When I add -mp=gpu to the compilation and linking of test.cpp:

nvc++ -std=c++17 -mp=gpu test.cpp -o test -L${PWD} -ltest_compute

everything works fine:

$ LD_LIBRARY_PATH=${PWD}:$LD_LIBRARY_PATH OMP_TARGET_OFFLOAD=MANDATORY ./test
//some output

However, as I’d like to create a library that might be linked by any other program or compiler or even work as a Python module, specifying -mp=gpu is not a solution for me.

Is there any way to make the library “self-contained”, so that I do not need to add -mp=gpu when compiling and linking test.cpp?

If needed, I can also provide some test files.

I am using NVHPC 21.5:

$ nvc++ --version

nvc++ 21.5-0 LLVM 64-bit target on x86-64 Linux -tp zen
NVIDIA Compilers and Tools
Copyright (c) 2021, NVIDIA CORPORATION.  All rights reserved.

Best,

Andreas

MatColgrove · December 29, 2021, 8:23pm

Hi Andreas,

Unfortunately, I don’t believe we support this yet when using OpenMP target offload. I have, however, added an RFE (TPR #31121) and sent it to engineering to see what they can do.

Note that we do have this support for OpenACC, so you may consider using it instead, at least until support can be added for OpenMP as well.

Example:

% cat test_compute.h
void addone(double * Arr, int sze);

% cat test_compute.cpp
#include "test_compute.h"

void addone(double * Arr, int sze) {
#pragma omp target teams distribute parallel for map(tofrom:Arr[0:sze])
#pragma acc parallel loop copy(Arr[0:sze])
    for (int i=0; i<sze; ++i) {
        Arr[i]+=1.0;
    }
}

% cat test.cpp

#include <iostream>
#include <cstdlib>
#include "test_compute.h"

int main () {

    int sze = 1024;
    double * Arr = new double[sze];
    for (int i=0; i < sze; ++i) {
        Arr[i] = i;
    }

    addone(Arr,sze);

    for (int i=0; i < 10; ++i) {
        std::cout << i << ": " << Arr[i] << std::endl;
    }

}

% nvc++ -g -O3 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
% export OMP_TARGET_OFFLOAD=MANDATORY
% g++ test.cpp -L. -ltest_compute -o test; ./test
Fatal error: Could not run target region on device 0, execution terminated.
Abort
% nvc++ test.cpp -L. -ltest_compute -o test; ./test
Fatal error: Could not run target region on device 0, execution terminated.
Abort
% nvc++ test.cpp -L. -ltest_compute -o test -mp=gpu; ./test
0: 1
1: 2
2: 3
3: 4
4: 5
5: 6
6: 7
7: 8
8: 9
9: 10

// Ok if using OpenACC:

% nvc++ -g -O3 -fpic -acc=gpu -shared test_compute.cpp -o libtest_compute.so
% export NV_ACC_TIME=1
% g++ test.cpp -L. -ltest_compute -o test ; ./test
0: 1
1: 2
2: 3
3: 4
4: 5
5: 6
6: 7
7: 8
8: 9
9: 10

Accelerator Kernel Timing data
test_compute.cpp
  _Z6addonePdi NVIDIA devicenum=0
    time(us): 49
    3: compute region reached 1 time
        3: kernel launched 1 time
            grid: [8] block: [128]
            device time(us): total=5 max=5 min=5 avg=5
            elapsed time(us): total=360 max=360 min=360 avg=360
    3: data region reached 2 times
        3: data copyin transfers: 1
            device time(us): total=23 max=23 min=23 avg=23
        8: data copyout transfers: 1
            device time(us): total=21 max=21 min=21 avg=21
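
For the Python-module style of use mentioned in the question, the library would be loaded at run time rather than linked at build time. The following is only a minimal sketch of such a loader, assuming a POSIX system with dlopen and assuming the exported symbol keeps the mangled name _Z6addonePdi shown in the timing output above. The helper name load_addone is hypothetical. Whether the offloaded region then actually runs depends on the same compiler support discussed in this thread, and has not been verified here.

```cpp
#include <dlfcn.h>   // dlopen, dlsym, dlclose, dlerror (POSIX)
#include <cstdio>

// Signature of addone() as declared in test_compute.h.
using addone_fn = void (*)(double*, int);

// Hypothetical helper: open the shared library at `path` and look up addone().
// Returns nullptr (after printing the loader's message) if either step fails.
// On success the handle is deliberately left open so the pointer stays valid.
addone_fn load_addone(const char* path) {
    void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return nullptr;
    }
    // _Z6addonePdi is the mangled C++ name of addone(double*, int).
    void* sym = dlsym(handle, "_Z6addonePdi");
    if (!sym) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return nullptr;
    }
    return reinterpret_cast<addone_fn>(sym);
}
```

A caller would do something like `addone_fn f = load_addone("./libtest_compute.so"); if (f) f(Arr, sze);`. Declaring addone as extern "C" in the header would avoid depending on the mangled name. On older glibc versions the program may need to be linked with -ldl.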

andreas.gocht · January 3, 2022, 9:59am

Thanks for the fast reply. I’m not sure whether we are going to look into OpenACC. However, thanks for the hint that both sets of directives can be used side by side.

Best,

Andreas

system · March 21, 2022, 12:29pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

MatColgrove · May 25, 2022, 7:41pm

Hi Andreas,

Our team was able to add this support in the 22.5 release. I’ve confirmed that my simple tests work as expected, but please let us know if you see any issues when using it in your larger code.

-Mat
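
When trying this in a larger code, a small host-side check can help confirm that the offloaded addone() really updated the data. This is only a sketch under the assumption that a copy of the input is kept before the call. The function name first_mismatch and the tolerance are hypothetical choices, not part of the library.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first element where `after` differs from
// `before[i] + 1.0` by more than `tol`, or -1 if every element matches.
// A size mismatch between the two vectors is reported as index 0.
long first_mismatch(const std::vector<double>& before,
                    const std::vector<double>& after,
                    double tol = 1e-12) {
    if (before.size() != after.size()) return 0;
    for (std::size_t i = 0; i < before.size(); ++i) {
        if (std::fabs(after[i] - (before[i] + 1.0)) > tol) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Usage would be to copy the array into a std::vector before calling addone(), copy it into a second vector afterwards, and treat any return value other than -1 as a failure. If the target region did not run at all, the data is unchanged and the check reports index 0.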
