Creating a shared library that utilises OpenMP offloading NVHPC 22.5

andreas.gocht · June 14, 2022, 11:41am

Dear all,

As described in "Creating a shared library that utilises OpenMP offloading", there was a bug in NVHPC that prevented the use of offloading in shared libraries. The last post from @MatColgrove said that this should now work with NVHPC 22.5 and asked for feedback if I still see errors.
Unfortunately, my simple test did not work:

$ nvc++ --version

nvc++ 22.5-0 64-bit target on x86-64 Linux -tp zen2
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
$ nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
$ nvc++ -std=c++17 test.cpp -o test -L${PWD} -ltest_compute
$ LD_LIBRARY_PATH=${PWD}:$LD_LIBRARY_PATH OMP_TARGET_OFFLOAD=MANDATORY ./test
Fatal error: Could not run target region on device 0, execution terminated.
Aborted

Moreover, if I add -mp=gpu to the compilation of test.cpp, which used to work before, I now get a weird error:

$ nvc++ -std=c++17 -mp=gpu test.cpp -o test -L${PWD} -ltest_compute
$ LD_LIBRARY_PATH=${PWD}:$LD_LIBRARY_PATH OMP_TARGET_OFFLOAD=MANDATORY ./test
Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)
Aborted

I am unsure whether this is an error on our side (we are running Driver Version: 470.57.02, CUDA Version: 11.4, A100 SXM) or whether this should work. Please find my source code attached.

Best,

Andreas

test_compute.hpp (47 Bytes)
test_compute.cpp (413 Bytes)
test.cpp (686 Bytes)

MatColgrove · June 14, 2022, 5:15pm

Hi Andreas,

It’s an error in your code.

#pragma omp target data map(tofrom:F_x[n], X[n])

This maps only the single element at index ‘n’ of the “F_x” and “X” arrays, not the whole arrays. Changing it to an array section covering the range fixes the error.

#pragma omp target data map(tofrom:F_x[:n], X[:n])

Hope this helps,
Mat

% cat test_compute.cpp

#include "test_compute.hpp"

void compute(double* X, double* F_x, int n)
{
#ifdef WORKS
    #pragma omp target data map(tofrom:F_x[:n], X[:n])
#else
// original failing version
    #pragma omp target data map(tofrom:F_x[n], X[n])
#endif
    {
        #pragma omp target teams distribute parallel for
        for (int k = 0; k < n; k++)
        {
            for (int l = 0; l < n; l++)
            {
                F_x[k] += (X[l] <= X[k]);
            }
            F_x[k] /= n;
        }
    } // end of #pragma omp target data
}
% nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
% nvc++ -mp=gpu -std=c++17 test.cpp -o test -L./ -ltest_compute
% echo $OMP_TARGET_OFFLOAD
MANDATORY
% ./test
Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)
Abort
% nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -DWORKS -o libtest_compute.so
% ./test

0.4,0.1,0.3,0.8,0.9,1,0.5,0.6,0.2,0.7

andreas.gocht · June 15, 2022, 7:08am

Hey,

Thanks a lot. I should have seen that ^^. That actually fixed the weird CUDA error.

However, the shared library part still does not work. That is, if you do not add -mp=gpu to the compilation of test.cpp, the execution still fails:

$ nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
$ nvc++ -std=c++17 test.cpp -o test -L${PWD} -ltest_compute
$ LD_LIBRARY_PATH=${PWD}:$LD_LIBRARY_PATH OMP_TARGET_OFFLOAD=MANDATORY ./test
Fatal error: Could not run target region on device 0, execution terminated.
Aborted

As I am working on a Python extension, I cannot add -mp=gpu to the primary executable (i.e. the python binary).

Best,

Andreas

MatColgrove · June 15, 2022, 2:36pm

Yes, that’s correct when using nvc++ to link. The problem is that without -mp=gpu, the binary’s initialization is set to not use the GPU, overriding the GPU initialization in the shared object.

If using g++ to link (and presumably with python), the shared object’s GPU initialization will kick in when the library is loaded.

% nvc++ -V22.5 -g -O3 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
% g++ test.cpp -L. -ltest_compute -o test
% ./test

0.9,0.3,0.2,0.5,0.1,1,0.4,0.7,0.8,0.6

Apologies if this was not clear.

-Mat
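
A Python extension is loaded at runtime rather than linked into the interpreter, so the closest C++ analogue is a host program that opens the library with dlopen. The following is only a sketch under that assumption: the thread confirms the g++-linked case, not the dlopen case. The helper name load_compute is hypothetical. The symbol name has to be supplied by the caller, because compute() is a C++ function and is exported under a mangled name unless it is declared extern "C"; the exported name can be looked up with nm -D libtest_compute.so.

```cpp
#include <dlfcn.h>

// Signature of compute() as defined in test_compute.cpp.
using compute_fn = void (*)(double*, double*, int);

// Hypothetical helper: load a shared library at runtime and look up the
// compute entry point. Returns nullptr if the library or symbol is missing.
compute_fn load_compute(const char* library_path, const char* symbol_name)
{
    // RTLD_NOW resolves all symbols immediately, so failures show up here.
    void* handle = dlopen(library_path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == nullptr)
    {
        return nullptr;
    }

    void* symbol = dlsym(handle, symbol_name);
    if (symbol == nullptr)
    {
        dlclose(handle);
        return nullptr;
    }

    // The handle is intentionally kept open so the returned function
    // pointer stays valid for the lifetime of the process.
    return reinterpret_cast<compute_fn>(symbol);
}
```

A host built with g++ (without -mp=gpu) would call load_compute("./libtest_compute.so", name) with the exported symbol name and invoke the returned pointer if it is not null. On older glibc versions the program must also be linked with -ldl.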

andreas.gocht · June 23, 2022, 6:24am

> Yes, that’s correct when using nvc++ to link. The problem is that without -mp=gpu, the binary’s initialization is set to not use the GPU, overriding the GPU initialization in the shared object.

That makes sense, though it is not obvious to me.

> If using g++ to link (and presumably with python), the shared object’s GPU initialization will kick in when the library is loaded.

I just double-checked and it works well. Thank you very much 😄.

Best,

Andreas
