Creating a shared library that utilises OpenMP offloading NVHPC 22.5

Dear all,

As described in "Creating a shared library that utilises OpenMP offloading", there was a bug in NVHPC that prevented the use of offloading in shared libraries. The last post from @MatColgrove said that this should now work with NVHPC 22.5 and asked for feedback if I still see errors.
Unfortunately, my simple test did not work:

$ nvc++ --version

nvc++ 22.5-0 64-bit target on x86-64 Linux -tp zen2
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
$ nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
$ nvc++ -std=c++17 test.cpp -o test -L${PWD} -ltest_compute
$ LD_LIBRARY_PATH=${PWD}:$LD_LIBRARY_PATH OMP_TARGET_OFFLOAD=MANDATORY ./test
Fatal error: Could not run target region on device 0, execution terminated.
Aborted

Moreover, if I add -mp=gpu to the compilation of test.cpp, which used to work before, I now get a weird error:

$ nvc++ -std=c++17 -mp=gpu test.cpp -o test -L${PWD} -ltest_compute
$ LD_LIBRARY_PATH=${PWD}:$LD_LIBRARY_PATH OMP_TARGET_OFFLOAD=MANDATORY ./test
Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)
Aborted

I am unsure whether this is an error at our site (we are running Driver Version: 470.57.02, CUDA Version: 11.4, A100 SXM) or whether this should work. Please find my source code attached.

Best,

Andreas

test_compute.hpp (47 Bytes)
test_compute.cpp (413 Bytes)
test.cpp (686 Bytes)

Hi Andreas,

It’s an error in your code.

#pragma omp target data map(tofrom:F_x[n], X[n])

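This says to map only the single element at index ‘n’ of the “F_x” and “X” arrays, rather than the n elements that the loop accesses. Changing this to an array section that covers the whole range will fix the error.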

#pragma omp target data map(tofrom:F_x[:n], X[:n])
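For illustration, here is a minimal sketch of the same computation with the array sections written in the explicit `[lower-bound : length]` form. The function name `ecdf` and the local accumulator are assumptions made for this example and are not taken from the attached sources. Because this variant overwrites F_x instead of accumulating into it, the sketch maps X with `to` and F_x with `from`. A compiler that is invoked without OpenMP support ignores the pragma and runs the loop on the host.

```cpp
#include <cassert>

// Hypothetical helper (not from the attached files): for every k, store in
// F_x[k] the fraction of elements X[l] that are <= X[k].
void ecdf(const double* X, double* F_x, int n)
{
    assert(n > 0);

    // X[0:n] and F_x[0:n] are array sections: n elements starting at index 0.
    // Writing X[n] instead would map only the single element at index n.
    #pragma omp target teams distribute parallel for map(to: X[0:n]) map(from: F_x[0:n])
    for (int k = 0; k < n; k++)
    {
        double count = 0.0;
        for (int l = 0; l < n; l++)
        {
            count += (X[l] <= X[k]);
        }
        F_x[k] = count / n;
    }
}
```

The shorthand `X[:n]` used in the fix above is equivalent to `X[0:n]`, since an omitted lower bound defaults to 0.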

Hope this helps,
Mat

% cat test_compute.cpp

#include "test_compute.hpp"

void compute(double* X, double* F_x, int n)
{
#ifdef WORKS
    #pragma omp target data map(tofrom:F_x[:n], X[:n])
#else
// original failing version
    #pragma omp target data map(tofrom:F_x[n], X[n])
#endif
    {
        #pragma omp target teams distribute parallel for
        for (int k = 0; k < n; k++)
        {
            for (int l = 0; l < n; l++)
            {
                F_x[k] += (X[l] <= X[k]);
            }
            F_x[k] /= n;
        }
    } // end of the "#pragma omp target data" region
}
% nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
% nvc++ -mp=gpu -std=c++17 test.cpp -o test -L./ -ltest_compute
% echo $OMP_TARGET_OFFLOAD
MANDATORY
% ./test
Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)
Abort
% nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -DWORKS -o libtest_compute.so
% ./test

0.4,0.1,0.3,0.8,0.9,1,0.5,0.6,0.2,0.7

Hey,

Thanks a lot. I should have seen that ^^. That actually fixed the weird CUDA error.

However, the shared library part still does not work. That is, if you do not add -mp=gpu to the compilation of test.cpp, the execution still fails:

$ nvc++ -g -O3 -std=c++17 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
$ nvc++ -std=c++17 test.cpp -o test -L${PWD} -ltest_compute
$ LD_LIBRARY_PATH=${PWD}:$LD_LIBRARY_PATH OMP_TARGET_OFFLOAD=MANDATORY ./test
Fatal error: Could not run target region on device 0, execution terminated.
Aborted

As I am working on a python extension, I cannot add -mp=gpu to the primary executable (i.e. the python binary).

Best,

Andreas

Yes, that’s correct when using nvc++ to link. The problem is that without -mp=gpu, the binary’s initialization is set to not use the GPU, overriding the GPU initialization in the shared object.

If you link with g++ (and presumably with python), the shared object’s GPU initialization will kick in when the library is loaded.

% nvc++ -V22.5 -g -O3 -fpic -mp=gpu -shared test_compute.cpp -o libtest_compute.so
% g++ test.cpp -L. -ltest_compute -o test
% ./test

0.9,0.3,0.2,0.5,0.1,1,0.4,0.7,0.8,0.6
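A host that loads the library at run time, as an interpreter does for an extension module, could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper `load_compute` is hypothetical, and the lookup by the plain name "compute" assumes the library exports the function with C linkage (`extern "C"`). The test_compute.cpp shown earlier uses C++ linkage, so there the mangled name would have to be looked up instead. The host itself is built without -mp=gpu and with a compiler other than nvc++, and on older glibc versions it may need -ldl at link time.

```cpp
#include <cassert>
#include <cstdio>
#include <dlfcn.h>

// Signature of compute() as declared in test_compute.cpp.
using compute_fn = void (*)(double*, double*, int);

// Hypothetical helper: load the shared library at run time and look up
// "compute". Returns nullptr if the library or the symbol cannot be found.
compute_fn load_compute(const char* library_path)
{
    assert(library_path != nullptr);

    // Loading the library is the point at which its own initialization runs.
    void* handle = dlopen(library_path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == nullptr)
    {
        const char* msg = dlerror();
        std::fprintf(stderr, "dlopen failed: %s\n", msg ? msg : "unknown error");
        return nullptr;
    }

    // Assumption: the symbol is exported unmangled, i.e. with extern "C".
    void* symbol = dlsym(handle, "compute");
    if (symbol == nullptr)
    {
        const char* msg = dlerror();
        std::fprintf(stderr, "dlsym failed: %s\n", msg ? msg : "unknown error");
        dlclose(handle);
        return nullptr;
    }
    return reinterpret_cast<compute_fn>(symbol);
}
```

A caller would check the returned pointer before use, for example `if (auto f = load_compute("./libtest_compute.so")) f(X, F_x, n);`.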

Apologies if this was not clear.

-Mat

Yes, that’s correct when using nvc++ to link. The problem is that without -mp=gpu, the binary’s initialization is set to not use the GPU, overriding the GPU initialization in the shared object.

That makes sense, though it was not obvious to me.

If you link with g++ (and presumably with python), the shared object’s GPU initialization will kick in when the library is loaded.

I just double-checked and it works well. Thanks a lot 😄.

Best,

Andreas