OpenMP offload w/ CUDA interop: undefined reference to `__fatbinwrap__NV_MODULE_ID'

I would like to use OpenMP device offload and CUDA in the same program. The code is templated, so they need to compile together. I’ve got a simple makefile-based reproducer, about 100 lines of code, here: GitHub - burlen/cuda_plus_offload: test code to see if it is possible to compile both CUDA kernels and OpenMP device offload in the same program
Depending on preprocessor defines, the code can be compiled with CUDA only, OpenMP offload only, or both CUDA and OpenMP offload. I’ve been trying to use nvcc with the host compiler set to nvc++; both come from the 2023 HPC SDK. The CUDA-only and OpenMP-only cases individually work. However, with CUDA and OpenMP together, it compiles but fails to link.
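The heart of the reproducer looks roughly like this (a from-memory sketch, not the exact repo code; the kernel name, launcher, and the is_device_ptr clause are my paraphrase):

#include <cuda_runtime.h>
#include <cstddef>

#if defined(CUMP_USE_CUDA)
// CUDA path: a templated kernel plus a host-side launcher.
template <typename T>
__global__ void init_cuda_kernel(T *ptr, size_t n, T val)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        ptr[i] = val;
}

template <typename T>
void init_cuda(T *ptr, size_t n, const T &val)
{
    init_cuda_kernel<<<(n + 127) / 128, 128>>>(ptr, n, val);
    cudaDeviceSynchronize();
}
#endif

#if defined(CUMP_USE_OPENMP)
// OpenMP path: the same initialization as a target region. ptr is a
// device pointer allocated with cudaMalloc.
template <typename T>
void init_omp(T *ptr, size_t n, const T &val)
{
    T v = val; // copy; mapping a reference into the region is fragile
    #pragma omp target teams loop is_device_ptr(ptr)
    for (size_t i = 0; i < n; ++i)
        ptr[i] = v;
}
#endif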

nvcc -g -G --generate-code=arch=compute_75,code=[compute_75,sm_75]  -lcuda -lcudart -lcudadevrt -ccbin=`which nvc++` -DCUMP_USE_OPENMP -DCUMP_USE_CUDA -Xcompiler -g,-Mcuda,-mp=gpu,-gpu=cc75,-Minfo=mp,-lcuda,-lcudart,-lcudadevrt,-Mcuda,-pgf90libs -Xlinker -lcuda,-lcudart,-lcudadevrt -x cu main.cpp -o cump_both_nvhpc
void init_omp<float>(float*, unsigned long, float const&):
      1, include "stl_construct.h"
          33, #omp target teams loop
              33, Generating "nvkernel__Z8init_ompIfEvPT_mRKS0__F16399L33_2" GPU kernel
                  Generating NVIDIA GPU code
                35, Loop parallelized across teams, threads(128) /* blockIdx.x threadIdx.x */
              33, Generating Multicore code
                35, Loop parallelized across threads
/bin/ld: /tmp/tmpxft_00282bd1_00000000-13_cump_both_nvhpc_dlink.o: in function `__cudaRegisterLinkedBinary__NV_MODULE_ID':
/tmp/tmpxft_00282bd1_00000000-7_cump_both_nvhpc_dlink.reg.c:2: undefined reference to `__fatbinwrap__NV_MODULE_ID'
pgacclnk: child process exit status 1: /bin/ld
make: *** [Makefile.nvhpc:12: cump_both_nvhpc] Error 2

Can anyone help with the link options needed?

Hi bloring,

The issue here is that nvcc has relocatable device code (RDC) generation disabled by default, but nvc++ has it enabled by default. Hence, you need to disable it with nvc++ by adding the “-gpu=nordc” flag.
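For example, the flag rides along in the -Xcompiler list that nvcc forwards to nvc++ (lightly trimmed from your original command, and untested as written here):

nvcc -g -G --generate-code=arch=compute_75,code=[compute_75,sm_75] -ccbin=`which nvc++` -DCUMP_USE_OPENMP -DCUMP_USE_CUDA -Xcompiler -g,-Mcuda,-mp=gpu,-gpu=cc75,-gpu=nordc,-Minfo=mp -Xlinker -lcuda,-lcudart,-lcudadevrt -x cu main.cpp -o cump_both_nvhpc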

While this will get the program to link, you’ll get a runtime error. The problem is that nordc isn’t yet supported with OpenMP target offload (we need to add the hooks to the device initialization), so the code falls back to running on the host. It then segfaults because the device “ptr” is being accessed on the host.
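In other words, the failure mode looks like this (a simplified sketch, assuming the reproducer hands a cudaMalloc’d pointer to the target region):

#include <cuda_runtime.h>
#include <cstddef>

int main()
{
    size_t n = 32;
    float *ptr = nullptr;
    cudaMalloc(&ptr, n * sizeof(float)); // ptr holds a device address

    // With -gpu=nordc the OpenMP runtime has no way to register this
    // region's device code, so it silently falls back to the host ...
    #pragma omp target teams loop is_device_ptr(ptr)
    for (size_t i = 0; i < n; ++i)
        ptr[i] = 3.14f; // ... and the host dereferences a device pointer

    cudaFree(ptr);
    return 0;
}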

Hence you have two options: use OpenACC instead of OpenMP, which does support “nordc”, or use nvc++ to compile the CUDA C++ code. The caveat is that nvc++ doesn’t fully support CUDA, but it does work with this example.

% nvc++ -fast -cuda -mp=gpu -DCUMP_USE_OPENMP -DCUMP_USE_CUDA main.cpp -o cump_both_nvhpc
% ./cump_both_nvhpc
running w/ cuda
3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14,
running w/ openmp offload
3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14,

Hope this helps,
Mat

Mat, Many thanks!

What about going the other way, passing -rdc=true to nvcc? If I recall correctly, the real code of interest here uses this option (I should know, but CMake uses different terminology). I tried adding -rdc=true to the reproducer; the link succeeded, but then I got a segv before main (stack below).

Is the level of CUDA support in nvc++ documented somewhere? I’d just give it a try, but the real code uses CMake, and I anticipate much pain trying to get CMake to successfully use something other than nvcc on CUDA code. I’d rather have some idea that it’s viable before fighting CMake.

Adding -rdc=true gets the code to compile and link, but it crashes at runtime before main:

(cuda-gdb) start
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff58c8e7a in __memset_sse2_unaligned_erms () from /usr/lib/gcc/x86_64-redhat-linux/12//../../../../lib64/libc.so.6
(cuda-gdb) where
#0  0x00007ffff58c8e7a in __memset_sse2_unaligned_erms () from /usr/lib/gcc/x86_64-redhat-linux/12//../../../../lib64/libc.so.6
#1  0x00007ffff5a4f2ab in __pgi_uacc_smallmem (n=n@entry=6148914691236517240) at ../../src/smallmem.c:53
#2  0x00007ffff522135e in __pgi_uacc_cuda_register_fat_binary (pgi_cuda_loc=pgi_cuda_loc@entry=0x410280 <__PGI_CUDA_LOC>) at ../../src/cuda_init.c:641
#3  0x00007ffff52248b0 in __pgi_mcuda_register_module_functions (pgi_cuda_loc=0x410280 <__PGI_CUDA_LOC>) at ../../src/mcuda_register.c:157
#4  0x00007ffff52235eb in __pgi_uacc_cuda_load_main_module (global_pgi_cuda_loc=<optimized out>, global_pgi_cuda_cap=<optimized out>) at ../../src/cuda_init.c:1944
#5  0x0000000000403478 in __pgi_uacc_set_shared ()
#6  0x0000000000403025 in _init ()
#7  0x00007fffffffd1e0 in ?? ()
#8  0x00007ffff584a5f0 in __libc_start_main_impl () from /usr/lib/gcc/x86_64-redhat-linux/12//../../../../lib64/libc.so.6
#9  0x0000000000403455 in _start ()

Thanks again for quick help
Burlen

Hi Burlen,

Unfortunately, using “-rdc=true” won’t work either. The problem here is that nvc++ needs to create code to register the OpenMP kernels, which doesn’t happen when nvcc is used as the driver.

My error is a bit different from yours, but the segv is in the call that registers the modules. Here’s the error I see:

% ./cump_both_nvhpc
running w/ cuda
 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14,
 running w/ openmp offload
 Module function not found, error 500
 Accelerator Fatal Error: Failed to find device function 'nvkernel__Z8init_ompIfEvPT_mRKS0__F19132L33_2'! File was compiled with: -gpu=cc70
Rebuild this file with -gpu=cc70 to use NVIDIA Tesla GPU 0
Rebuild this file with -gpu=cc70 to use NVIDIA Tesla GPU 1
Rebuild this file with -gpu=cc70 to use NVIDIA Tesla GPU 2
Rebuild this file with -gpu=cc70 to use NVIDIA Tesla GPU 3
File: /local/home/mcolgrove/cuda_plus_offload/main.cpp
Function: _Z8init_ompIfEvPT_mRKS0_:33
Line: 33

Is the level of CUDA support in nvc++ documented somewhere?

We implemented enough to be able to compile the Thrust library, which our C++ STDPAR implementation is built on, but we don’t have a comprehensive list of what exactly is or isn’t supported.

The biggest item is that nvc++ can’t support “__CUDA_ARCH__”, since nvc++ is a single-pass compiler. Bryce Lelbach gives a good presentation on this, starting around the 15-minute mark: https://www.youtube.com/watch?v=KhZvrF_w1ak&t=1625s
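For reference, the usual idiom looks like this (a generic example, not from the reproducer). nvcc compiles the translation unit once for the host and once per device architecture, so the macro can select different code in each pass; a compiler that parses the file only once can’t reproduce that:

#include <cmath>

// nvcc sees this function twice: once in the host pass (where
// __CUDA_ARCH__ is undefined) and once per device pass (where it is
// defined to, e.g., 750). nvc++ parses the file only once.
__host__ __device__ float axpy(float a, float x, float y)
{
#ifdef __CUDA_ARCH__
    return __fmaf_rn(a, x, y); // device pass: fused multiply-add intrinsic
#else
    return std::fma(a, x, y);  // host pass: standard library fallback
#endif
}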

Mat, Thank you so much. This is super helpful!

Burlen

The nvc++ compiler worked when using both CUDA and OpenMP in the same translation unit.

This required refactoring our CMake code: when the nvhpc compiler is detected, we set our internal CUDA preprocessor defines (which enable the CUDA-related code paths in our sources) ourselves, keep CMake from enabling its “CUDA language” features, and instead pass -cuda -mp=gpu in the C++ compiler flags.
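The gist of the CMake change is something like this (a hedged sketch, not our actual build files; the reproducer’s define names stand in for our internal ones):

if (CMAKE_CXX_COMPILER_ID STREQUAL "NVHPC")
    # don't enable_language(CUDA); compile the CUDA sources as plain C++
    add_compile_definitions(CUMP_USE_CUDA CUMP_USE_OPENMP)
    string(APPEND CMAKE_CXX_FLAGS " -cuda -mp=gpu")
endif()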

Having the ability to use both CUDA and OpenMP in the same translation unit is an essential feature for my code. I’m so glad this is supported.