I am using NVHPC 23.11 to compile a shared library that uses OpenACC for GPU offload. I have found that the -acc flag enables some sort of link-time optimization, because it results in a rather small relocation table (.rela.dyn) in the shared library. However, this breaks my application, which relies on the presence of a certain weak symbol in the library that disappears with -acc.
Removing the -acc flag from the link command line restores the missing entries from the relocation table, but of course breaks OpenACC support.
Is there a way to keep OpenACC support without removing entries from the relocation table?
Can you try compiling and linking with “-gpu=nordc”?
Relocatable Device Code (RDC) requires a device link step, including when creating shared objects. While I’m not sure, my guess is this linking is causing your issue.
The caveat is that RDC is needed when calling device routines found in separate files, as well as when accessing global device data. Also, the device code is stored in PTX form, which is JIT compiled upon first execution. This gets cached, but does cause a bit of overhead the first time the binary is run.
Do you happen to have a reproducing example that I can use to investigate?
If not, no worries. It would still help if you can describe the process you’re using (what flags you’re using to link), and what types of symbols need to be weak.
Note you can add the verbose flag (-v) to the link to see the link commands we use. You can ignore the “acclnk”, which is a wrapper to drive both the device and host link as well as merging the device binaries. Without RDC, only the “ld” line is run (as shown in the verbose output as part of the acclnk command).
Also, how are you defining the weak symbols? Are you using a linker script?
I’m wondering if there’s something in our ld command that’s inhibiting the creation of these symbols, or you need to add some additional linker options.
Now I don’t create shared objects very often myself, so not an expert here. But hopefully with more details we can figure out the best path forward.
Unfortunately I don’t have a minimal reproducer, because the actual setup is quite intricate and I cannot share the code.
For context, the weak symbol that I need is a singleton produced by the cereal library. Specifically, it is this object, instantiated as cereal::detail::StaticObject<cereal::detail::OutputBindingMap<cereal::BinaryOutputArchive> >::create()::t (aka _ZZN6cereal6detail12StaticObjectINS0_16OutputBindingMapINS_19BinaryOutputArchiveEEEE6createEvE1t). This singleton is declared multiple times across several shared libraries, resulting in several weak symbols. Cereal (and my application) requires that all the weak symbols be merged into one at runtime.
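To make the pattern concrete, here is a minimal sketch of the kind of singleton involved. It is not cereal's actual source: the sketch namespace, the BindingMap struct and the same_instance helper are hypothetical names chosen for illustration. The point is only that a function-local static inside a member function of a class template is emitted by every translation unit (and therefore every shared library) that instantiates it, typically as a weak symbol, and that the program relies on all copies resolving to one object.

```cpp
#include <cassert>

namespace sketch {  // hypothetical namespace; not cereal's actual source

// Hypothetical stand-in for the binding map that types get registered in.
struct BindingMap {
    int registered = 0;
};

template <class T>
class StaticObject {
public:
    static T& getInstance() { return create(); }

private:
    // The function-local static below is the object whose symbol ends in
    // "create()::t". Each library that instantiates StaticObject<T> carries
    // its own definition, and the dynamic linker is expected to pick one.
    static T& create() {
        static T t;
        return t;
    }
};

// Within a single program image both calls must yield the same object.
inline bool same_instance() {
    BindingMap& a = StaticObject<BindingMap>::getInstance();
    BindingMap& b = StaticObject<BindingMap>::getInstance();
    a.registered = 3;
    return &a == &b && b.registered == 3;
}

}  // namespace sketch
```

Calling sketch::same_instance() from a single executable should return true. The situation in this thread differs in that the two definitions live in separate shared libraries, so whether they end up as one object depends on how each library was linked.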
I found something interesting. The bug only appears if I compile one of the shared libraries (say libA.so) with gcc 11 and another one (libB.so) with nvc++, with libB.so linking to libA.so. I’m not sure if this is relevant, but gcc compiles the singleton as a “unique global symbol” (type u):
❯ nm libA.so | rg '_ZZN6cereal6detail12StaticObjectINS0_16OutputBindingMapINS_19BinaryOutputArchiveEEEE6createEvE1t'
000000000034e540 u _ZZN6cereal6detail12StaticObjectINS0_16OutputBindingMapINS_19BinaryOutputArchiveEEEE6createEvE1t
I will try to add the verbose flag to the link command to get the actual ld command line.
If I am reading this correctly, libA.so appears after -no-as-needed, which seems safe. I can’t see anything particularly suspicious here, but I may be missing something.
Perhaps the lack of an entry in the relocation table is a red herring? I’m not 100% sure I have nailed down the root cause. Let me restate the problem:
libA.so is compiled with gcc 11;
libB.so is compiled with nvc++ 23.11;
libB.so links to libA.so;
both libraries contain the same singleton object, defined (inline) in a header file;
I expect the dynamic linker to notice that the weak symbols in libA.so and libB.so can be merged, but it doesn’t. This may be because libB.so lacks a relocation table entry for the singleton object, but it may also be a red herring.
I’ve been trying to reproduce the issue here with some simple codes, but haven’t had any luck. Then again, I don’t fully understand the issue, so not unexpected.
You did say that this works if you don’t compile/link with OpenACC (i.e. the “-acc”), so I’ve been looking at what’s different between the two.
One difference is that we include the linker script, “nvhpcloc.ld”, when linking with -acc. All this script is doing is setting the versioning for the entry points to the location of the embedded CUDA binaries/PTX code. This should just add the version to those two symbols, so I wouldn’t think it would matter. But I don’t know linker scripts that well, so maybe?
Other than that, the only other difference to the link line is the additional runtime libraries.
Are you able to build libA with nvc++? If so, I’d be curious what, if anything, changes.
We’re object compatible with g++ and follow their name mangling, but maybe they link shared objects differently.
Really, the only remotely suspicious thing is --dynamic-list=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/lib/nvhpc.syms, but this file only contains a couple of symbols. I will try to run the ld command line by hand and remove the options incrementally. Does acclnk do anything else, besides orchestrating linking on host and device?
Compiling libA.so with nvc++ fixes the bug. I currently use this as a workaround, but since libA.so is largish and nvc++ is much slower than gcc for me, I (and my CI) would love to make this work!
Against all odds, removing --dynamic-list=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/lib/nvhpc.syms seems to fix the bug! What is going on?
The nvhpc.syms file only contains two symbols: acc_get_device_type and ompt_start_tool. Removing either one does not fix my bug. Also, if I replace the symbols with foo and bar, the bug is still present! If I remove the whole --dynamic-list option, then the bug is fixed. I have no idea what is going on.
I looked into --dynamic-list a bit. If I understand correctly, when this option is given while creating a shared library, references to global symbols that are not in the list get bound to the definition inside the library itself, so they no longer need an entry in the relocation table. So it makes sense that passing a dynamic list with foo and bar still triggers the bug, because the singleton is not in the list either way.
@MatColgrove is there any way to remove this option from the ld command line? I don’t understand why it is necessary, but I’m sure there is a good reason.
Let me talk with engineering to see if I can get more details. Though in the meantime, you can try commenting out the “NEEDNVHPCLDSYMS=1” in the compiler’s OpenACC config file, “<install_dir>/<arch_dir>/<release>/compilers/bin/rcfiles/acc1rc”
i.e. change:
# Support for exposing `acc_get_device_type` in statically linked application, needed by LIBCUPTI
set(NEEDNVHPCLDSYMS=1)
to
# Support for exposing `acc_get_device_type` in statically linked application, needed by LIBCUPTI
# set(NEEDNVHPCLDSYMS=1)
Given the note, it looks like this was added in order for the device profiling library to work properly, but I’ll need to confirm that. If this is the only reason, then you may not be able to get a device profile, but at least you can get it to link the way you need.