nvfortran+OpenACC: Device code link error

Hi everyone,

We have a device code linker issue that happens in a rather big test application: GitHub - ecmwf-ifs/dwarf-p-cloudsc, a standalone mini-app of the ECMWF cloud microphysics parameterization (it is related to microphysics from ECMWF climate models).

We couldn't find a test case smaller than this, but we hope the code is easy enough to build and analyze (the set of scripts to build and run it is attached).

cloudsc-build.public.tar.gz (28.6 KB)

The problem is: when we choose to build the application by linking the internal library code dynamically, we face this error:

[100%] Linking Fortran executable ../../../bin/dwarf-cloudsc-gpu-scc
nvlink error   : Undefined reference to '_yomcst_21' in 'CMakeFiles/dwarf-cloudsc-gpu-scc.dir/cloudsc_gpu_scc_mod.F90.o'
nvlink error   : Undefined reference to '_yoethf_21' in 'CMakeFiles/dwarf-cloudsc-gpu-scc.dir/cloudsc_gpu_scc_mod.F90.o'
pgacclnk: child process exit status 2: /gpfs/apps/MN5/ACC/NVIDIA-HPC-SDK/25.7/Linux_x86_64/25.7/compilers/bin/tools/nvdd

The error disappears if we link the code statically. There is no such error for the OpenMP version of the same code, nor for the CUDA version. I wonder if this can be reproduced and addressed in some way.

To build things, one should create a dnb-xxx.yaml similar to the dnb-mn5-acc.yaml that is given there as an example. At runtime, a wrapper script similar to scripts/generic/mn5-acc/runner-script.sh is assumed; it handles various affinity aspects. For execution, it is handy to use the psubmit.sh script system, which has a generic interface for SLURM and MPI and does convenient pre/post-processing of the data required/produced by the executable.

The linker error will appear if one changes DNB_CLOUDSC_WITH_DYNAMIC_LINK=FALSE to DNB_CLOUDSC_WITH_DYNAMIC_LINK=TRUE in overrides.yaml.

The issue seems to be a long-standing one, since we have a record of an old forum topic: https://www.pgroup.com/userforum/viewtopic.php?t=7296

This case also makes us suspect that device code shared library linking has more issues, particularly in the OpenACC case. In our bigger code, which can't be shared due to licensing limitations, we see memory corruption whenever a library contains any OpenACC code (even the CPU version of the offloading, and even without any actual OpenACC kernels, just being built with OpenACC options). The same code is fine with OpenMP target offloading build options, or when linked statically.

Hi Alexey,

This error typically means that device versions of the module variables aren't being created. Device copies are created by adding the "acc declare" directive, but I don't see these directives in the YOMCST or YOETHF modules.

I didn't check which of the following variables are used in cloudsc_gpu_scc_mod.F90's device routines, but here are the variables included in the USE statements:

   USE YOMCST, ONLY: RG, RD, RCPD, RETV, RLVTT, RLSTT, RLMLT, RTT, RV
   USE YOETHF, ONLY: R2ES, R3LES, R3IES, R4LES, R4IES, R5LES, R5IES, R5ALVCP, R5ALSCP, RALVDCP, RALSDCP, RALFDCP, RTWAT, RTICE, &
    & RTICECU, RTWAT_RTICE_R, RTWAT_RTICECU_R, RKOOP1, RKOOP2

Besides adding "acc declare create(<vars>)", you'll also want to add an "acc update" directive where these variables are assigned on the host, to ensure the values are correct on the device.
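As a minimal sketch of this pattern (the module and variable names follow the USE list above, but the initialization subroutine and the assigned values here are invented for illustration):

```fortran
MODULE YOMCST
  IMPLICIT NONE
  REAL(KIND=8) :: RG, RD, RCPD
  ! Create device copies of the module variables
  !$acc declare create(RG, RD, RCPD)
CONTAINS
  SUBROUTINE YOMCST_SETUP()       ! hypothetical init routine
    RG   = 9.80665_8              ! assign on the host...
    RD   = 287.06_8
    RCPD = 1004.7_8
    !$acc update device(RG, RD, RCPD)  ! ...then sync values to the device
  END SUBROUTINE YOMCST_SETUP
END MODULE YOMCST
```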

If that doesn't work, can you please give more details on the build process? I tried to follow the instructions in the README.md, but the configuration seemed to be system-specific.

I did try using the following, but even after editing the env.sh file to better match my environment, the cmake commands failed.

./cloudsc-bundle create
./cloudsc-bundle build --clean --with-mpi --with-gpu --arch=./arch/ecmwf/hpc2020/nvhpc/22.1

-Mat

Hi Mat,

In the corrected archive, please find a script that is not machine-specific but is meant to be generic. There is a chance that it requires no correction; you just: 1) load the necessary modules for the NVIDIA HPC SDK and cmake; 2) run "./dnb.sh :du"; 3) run "./dnb.sh"; 4) change the linking parameter in overrides.yaml to "TRUE", rerun "./dnb.sh", and get the linker error described above.
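That is, assuming the module names on your system are along these lines:

```sh
module load nvhpc cmake   # whatever provides the NVIDIA HPC SDK and cmake
./dnb.sh :du              # download and unpack
./dnb.sh                  # build and run: OK with static linking
# set DNB_CLOUDSC_WITH_DYNAMIC_LINK=TRUE in overrides.yaml, then:
./dnb.sh                  # rebuild: nvlink "Undefined reference" error
```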

cloudsc-build.public.v2.tar.gz (29.0 KB)

I'll try to add the "acc declare create" directives and see if that changes the picture, but meanwhile we'd like to develop some understanding of the mechanisms behind this. Static linking of the same object files is OK, but dynamic linking fails; why? What additional actions are performed during dynamic linking? How could I guess in advance for which module variables "acc declare create" is required?

–Alexey

Thanks! I’m in meetings most of the morning but will take a look in a bit.

I'm not sure. My best guess is that the device compiler can implicitly generate the device module variables when all the objects are linked together (via GPU LTO, link-time optimization). However, when the module is in a shared object, the device symbols can't be implicitly discovered (LTO doesn't work with shared objects) and hence aren't included. I'll try to confirm once I can recreate the issue.

How could I guess in advance for which module variables "acc declare create" is required?

When there’s direct access to a module variable from a device subroutine, the module variable should be included in a declare directive. The compiler should give feedback messages when it’s missing. I’ll check if this is happening here.
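For illustration (this routine and its body are invented, not taken from CLOUDSC), the pattern that requires the declare is a direct reference to a module variable inside device code:

```fortran
SUBROUTINE DEMO_DEVICE_USE(T, OUT)
  USE YOMCST, ONLY: RD, RV    ! direct access to module variables...
  IMPLICIT NONE
  !$acc routine seq           ! ...from a routine compiled for the device
  REAL(KIND=8), INTENT(IN)  :: T
  REAL(KIND=8), INTENT(OUT) :: OUT
  OUT = (RD / RV) * T         ! without "acc declare" in YOMCST, nvlink
                              ! reports an undefined _yomcst_* symbol
END SUBROUTINE DEMO_DEVICE_USE
```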

What additional actions are performed during dynamic linking?

If this were C++ code, you'd need to link with "-gpu=nordc", but with Fortran and C we can do the device linking (RDC, relocatable device code) on the shared object itself.
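Roughly (these command lines are illustrative, not the exact ones from this build), the default RDC path device-links the shared object itself, while "-gpu=nordc" would skip relocatable device code entirely:

```sh
# Default (RDC): a device link step runs when creating the .so
nvfortran -acc=gpu -shared -o libcloudsc-common-lib.so yomcst.o yoethf.o ...

# C++-style workaround: compile/link without relocatable device code
nvfortran -acc=gpu -gpu=nordc -c some_file.F90
```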

Now, if my guess above is correct, then you might not be able to rely on the compiler to implicitly generate device routines or add global device symbols, but otherwise I wouldn't expect anything extra.

Hi Alexey,

I was able to reproduce the error, but I'm still working on understanding what's going on.

Though when I tried to add the flag "-Minfo=accel" to the flagset, it's not being added. I created a "dnb-xxx.yaml" and added it there, and also tried "dnb-generic.yaml" and "dnb-mn5-acc.yaml". I seem to be missing a step on how to configure the scripts to use the yaml file. The build doesn't use the flags you have in there either, just "-acc".

The good news is that the "./cloudsc-v1.5.3.src/src/common/module/yomcst.F90" file does have the "declare" directive, as well as the "update" directive in the YOMCST_LOAD_PARAMETERS subroutine. I missed this earlier, but it is what I expect.

Still working on it, but out of time for today.

-Mat

Hi Mat,

Thanks for your effort.

I had a bug in the dnb.sh script that resulted in the compiler flags given in the yaml files being ignored. Here is a patch that corrects this, and also instructs cmake to dump all the commands executed by make:

--- dnb.sh	2025-09-23 17:02:35.472818450 +0200
+++ dnb.sh	2025-10-02 12:26:11.492955086 +0200
@@ -15,7 +15,7 @@
         FLAGS+=" --with-mpi"
         is_set_to_true DNB_CLOUDSC_WITH_GPU && FLAGS+=" --with-gpu"
         mkdir -p ./arch/dnb
-        here-document ./arch/dnb/toolset.cmake << ________EOF
+        here-document ./arch/dnb/toolchain.cmake << ________EOF
             set( ECBUILD_FIND_MPI ON )
             set( OpenMP_Fortran_FLAGS   "$DNB_CLOUDSC_FCFLAGS_OMP" CACHE STRING "" )
             set( OpenMP_C_FLAGS         "$DNB_CLOUDSC_CFLAGS_OPENMP" CACHE STRING "" )
@@ -30,12 +30,13 @@
             set( ECBUILD_CXX_FLAGS "$DNB_CLOUDSC_CXXFLAGS" )
 ________EOF
         here-document ./arch/dnb/env.sh << ________EOF
-            export ECBUILD_TOOLCHAIN="./toolchain.cmake"
+            export ECBUILD_TOOLCHAIN="$PWD/arch/dnb/toolchain.cmake"
 ________EOF
         FLAGS+=" --arch=dnb" 
         local CMAKE_FLAGS=""
         CMAKE_FLAGS+=" ENABLE_MPI=ON"
         CMAKE_FLAGS+=" ENABLE_HDF5=ON"  
+        CMAKE_FLAGS+=" CMAKE_VERBOSE_MAKEFILE=ON"
         if is_set_to_true DNB_CLOUDSC_WITH_GPU; then
             CMAKE_FLAGS+=" ENABLE_CUDA=OFF"
             is_set_to_true DNB_CLOUDSC_WITH_DYNAMIC_LINK && CMAKE_FLAGS+=" ENABLE_GPU_FORCE_SHARED=ON"

I'd recommend making changes to the command line arg definitions directly in dnb-generic.yaml (and keeping the link machine.yaml → dnb-generic.yaml); I think there is no need for a dnb-xxx.yaml for our purposes. So, in my case, I simply added "-Minfo=accel" to DNB_CLOUDSC_FCFLAGS_ACC in dnb-generic.yaml, and it works now.

PS One more recommendation:
You can run ./dnb.sh cloudsc instead of ./dnb.sh at the build stage to avoid re-building hdf5 every time.

–Alexey

Thanks, that worked. I checked all the usual potential issues, like "-acc" missing from the library link, but everything seems to check out. Also, using cuobjdump, I can see the symbols in the shared object, but for some reason the device linker can't resolve them.

I'm asking around on the GPU team to see if they have any ideas.

Here’s the output I’m seeing. Note that I’ve tried to simplify the link just in case some other flag was causing the issue.

The link error:

% nvfortran -acc=gpu "CMakeFiles/dwarf-cloudsc-gpu-scc-k-caching.dir/dwarf_cloudsc_gpu.F90.o" "CMakeFiles/dwarf-cloudsc-gpu-scc-k-caching.dir/cloudsc_driver_gpu_scc_k_caching_mod.F90.o" "CMakeFiles/dwarf-cloudsc-gpu-scc-k-caching.dir/cloudsc_gpu_scc_k_caching_mod.F90.o" ../../../lib/libcloudsc-common-lib.so 
nvlink error   : Undefined reference to '_yomcst_21' in 'CMakeFiles/dwarf-cloudsc-gpu-scc-k-caching.dir/cloudsc_gpu_scc_k_caching_mod.F90.o'
nvlink error   : Undefined reference to '_yoethf_21' in 'CMakeFiles/dwarf-cloudsc-gpu-scc-k-caching.dir/cloudsc_gpu_scc_k_caching_mod.F90.o'
pgacclnk: child process exit status 2: /proj/nv/Linux_x86_64/286274-dev/compilers/bin/tools/nvdd

Here’s the reference in the cloudsc_gpu_scc_k_caching_mod.F90.o:

% cuobjdump -elf CMakeFiles/dwarf-cloudsc-gpu-scc-k-caching.dir/cloudsc_gpu_scc_k_caching_mod.F90.o | grep yomcst
0x1b               0            0x48     0x1d     0x20      0     _yomcst_21
0x6210    _yomcst_21    R_CUDA_ABS32_HI_32    0x0
0x61f0    _yomcst_21    R_CUDA_ABS32_LO_32    0x0
0xce0    _yomcst_21    R_CUDA_ABS32_HI_32    0x0
0xcc0    _yomcst_21    R_CUDA_ABS32_LO_32    0x0
.extern .global .align 16 .b8 _yomcst_21[72];
ld.global.v2.f64        {%fd814, %fd815}, [_yomcst_21];
ld.global.v2.f64        {%fd816, %fd817}, [_yomcst_21+16];
ld.global.v2.f64        {%fd826, %fd827}, [_yomcst_21+48];
ld.global.v2.f64        {%fd832, %fd833}, [_yomcst_21+32];
ld.global.f64   %fd36, [_yomcst_21+56];

And here’s the symbol in the shared object:

% cuobjdump -elf ../../../lib/libcloudsc-common-lib.so | grep yomcst
0xd            0xb0            0x48     0x11        0    0xc     _yomcst_21
.common .global .align 16 .b8 _yomcst_21[72];
Skipping .debug_frame section, as length was 0

Note that sometimes link order can cause this, but the library should be at the end. Just in case, I moved its placement to different spots and even added it multiple times, but still no luck.

Let’s see if the GPU team has ideas for me.

@MatColgrove Hi Mat, may I ask if there is any update on this case?