Accelerated Fortran stdpar code failing at runtime

I have a project where I’m trying to offload some loops to the GPU with do concurrent. It compiles, but when running it fails with:

Current file:     /.../file.inc
        function: function_name
        line:     172
This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86

I’m struggling to find the cause of this. The line points to a do concurrent loop, and I have tried export NVCOMPILER_TERM=trace, but I get no additional information. I am also struggling to reduce it to a minimal reproducible example, because a single test file compiles and runs fine, for example:

program doconcurrent
    implicit none
    integer :: i, n
    real :: a, x(100), y(100)

    n = 100
    a = 2.0
    x = [(real(i), i=1,n)]
    y = [(real(i), i=1,n)]

    do concurrent (i = 1:n)
        y(i) = y(i) + a*x(i)
    enddo

    print *, "Results:"
    do i = 1, n
        print *, "y(", i, ") = ", y(i)
    enddo
end program doconcurrent

compiled with nvfortran -stdpar=gpu -acc=gpu -gpu=cc80 -gpu=cc86 doconcurrent.f90. Both nvidia-smi and nvaccelinfo seem to output correctly:

CUDA Driver Version:           12080
NVRM version:                  NVIDIA UNIX Open Kernel Module for x86_64  570.133.07  Release Build  (dvs-builder@U22-I3-G01-1-1)  Fri Mar 14 12:57:14 UTC 2025

Device Number:                 0
Device Name:                   NVIDIA GeForce RTX 3060 Ti
Device Revision Number:        8.6
Global Memory Size:            8232108032
Number of Multiprocessors:     38
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1740 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             7001 MHz
Memory Bus Width:              256 bits
L2 Cache Size:                 3145728 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
Unified Memory:                HMM
Memory Models Flags:           -gpu=mem:separate, -gpu=mem:managed, -gpu=mem:unified
Default Target:                cc86
Wed May 14 09:12:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   34C    P8              7W /  220W |      32MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          181012      G   /usr/lib/xorg/Xorg                        9MiB |
|    0   N/A  N/A          181438      G   /usr/bin/gnome-shell                      3MiB |
+-----------------------------------------------------------------------------------------+
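
In case it is useful, this is the kind of extra runtime tracing I can turn on (a sketch; NVCOMPILER_ACC_NOTIFY is the NVHPC kernel-launch/data-transfer notifier as far as I understand, and ./myapp is just a placeholder for our actual binary):

# Extra runtime diagnostics (sketch; ./myapp is a placeholder)
export NVCOMPILER_TERM=trace       # already tried, gives no additional output here
export NVCOMPILER_ACC_NOTIFY=3     # report kernel launches (1) and data transfers (2)
mpirun -np 1 ./myapp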

I don’t know if the fact that I’m using MPI might help narrow down the cause. Thank you in advance!

This error typically means that the runtime can’t find a GPU binary matching the target device found on the system.

Here, you do have a binary compiled for your target device. Though, while rare, what can happen is that the runtime can’t find/open the CUDA driver (i.e. libcuda.so), so it tries to run the host fallback code, which doesn’t exist.

Typically running with MPI isn’t an issue, but are you running in a different environment, like using a Slurm scheduler? Maybe the environment variable LD_LIBRARY_PATH needs to be set to include the path to libcuda.so?

Granted, I’m guessing, and it doesn’t quite make sense, so something else could be going on, but let’s start here.

Hello, thanks for the answer. I am running the code locally right now, without the Slurm scheduler. Just to be sure, I reinstalled the NVHPC package from the tarball, and I am now using the nvhpc/25.3 module. I see that libcuda.so is present both in my system’s /usr directory and in the NVHPC folder.

➜  arborescence-compil git:(stdpar) ✗ echo $LD_LIBRARY_PATH             
/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nvshmem/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/nccl/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/math_libs/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/extras/qd/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda/lib64
➜  arborescence-compil git:(stdpar) ✗ sudo find /usr/ -name 'libcuda.so.*'
[sudo] password for eduard: 
/usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/i386-linux-gnu/libcuda.so.570.133.07
/usr/lib/i386-linux-gnu/libcuda.so.1
➜  arborescence-compil git:(stdpar) ✗ sudo find /opt/nvidia/ -name 'libcuda.so.*'
/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda/12.8/compat/libcuda.so.1
/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda/12.8/compat/libcuda.so.570.124.06
/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/REDIST/cuda/12.8/compat/libcuda.so.1
/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/REDIST/cuda/12.8/compat/libcuda.so.570.124.06

FYI, /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda/12.8/compat/libcuda.so is a non-functional stub library, so if your LD_LIBRARY_PATH included it, then it would cause this error. But I don’t see it there, so unless your MPI is picking it up, it’s unlikely to be your issue.

/usr/lib/x86_64-linux-gnu/libcuda.so* is the correct driver, and it would be highly unusual for this not to be in the loader’s default search path. But just in case, try adding this directory to your LD_LIBRARY_PATH.
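
For example, something like this (a generic sketch, not specific to your system; ./your_app is a placeholder for your executable) shows which libcuda.so the loader knows about and puts the system driver directory first:

ldconfig -p | grep libcuda.so      # list the libcuda.so entries in the loader cache
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
mpirun -np 1 ./your_app            # placeholder launch of the failing binary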

Some other less common causes of this are when the Nouveau drivers are installed (not the case here), when using C++ shared objects not built with “-gpu=nordc”, or when on WSL with the CUDA driver installed in a non-default location.

We did have a bug in 24.7, fixed in 24.11, when “-gpu=ccall” was used since “cc86” wasn’t included in the ‘all’ list. Hence I can’t completely discount a compiler issue, but if it were, I’d expect you to see a failure with or without MPI.

Also, if the CUDA driver were being found, I’d expect another message saying something like “Rebuild with -gpu=ccXX to use device 0”, i.e. it would tell you which binary target is needed for this device.

Again I can’t be sure, but everything I’m seeing thus far points to some environmental issue with your mpirun causing the CUDA driver not to be found.

What MPI are you using? Your own or one of the versions we ship with the compiler?

I am using the MPI target from CMake, MPI::MPI_Fortran. It seems to be choosing the right one, judging from the output of ldd:

➜  arborescence-compil git:(stdpar) ✗ ldd build_nvhpc_cudss_doconcurrent/Release/bin/forward_acoustic_hdg.out                
        linux-vdso.so.1 (0x000078efdf78b000)
        libnvhpcmanaux.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvhpcmanaux.so (0x000078efdf400000)
        libscalapack_lp64.so.2 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libscalapack_lp64.so.2 (0x000078efdec00000)
        liblapack_lp64.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/liblapack_lp64.so.0 (0x000078efdde00000)
        libblas_lp64.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libblas_lp64.so.0 (0x000078efdbe00000)
        liblapack.so.3 => /lib/x86_64-linux-gnu/liblapack.so.3 (0x000078efdb600000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x000078efdf743000)
        libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x000078efdf72f000)
        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x000078efdf6ef000)
        libblas.so.3 => /lib/x86_64-linux-gnu/libblas.so.3 (0x000078efdf665000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000078efdf309000)
        libcudart.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda/lib64/libcudart.so.12 (0x000078efdb200000)
        libcublas.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/math_libs/lib64/libcublas.so.12 (0x000078efd4000000)
        libcublasLt.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/math_libs/lib64/libcublasLt.so.12 (0x000078efa1c00000)
        libcusparse.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/math_libs/lib64/libcusparse.so.12 (0x000078ef8a800000)
        libnvJitLink.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/cuda/lib64/libnvJitLink.so.12 (0x000078ef84c00000)
        libmpi.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi.so.40 (0x000078ef84800000)
        libnvomp.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvomp.so (0x000078ef83600000)
        libnvhpcatm.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvhpcatm.so (0x000078ef83200000)
        libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x000078efdf656000)
        libnvcpumath.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvcpumath.so (0x000078ef82c00000)
        libnvc.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvc.so (0x000078ef82800000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000078ef82400000)
        /lib64/ld-linux-x86-64.so.2 (0x000078efdf78d000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000078efdf627000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000078ef82000000)
        libmpi_usempif08.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi_usempif08.so.40 (0x000078ef81c00000)
        libmpi_usempi_ignore_tkr.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi_usempi_ignore_tkr.so.40 (0x000078ef81800000)
        libmpi_mpifh.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi_mpifh.so.40 (0x000078ef81400000)
        libacchost.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libacchost.so (0x000078ef81000000)
        libaccdevaux.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libaccdevaux.so (0x000078ef80c00000)
        libaccdevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libaccdevice.so (0x000078ef80800000)
        libnvhpcman.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvhpcman.so (0x000078ef80400000)
        libcudadevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libcudadevice.so (0x000078ef80000000)
        libnvf.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvf.so (0x000078ef7f800000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000078efdf61e000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000078efdf617000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000078efdf612000)
        libopenblas.so.0 => /lib/x86_64-linux-gnu/libopenblas.so.0 (0x000078ef7c6f8000)
        libmvec.so.1 => /lib/x86_64-linux-gnu/libmvec.so.1 (0x000078efdeb06000)
        libgfortran.so.5 => /lib/x86_64-linux-gnu/libgfortran.so.5 (0x000078ef7c200000)
        libopen-rte.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libopen-rte.so.40 (0x000078ef7be00000)
        libopen-pal.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libopen-pal.so.40 (0x000078ef7ba00000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x000078efdf60b000)
➜  arborescence-compil git:(stdpar) ✗ which mpirun
/opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/mpi/bin/mpirun

while the minimal example, which works, shows

➜  arborescence-compil git:(stdpar) ✗ ldd a.out                                                                                                 
        linux-vdso.so.1 (0x000074aaefe0a000)
        libnvhpcmanaux.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvhpcmanaux.so (0x000074aaefc00000)
        libmpi_usempif08.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi_usempif08.so.40 (0x000074aaef800000)
        libmpi_usempi_ignore_tkr.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi_usempi_ignore_tkr.so.40 (0x000074aaef400000)
        libmpi_mpifh.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi_mpifh.so.40 (0x000074aaef000000)
        libmpi.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libmpi.so.40 (0x000074aaeec00000)
        libacchost.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libacchost.so (0x000074aaee800000)
        libaccdevaux.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libaccdevaux.so (0x000074aaee400000)
        libaccdevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libaccdevice.so (0x000074aaee000000)
        libnvhpcman.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvhpcman.so (0x000074aaedc00000)
        libcudadevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libcudadevice.so (0x000074aaed800000)
        libnvf.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvf.so (0x000074aaed000000)
        libnvomp.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvomp.so (0x000074aaebe00000)
        libnvhpcatm.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvhpcatm.so (0x000074aaeba00000)
        libnvcpumath.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvcpumath.so (0x000074aaeb400000)
        libnvc.so => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/compilers/lib/libnvc.so (0x000074aaeb000000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074aaeac00000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000074aaefbad000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000074aaefab6000)
        /lib64/ld-linux-x86-64.so.2 (0x000074aaefe0c000)
        libopen-rte.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libopen-rte.so.40 (0x000074aaea800000)
        libopen-pal.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib/libopen-pal.so.40 (0x000074aaea400000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x000074aaefaaf000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x000074aaefa91000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000074aaefa8c000)
        libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x000074aaefa7f000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000074aaefa7a000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000074aaefa75000)

I have also tried running with LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libcuda.so.1" mpirun -np 1 ./exec just to be sure, but again the minimal example works while our application fails at runtime.

Thanks, I think that rules out the possibility that the driver isn’t getting found. Unfortunately, this means it’s something I’ve not seen before, so I don’t know what’s going on.

Are you able to share your full project? If not publicly, feel free to direct message me by following the link on my user name and selecting the “message” button. We can then arrange how you can share it.

Hello, I managed to isolate the problem to a dependency. The strange thing is that it fails even if the dependency is not used, only linked. More specifically:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.20...4.0)

project(hawen LANGUAGES Fortran C)

include(FetchContent)

set(CMAKE_INSTALL_PREFIX ${CMAKE_CURRENT_BINARY_DIR}/local)
set(BUILD_SINGLE off)
set(BUILD_DOUBLE on)
set(BUILD_COMPLEX off)
set(BUILD_COMPLEX16 off)
set(MUMPS_parallel on)

FetchContent_Declare(MUMPSupstream
    GIT_REPOSITORY https://github.com/scivision/mumps.git
    GIT_TAG 2e396fed1d83857b5da86878fbe2df85bd3a0449
)

FetchContent_MakeAvailable(MUMPSupstream)

add_executable(main main.f90)
target_link_libraries(main PRIVATE MUMPS::MUMPS)

! main.f90
program main
    implicit none

    integer :: i, n
    real :: a, x(100), y(100)
    n = 100
    a = 2.0
    x = [(real(i), i=1,n)]
    y = [(real(i), i=1,n)]
    do concurrent (i = 1:n)
        y(i) = y(i) + a*x(i)
    enddo
end program main

and then compiled and run with

export FFLAGS="-stdpar=gpu"
cmake -B build
cmake --build build
./build/main

will fail with

Current file:     /home/eduard/mumps-test/main.f90
        function: main
        line:     10
This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86

Even though the library isn’t used in the code, changing the languages in the project() call to

...
project(hawen LANGUAGES Fortran)
...

fixes the issue, for some reason. Do you have any idea, at first glance, why that could be? I will report it to the maintainers.

Thanks! With this recipe, I was able to reproduce the error.

It looks like CMake is implicitly adding our runtime libraries to the end of the link line, which is causing the error.

A bit of background: the compiler implicitly includes an initialization object, in this case “acc_init_link_cuda.o”, which performs the device initialization and registers the CUDA kernels. The exact object used depends on the target (host, multicore, cuda) and on whether unified memory is used.

The object contains weak references which are also in our runtime. This way the references can always be added, but if you’re compiling just for the host (i.e. no STDPAR, OpenACC, OpenMP), then they get resolved by the runtime instead of the object. So the order on the link line is important in that this object needs to come after the runtime libraries (in particular -lnvc), so the correct reference is used.

Here, CMake is explicitly adding “-lnvc” at the end of the link line, which in turn causes the initialization routines to be resolved to libnvc instead of the device initialization object. Hence, when execution reaches the device code, it gives this error, since the CUDA driver was never loaded (that loading is done in the init).
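
One way to see this (a sketch; the link.txt path assumes CMake’s Makefile generator, and -dryrun just prints the commands nvfortran would otherwise run) is to compare CMake’s link line with what the compiler driver does on its own:

cat build/CMakeFiles/main.dir/link.txt    # CMake's link line; note the trailing -lnvc
nvfortran -stdpar=gpu -dryrun main.f90    # shows the implicit acc init object and library order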

I’m not an expert in CMake, so unfortunately I don’t know how to fix this. Though I do have someone I can ask, so I will see what he thinks.

I think I found the problem without needing to ask.

I found this report on Kitware, which has the same issue: NVHPC: Multiple languages + OpenACC prevents GPU usage (#25644) · Issues · CMake / CMake · GitLab

The problem wasn’t with CMake itself but rather with one of the modules.

In your build’s generated CMake files, it’s explicitly adding the libraries:

./CMakeFiles/3.28.3/CMakeCCompiler.cmake:set(CMAKE_C_IMPLICIT_LINK_LIBRARIES "nvf;nvomp;dl;nvhpcatm;atomic;pthread;nvcpumath;nsnvc;nvc;rt;pthread;gcc;c;gcc_s;m")

Though this does look like it comes from CMake itself, so I’m not sure why it’s getting added.
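
If you need a stopgap while this is sorted out, something along these lines might work (an untested sketch; adjust the library names to whatever your CMakeCCompiler.cmake actually lists) to keep the C-language implicit libraries from being appended behind the accelerator init object:

# Untested workaround sketch: strip the NVHPC runtime (in particular nvc) from
# the C implicit link libraries so CMake doesn't re-add it at the end of the
# Fortran link line. Place this after project(... LANGUAGES Fortran C).
if(CMAKE_Fortran_COMPILER_ID STREQUAL "NVHPC")
    list(REMOVE_ITEM CMAKE_C_IMPLICIT_LINK_LIBRARIES nvc nvomp nvcpumath nvhpcatm nvf)
endif()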

Thank you for your help, that was already very insightful!