Behaviour of OpenMP target maps with Fortran arrays

Hi,

I’m trying to understand why this basic Fortran program with OpenMP offloading isn’t producing the results I expect.

program main

        implicit none
        integer, parameter:: ma=10
        double precision:: a(ma), b(ma)
        integer:: i

        a(:) = 0.d0
        b(:) = [(i, i=1,ma)]

        write(*,"(5(f14.7,1x))") b

        !$omp target enter data map(to:a, b)

        !$omp target loop map(from:a)
        do i = 1, ma
                a(i) = b(i)
        end do

        !$omp target exit data map(from:a)

        write(*,"(5(f14.7,1x))") a
end program main

The program is compiled with nvfortran -mp gpu test.f90 on a laptop GTX 1650 through WSL2.

I expected the a array to be copied back to the host with all the values from b, i.e. the numbers 1 to ma, and the write statement at the end to print those numbers, but instead I get zeros.

Could I get some help understanding what I’m doing wrong here? A similar program using the equivalent OpenACC directives works the way I expect.

Thanks

Hi edoy,

For good or bad, the code runs correctly for me, so I’m not sure what’s going on. Can you let me know what compiler version you’re using? Also, let’s add the flag “-Minfo=mp” to see what the compiler feedback messages are telling us, and set the environment variable “NV_ACC_NOTIFY=2” so the runtime will tell us when it copies data.

Note that I assume “-mp gpu” is a typo. Without the “=”, I’d expect you’d get a linker error, since “gpu” would be passed to the linker.
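To make the distinction concrete (a paraphrase of the above, not verified against every release): the first spelling below enables OpenMP for the host and hands the stray “gpu” token on to the link step, while the second enables OpenMP offload to the GPU.

% nvfortran -mp gpu test.f90
% nvfortran -mp=gpu test.f90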

% nvfortran -mp=gpu test.f90 -Minfo=mp; a.out
main:
     13, Generating target enter data map(to: a(:),b(:))
     15, !$omp target loop
         15, Generating "nvkernel_MAIN__F1L15_2" GPU kernel
             Generating NVIDIA GPU code
           16, Loop parallelized across teams, threads(32) ! blockidx%x threadidx%x
         15, Generating Multicore code
           16, Loop parallelized across threads
     15, Generating implicit map(tofrom:b(:))
         Generating map(from:a(:))
     20, Generating target exit data map(from: a(:))
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
% setenv NV_ACC_NOTIFY 2
% a.out
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
upload CUDA data  file=/local/home/mcolgrove/test.f90 function=main line=13 device=0 threadid=1 variable=a(:) bytes=80
upload CUDA data  file=/local/home/mcolgrove/test.f90 function=main line=13 device=0 threadid=1 variable=b(:) bytes=80
download CUDA data  file=/local/home/mcolgrove/test.f90 function=main line=20 device=0 threadid=1 variable=a(:) bytes=80
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000

Hi Mat,
Yes, you’re right, I compiled with -mp=gpu. Following your instructions, I get:

$ /opt/nvidia/hpc_sdk/Linux_x86_64/24.11/compilers/bin/nvfortran -target=gpu -mp=gpu -Minfo=all test.f90
main:
     13, Generating target enter data map(to: a(:),b(:))
     15, !$omp target loop
         15, Generating "nvkernel_MAIN__F1L15_2" GPU kernel
             Generating NVIDIA GPU code
           16, Loop parallelized across teams, threads(32) ! blockidx%x threadidx%x
         15, Generating Multicore code
           16, Loop parallelized across threads
     15, Generating implicit map(tofrom:b(:))
         Generating map(from:a(:))
     16, Recognized memory copy idiom
     20, Generating target exit data map(from: a(:))

$ NV_ACC_NOTIFY=2 ./a.out
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
upload CUDA data  file=/home/edwardy/test.f90 function=main line=13 device=0 threadid=1 variable=a(:) bytes=80
upload CUDA data  file=/home/edwardy/test.f90 function=main line=13 device=0 threadid=1 variable=b(:) bytes=80
download CUDA data  file=/home/edwardy/test.f90 function=main line=20 device=0 threadid=1 variable=a(:) bytes=80
     0.0000000      0.0000000      0.0000000      0.0000000      0.0000000
     0.0000000      0.0000000      0.0000000      0.0000000      0.0000000

which, I guess, indicates that the download of a is happening as expected, but for some reason it isn’t reflected in the output?

Some new information: I tried compiling using the latest nvhpc container and it worked! So this probably means there’s something wrong with my install and/or environment. Previously, I installed nvhpc using the deb package.

Good, I’m glad the updated container works for you.

I updated the nvhpc deb package to 25.1 (previously 24.11) but it’s still not working. I can work with the container for now, but do you know if there’s anything else that could be wrong with my local environment?

Hmm, my one thought is that it’s not the data movement that’s the problem, but that the kernel itself is silently failing.

WSL2 puts the CUDA driver (libcuda.so) in a non-default location, so users often need to set LD_LIBRARY_PATH so the binary can find it. Now, I’d expect a different error if it couldn’t be found, but I’m wondering if, instead of the real libcuda.so, your environment is picking up a non-functional stub version of the library?
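For what it’s worth, on a typical WSL2 setup the Windows driver’s libcuda lives under /usr/lib/wsl/lib, so if the loader weren’t finding it, pointing LD_LIBRARY_PATH there (assuming a bash-style shell) would be the usual workaround:

$ export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH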

It’s a long shot, but you can confirm by running strace on the binary and searching the output for libcuda to see which one is being used.

% strace a.out >& s.log
% grep libcuda s.log
openat(AT_FDCWD, "/usr/lib64/libcudadevice.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/compilers/lib/libcudadevice.so", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib64/libcuda.so", O_RDONLY|O_CLOEXEC) = 3

I think it’s picking up the right one:

openat(AT_FDCWD, "/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libcudadevice.so", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/glibc-hwcaps/x86-64-v3/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/glibc-hwcaps/x86-64-v2/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/glibc-hwcaps/x86-64-v3/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/glibc-hwcaps/x86-64-v2/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/wsl/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/wsl/drivers/nvdmi.inf_amd64_ff6ae5a857dd670b/libcuda.so.1.1", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

Other GPU Fortran stuff works, e.g. CUDA Fortran code and OpenACC-annotated code run fine. The wrong result only seems to happen when compiling for OpenMP offloading.

Yes, that appears to be the correct driver. Some other things to try:

  • Set the environment variable “OMP_TARGET_OFFLOAD=MANDATORY” to force offload
  • Run the binary with “compute-sanitizer a.out” to check if there are any silent kernel errors
  • Instead of the “loop” directive, try "!$omp target teams distribute parallel do map(from:a)" (sketched below)
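Applied to the loop in the program above, that would look something like this (just a sketch; the map clause is unchanged):

        !$omp target teams distribute parallel do map(from:a)
        do i = 1, ma
                a(i) = b(i)
        end do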

  • Changing “loop” to “teams distribute parallel do” results in the same behaviour.
  • Running with OMP_TARGET_OFFLOAD=MANDATORY causes an error: Fatal error: Could not run target region on device 0, execution terminated.
  • Running with compute-sanitizer and OMP_TARGET_OFFLOAD=MANDATORY:
========= COMPUTE-SANITIZER
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
Fatal error: Could not run target region on device 0, execution terminated.
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

Running with compute-sanitizer but without the environment variable results in no errors and the program completes normally, but it still prints the wrong values.

OpenMP offload, per the standard, has a host fallback. In other words, if the code can’t execute on the device, it runs on the host instead. That’s what’s happening here: the kernel is running on the host, but the unstructured data directives are still active, so the host values of “a” are getting overwritten with the (still zeroed) device values.
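One way to see this from inside the program, rather than from the runtime messages, is to ask the OpenMP runtime where the region actually ran. Here’s a minimal sketch (assuming only the standard omp_lib module):

program check_offload
        use omp_lib, only: omp_get_num_devices, omp_is_initial_device
        implicit none
        logical:: on_host

        ! How many offload devices does the OpenMP runtime see?
        write(*,*) "num devices:", omp_get_num_devices()

        ! omp_is_initial_device() returns .true. when executed on the host,
        ! so if the target region falls back, on_host comes back .true.
        on_host = .true.
        !$omp target map(from:on_host)
        on_host = omp_is_initial_device()
        !$omp end target

        if (on_host) then
                write(*,*) "target region fell back to the host"
        else
                write(*,*) "target region ran on the device"
        end if
end program check_offload

Compiled with -mp=gpu, this should report that the region ran on the device when offload actually happens, and that it fell back to the host otherwise.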

Setting “MANDATORY” forces offload, hence the error. Now the question is why it’s erroring.

What do we know so far:

  1. The correct CUDA driver is getting picked up
  2. The code works fine in the container, so it’s likely an environment issue
  3. You are running on an older device, but if that were an issue, then the container environment would fail as well, so unlikely to be the problem.
  4. The Windows CUDA driver uses WDDM mode for GTX devices, which doesn’t allow for compute, only graphics; compute is only available with TCC mode. This could be a cause, but since the OpenACC and CUDA Fortran versions work, it’s less likely, as those would fail as well if this were the case.

Can you run “nvidia-smi” from your command line?

This will show us what mode the driver is in. Also, I’m wondering if the device 0 that it’s picking up isn’t the GTX1650 card, but a different device which can’t run compute. The device enumeration could differ between OpenMP and OpenACC (depends on how the CUDA driver presents it to each runtime).

If indeed nvidia-smi shows a second device, let’s try setting the environment variable “CUDA_VISIBLE_DEVICES” to the device number of the GTX1650 shown in the output.
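For example, if the GTX1650 had shown up as device 1 (hypothetical numbering), that would look like:

$ CUDA_VISIBLE_DEVICES=1 ./a.out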

Here’s the nvidia-smi output when running from cmd.exe in Windows:

Fri Feb  7 08:15:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 572.16                 Driver Version: 572.16         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1650      WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8              4W /   50W |     220MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3880    C+G   ...5n1h2txyewy\TextInputHost.exe      N/A      |
|    0   N/A  N/A            8092    C+G   ...ll Peripheral Manager\DPM.exe      N/A      |
|    0   N/A  N/A            8508    C+G   ...64__zpdnekdrzrea0\Spotify.exe      N/A      |
|    0   N/A  N/A           12728    C+G   ...ocal\Programs\zulip\Zulip.exe      N/A      |
|    0   N/A  N/A           14812    C+G   ...ams\breaktimer\BreakTimer.exe      N/A      |
|    0   N/A  N/A           15548    C+G   ...8bbwe\PhoneExperienceHost.exe      N/A      |
|    0   N/A  N/A           16192    C+G   ...m\103.0.3.0\GoogleDriveFS.exe      N/A      |
|    0   N/A  N/A           17284    C+G   ...h_cw5n1h2txyewy\SearchApp.exe      N/A      |
|    0   N/A  N/A           22876    C+G   ...App_cw5n1h2txyewy\LockApp.exe      N/A      |
|    0   N/A  N/A           23456    C+G   ...ekyb3d8bbwe\CalculatorApp.exe      N/A      |
+-----------------------------------------------------------------------------------------+

If I try setting it to TCC:

$ nvidia-smi -dm 1
Unable to set driver model for GPU 00000000:01:00.0: Unknown Error

and setting it to WDDM:

$ nvidia-smi -dm 0
Driver model is already set to WDDM for GPU 00000000:01:00.0.

WDDM could explain the problem, though I haven’t worked on Windows in many years. Back then, at least, you couldn’t do compute through WDDM, but this guide to using GPUs on WSL2 talks about GPU pass-through with WDDM and, in a different section, about using CUDA. This implies to me that it might be OK now with Pascal or newer devices, but I’m not really sure.

Wish I could be more helpful, but the NVHPC SDK doesn’t officially support WSL2, and by that I mean we don’t test the compilers on WSL2. I’ve had many users use it without issue (it is just Linux), but I can’t say for sure there aren’t problems, especially when there’s interaction with the Windows side.

I’d say stick with OpenACC for now, though before you do, double check that it really is using the device and not falling back to the host as well. Can you run the OpenACC version with “NV_ACC_NOTIFY=3” set in the environment? This will show both the data movement and the kernel launches. If nothing prints, then it’s running on the host.

Thanks Mat,

Looks like similar OpenACC code does run on the GPU.

The code:

program main

        implicit none
        integer, parameter:: ma=10
        double precision:: a(ma), b(ma)
        integer:: i

        a(:) = 0.d0
        b(:) = [(i, i=1,ma)]

        write(*,"(5(f14.7,1x))") b

        !$acc parallel loop
        do i = 1, ma
                a(i) = b(i)
        end do

        write(*,"(5(f14.7,1x))") a
end program main

Commands to compile and run:

nvfortran -target=gpu -mp=gpu -Minfo=all -acc test.f90
NV_ACC_NOTIFY=3 ./a.out

output:

     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
upload CUDA data  file=/home/edwardy/test.f90 function=main line=16 device=0 threadid=1 variable=b(:) bytes=80
launch CUDA kernel  file=/home/edwardy/test.f90 function=main line=16 device=0 threadid=1 num_gangs=1 num_workers=1 vector_length=32 grid=1 block=32
download CUDA data  file=/home/edwardy/test.f90 function=main line=23 device=0 threadid=1 variable=a(:) bytes=80
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000

I’ve managed to get access to machines with native Linux, and I’m using OpenMP offload on them. But it’s a shame I can’t do the dev locally :(.