Behaviour of OpenMP target maps with Fortran arrays

Hi,

I’m trying to understand why this basic Fortran program with OpenMP offloading isn’t producing the results I expect.

program main

        implicit none
        integer, parameter:: ma=10
        double precision:: a(ma), b(ma)
        integer:: i

        a(:) = 0.d0
        b(:) = [(i, i=1,ma)]

        write(*,"(5(f14.7,1x))") b

        !$omp target enter data map(to:a, b)

        !$omp target loop map(from:a)
        do i = 1, ma
                a(i) = b(i)
        end do

        !$omp target exit data map(from:a)

        write(*,"(5(f14.7,1x))") a
end program main

The program is compiled with nvfortran -mp gpu test.f90 on a laptop GTX 1650 through WSL2.

I expected the a array to be copied back to the host with all the values from b, i.e. the numbers 1 to ma, and the write statement at the end to print those numbers, but instead I get zeros.

Could I get some help understanding what I’m doing wrong here? A similar program using the equivalent OpenACC directives works the way I expect.

Thanks

Hi edoy,

For good or bad, the code runs correctly for me, so I’m not sure what’s going on. Can you let me know what compiler version you’re using? Also, let’s add the flag “-Minfo=mp” to see what the compiler feedback messages are telling us, and set the environment variable “NV_ACC_NOTIFY=2” so the runtime will tell us when it copies data.

Note that I assume “-mp gpu” is a typo. Without the “=”, I’d expect you’d get a linker error, since “gpu” would be passed to the linker.
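To make the distinction concrete (a paraphrase of the above, not verified against every release): the first spelling below enables OpenMP for the host and hands the stray “gpu” token on to the link step, while the second enables OpenMP offload to the GPU.

% nvfortran -mp gpu test.f90
% nvfortran -mp=gpu test.f90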

% nvfortran -mp=gpu test.f90 -Minfo=mp; a.out
main:
     13, Generating target enter data map(to: a(:),b(:))
     15, !$omp target loop
         15, Generating "nvkernel_MAIN__F1L15_2" GPU kernel
             Generating NVIDIA GPU code
           16, Loop parallelized across teams, threads(32) ! blockidx%x threadidx%x
         15, Generating Multicore code
           16, Loop parallelized across threads
     15, Generating implicit map(tofrom:b(:))
         Generating map(from:a(:))
     20, Generating target exit data map(from: a(:))
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
% setenv NV_ACC_NOTIFY 2
% a.out
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
upload CUDA data  file=/local/home/mcolgrove/test.f90 function=main line=13 device=0 threadid=1 variable=a(:) bytes=80
upload CUDA data  file=/local/home/mcolgrove/test.f90 function=main line=13 device=0 threadid=1 variable=b(:) bytes=80
download CUDA data  file=/local/home/mcolgrove/test.f90 function=main line=20 device=0 threadid=1 variable=a(:) bytes=80
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000

Hi Mat,
Yes, you’re right, I compiled with -mp=gpu. Following your instructions, I get:

$ /opt/nvidia/hpc_sdk/Linux_x86_64/24.11/compilers/bin/nvfortran -target=gpu -mp=gpu -Minfo=all test.f90
main:
     13, Generating target enter data map(to: a(:),b(:))
     15, !$omp target loop
         15, Generating "nvkernel_MAIN__F1L15_2" GPU kernel
             Generating NVIDIA GPU code
           16, Loop parallelized across teams, threads(32) ! blockidx%x threadidx%x
         15, Generating Multicore code
           16, Loop parallelized across threads
     15, Generating implicit map(tofrom:b(:))
         Generating map(from:a(:))
     16, Recognized memory copy idiom
     20, Generating target exit data map(from: a(:))

$ NV_ACC_NOTIFY=2 ./a.out
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
upload CUDA data  file=/home/edwardy/test.f90 function=main line=13 device=0 threadid=1 variable=a(:) bytes=80
upload CUDA data  file=/home/edwardy/test.f90 function=main line=13 device=0 threadid=1 variable=b(:) bytes=80
download CUDA data  file=/home/edwardy/test.f90 function=main line=20 device=0 threadid=1 variable=a(:) bytes=80
     0.0000000      0.0000000      0.0000000      0.0000000      0.0000000
     0.0000000      0.0000000      0.0000000      0.0000000      0.0000000

which, I guess, indicates that the download of a is happening as expected, but for some reason it isn’t reflected in the output?

Some new information: I tried compiling using the latest nvhpc container and it worked! So this probably means there’s something wrong with my install and/or environment. Previously, I installed nvhpc using the deb package.

Good, I’m glad the updated container works for you.

I updated the nvhpc deb package to 25.1 (previously 24.11) but it’s still not working. I can work with the container for now, but do you know if there’s anything else that could be wrong with my local environment?

Hmm, my one thought is that it’s not the data movement that’s the problem, but that the kernel itself is silently failing.

WSL2 puts the CUDA driver (libcuda.so) in a non-default location, so users often need to set LD_LIBRARY_PATH so the binary can find it. Now, I’d expect a different error if it couldn’t be found, but I’m wondering if, instead of the real libcuda.so, your environment is picking up a non-functional stub version of the library?
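For what it’s worth, on a typical WSL2 setup the Windows driver’s libcuda lives under /usr/lib/wsl/lib, so if the loader weren’t finding it, pointing LD_LIBRARY_PATH there (assuming a bash-style shell) would be the usual workaround:

$ export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH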

It’s a long shot, but you can confirm by running strace on the binary and searching the output for libcuda to see which one is being used.

% strace a.out >& s.log
% grep libcuda s.log
openat(AT_FDCWD, "/usr/lib64/libcudadevice.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/compilers/lib/libcudadevice.so", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib64/libcuda.so", O_RDONLY|O_CLOEXEC) = 3

I think it’s picking up the right one:

openat(AT_FDCWD, "/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libcudadevice.so", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/glibc-hwcaps/x86-64-v3/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/glibc-hwcaps/x86-64-v2/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/glibc-hwcaps/x86-64-v3/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/glibc-hwcaps/x86-64-v2/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/wsl/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/wsl/drivers/nvdmi.inf_amd64_ff6ae5a857dd670b/libcuda.so.1.1", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

Other GPU Fortran stuff works, e.g. CUDA Fortran code and OpenACC-annotated code run fine. The wrong result only seems to happen when compiling for OpenMP offloading.

Yes, that appears to be the correct driver. Some other things to try:

  • Set the environment variable “OMP_TARGET_OFFLOAD=MANDATORY” to force offload
  • Run the binary with “compute-sanitizer a.out” to check if there are any silent kernel errors
  • Instead of the “loop” directive, try "!$omp target teams distribute parallel do map(from:a)" (sketched below)
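Applied to the loop in the program above, that would look something like this (just a sketch; the map clause is unchanged):

        !$omp target teams distribute parallel do map(from:a)
        do i = 1, ma
                a(i) = b(i)
        end do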

  • Changing “loop” to “teams distribute parallel do” results in the same behaviour.
  • Running with OMP_TARGET_OFFLOAD=MANDATORY causes an error: Fatal error: Could not run target region on device 0, execution terminated.
  • Running with compute-sanitizer and OMP_TARGET_OFFLOAD=MANDATORY:
========= COMPUTE-SANITIZER
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
Fatal error: Could not run target region on device 0, execution terminated.
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

Running with compute-sanitizer but without the environment variable results in no errors and the program completes normally, but it still prints the wrong values.

OpenMP offload, per the standard, has a host fallback. In other words, if the code can’t execute on the device, it runs on the host instead. That’s what’s happening here: the kernel is running on the host, but the unstructured data directives are still active, so the host values of “a” are getting overwritten with the (still zeroed) device values.
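One way to see this from inside the program, rather than from the runtime messages, is to ask the OpenMP runtime where the region actually ran. Here’s a minimal sketch (assuming only the standard omp_lib module):

program check_offload
        use omp_lib, only: omp_get_num_devices, omp_is_initial_device
        implicit none
        logical:: on_host

        ! How many offload devices does the OpenMP runtime see?
        write(*,*) "num devices:", omp_get_num_devices()

        ! omp_is_initial_device() returns .true. when executed on the host,
        ! so if the target region falls back, on_host comes back .true.
        on_host = .true.
        !$omp target map(from:on_host)
        on_host = omp_is_initial_device()
        !$omp end target

        if (on_host) then
                write(*,*) "target region fell back to the host"
        else
                write(*,*) "target region ran on the device"
        end if
end program check_offload

Compiled with -mp=gpu, this should report that the region ran on the device when offload actually happens, and that it fell back to the host otherwise.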

Setting “MANDATORY” forces offload, hence the error. Now the question is why it’s erroring.

What do we know so far:

  1. The correct CUDA driver is getting picked up
  2. The code works fine in the container, so it’s likely an environment issue
  3. You are running on an older device, but if that were an issue, then the container environment would fail as well, so unlikely to be the problem.
  4. The Windows CUDA driver uses WDDM mode for GTX devices, which doesn’t allow for compute, only graphics; compute is only available with TCC mode. This could be a cause, but since the OpenACC and CUDA Fortran versions work, it’s less likely, as those would fail as well if this were the case.

Can you run “nvidia-smi” from your command line?

This will show us what mode the driver is in. Also, I’m wondering if the device 0 that it’s picking up isn’t the GTX1650 card, but a different device which can’t run compute. The device enumeration could differ between OpenMP and OpenACC (depends on how the CUDA driver presents it to each runtime).

If indeed nvidia-smi shows a second device, let’s try setting the environment variable “CUDA_VISIBLE_DEVICES” to the device number of the GTX1650 shown in the output.
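For example, if the GTX1650 had shown up as device 1 (hypothetical numbering), that would look like:

$ CUDA_VISIBLE_DEVICES=1 ./a.out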

Here’s the nvidia-smi output when running from cmd.exe in Windows:

Fri Feb  7 08:15:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 572.16                 Driver Version: 572.16         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1650      WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8              4W /   50W |     220MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3880    C+G   ...5n1h2txyewy\TextInputHost.exe      N/A      |
|    0   N/A  N/A            8092    C+G   ...ll Peripheral Manager\DPM.exe      N/A      |
|    0   N/A  N/A            8508    C+G   ...64__zpdnekdrzrea0\Spotify.exe      N/A      |
|    0   N/A  N/A           12728    C+G   ...ocal\Programs\zulip\Zulip.exe      N/A      |
|    0   N/A  N/A           14812    C+G   ...ams\breaktimer\BreakTimer.exe      N/A      |
|    0   N/A  N/A           15548    C+G   ...8bbwe\PhoneExperienceHost.exe      N/A      |
|    0   N/A  N/A           16192    C+G   ...m\103.0.3.0\GoogleDriveFS.exe      N/A      |
|    0   N/A  N/A           17284    C+G   ...h_cw5n1h2txyewy\SearchApp.exe      N/A      |
|    0   N/A  N/A           22876    C+G   ...App_cw5n1h2txyewy\LockApp.exe      N/A      |
|    0   N/A  N/A           23456    C+G   ...ekyb3d8bbwe\CalculatorApp.exe      N/A      |
+-----------------------------------------------------------------------------------------+

If I try setting it to TCC:

$ nvidia-smi -dm 1
Unable to set driver model for GPU 00000000:01:00.0: Unknown Error

and setting it to WDDM:

$ nvidia-smi -dm 0
Driver model is already set to WDDM for GPU 00000000:01:00.0.

WDDM could explain the problem, though I haven’t worked on Windows in many years. Back then, at least, you couldn’t do compute through WDDM, but this guide to using GPUs on WSL2 talks about GPU pass-through with WDDM and, in a different section, about using CUDA. This implies to me that it might be OK now with Pascal or newer devices, but I’m not really sure.

Wish I could be more helpful, but the NVHPC SDK doesn’t officially support WSL2, and by that I mean we don’t test the compilers on WSL2. I’ve had many users use it without issue (it is just Linux), but I can’t say for sure there aren’t problems, especially when there’s interaction with the Windows side.

I’d say stick with OpenACC for now, though before you do, double check that it really is using the device and not falling back to the host as well. Can you run the OpenACC version with “NV_ACC_NOTIFY=3” set in the environment? This will show both the data movement and the kernel launches. If nothing prints, then it’s running on the host.

Thanks Mat,

Looks like similar OpenACC code does run on the GPU.

The code:

program main

        implicit none
        integer, parameter:: ma=10
        double precision:: a(ma), b(ma)
        integer:: i

        a(:) = 0.d0
        b(:) = [(i, i=1,ma)]

        write(*,"(5(f14.7,1x))") b

        !$acc parallel loop
        do i = 1, ma
                a(i) = b(i)
        end do

        write(*,"(5(f14.7,1x))") a
end program main

Commands to compile and run:

nvfortran -target=gpu -mp=gpu -Minfo=all -acc test.f90
NV_ACC_NOTIFY=3 ./a.out

output:

     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000
upload CUDA data  file=/home/edwardy/test.f90 function=main line=16 device=0 threadid=1 variable=b(:) bytes=80
launch CUDA kernel  file=/home/edwardy/test.f90 function=main line=16 device=0 threadid=1 num_gangs=1 num_workers=1 vector_length=32 grid=1 block=32
download CUDA data  file=/home/edwardy/test.f90 function=main line=23 device=0 threadid=1 variable=a(:) bytes=80
     1.0000000      2.0000000      3.0000000      4.0000000      5.0000000
     6.0000000      7.0000000      8.0000000      9.0000000     10.0000000

I’ve managed to get access to machines with native Linux, and I’m using OpenMP offload on them. But it’s a shame I can’t do the dev locally :(.