NVFortran OpenMP offloading to multiple GPUs

I am exploring different options for parallelising a piece of code, using a small matrix-matrix multiplication routine as an example. One of the options I am trying is to use OpenMP to offload the multiplication to multiple GPUs. I have a piece of code that works in the sense that it uses multiple GPUs and produces the correct answer, but it does not use them at the same time: it uses one GPU after the other. I don't seem to be able to do the offloading in such a way that multiple GPUs participate simultaneously. I assume I am doing something wrong and I am hoping someone can give me some pointers on how to fix this.

The subroutine in question, living in calc_mxm.F90, is

      subroutine calc_mxm(a,b,c,nsize)
!
!     Calculate C = C + A * B
!
      use omp_lib
      implicit none
      integer, parameter :: dp = selected_real_kind(15,3)
!
!     Input/Output Arguments
      integer, intent(in) :: nsize
      real(kind=dp), intent(in)    :: a(nsize,nsize)
      real(kind=dp), intent(in)    :: b(nsize,nsize)
      real(kind=dp), intent(inout) :: c(nsize,nsize)
!
!     Local variables
      integer :: k_dev     ! Loop over devices
      integer :: ijlo, ijhi
      integer :: ii, jj, kk
      integer :: num_per_thread
      integer :: max_threads
      integer :: num_devices
      num_devices = omp_get_num_devices()
      write(*,*)" num_devices = ",num_devices
      max_threads = omp_get_max_threads()
!
      num_per_thread = (nsize-1+num_devices)/num_devices
      do k_dev = 0, num_devices-1
        ijlo = k_dev*num_per_thread+1
        ijhi = min((k_dev+1)*num_per_thread,nsize)
!$omp   target device(k_dev) map(tofrom: c(:,ijlo:ijhi)) map(to: a, b(:,ijlo:ijhi), ijlo, ijhi, nsize) nowait
!$omp   teams distribute parallel do collapse(2) default(shared) firstprivate(ijlo, ijhi, nsize) private(jj,ii,kk)
        do jj = ijlo, ijhi
          do ii = 1, nsize
            do kk = 1, nsize
              c(ii,jj) = c(ii,jj) + a(ii,kk)* &
                                    b(kk,jj)
            enddo
          enddo
        enddo
!$omp   end teams distribute parallel do
!$omp   end target
      end do
!$omp taskwait
!
      end subroutine calc_mxm

The program that uses this routine sits in mxm.F and reads

      program mxm
!
!     Simple program to compute C = A * B
!
      implicit none
      integer, parameter :: dp = selected_real_kind(15,3)
      integer :: nsize ! the dimension of the matrices
!
      logical :: overify
!
      real(kind=dp) :: c_min, c_max
!
      real(kind=dp), allocatable :: a(:,:)
      real(kind=dp), allocatable :: b(:,:)
      real(kind=dp), allocatable :: c(:,:)
!
      call read_size(nsize,overify)
      write(*,*)" size: ",nsize," verify: ",overify
!
!     Define the problem
!     If (overify) set C = -A*B
!
      allocate(a(nsize,nsize))
      allocate(b(nsize,nsize))
      allocate(c(nsize,nsize))
      call init_data(nsize,a,b,c,overify)
!
!     Calculate C = C + A*B
!
      call calc_mxm(a,b,c,nsize)
!
!     Check whether C = 0
!
      if (overify) then
        c_min = minval(c)
        c_max = maxval(c)
        write(*,*)"c min,max = ",c_min,c_max
      endif
      deallocate(c)
      deallocate(b)
      deallocate(a)
!
      end program mxm
!
!-----------------------------------------------------------------------
!
      subroutine read_size(nsize,overify)
      implicit none
      integer, intent(out) :: nsize
      logical, intent(out) :: overify
      character(len=12) :: arg
      integer :: narg
      integer :: ierror
!
      narg = command_argument_count()
      if (.not.(narg.eq.1.or.narg.eq.2)) then
        write(*,*)" The program expects 1 command line argument"
        write(*,*)" specifying the dimension of the matrices."
        write(*,*)" It got ",narg," arguments."
        stop 10
      endif
      call get_command_argument(1,arg)
      read(arg,"(i12)")nsize
      overify = .false.
      if (narg.eq.2) then
        call get_command_argument(2,arg)
        read(arg,"(l8)")overify
      endif
!
      end subroutine read_size
!
!-----------------------------------------------------------------------
!
      subroutine init_data(nsize,a_all,b_all,c_all,overify)
      implicit none
      integer, intent(in) :: nsize
      logical, intent(in) :: overify
      integer, parameter :: dp = selected_real_kind(15)
      real(kind=dp), intent(out) :: a_all(nsize,nsize)
      real(kind=dp), intent(out) :: b_all(nsize,nsize)
      real(kind=dp), intent(out) :: c_all(nsize,nsize)
!
      integer :: ierror
!
      c_all = 0.0_dp
      call random_number(a_all)
      call random_number(b_all)
      ! Convert random numbers to range [-1,1] using matrix notation
      a_all = 2.0_dp * a_all - 1.0_dp
      b_all = 2.0_dp * b_all - 1.0_dp
      if (overify) then
        c_all = -matmul(a_all,b_all)
      endif
      end subroutine init_data

I am compiling the code with the commands

nvfortran -Minstrument -Minfo -Mcache_align -c -mp=gpu -O2 -g mxm.F -o mxm.o -traceback -lnvhpcwrapnvtx
nvfortran -Minstrument -Minfo=mp -Mcache_align -c -mp=gpu -O2 -g calc_mxm.F90 -o calc_mxm.o -traceback -lnvhpcwrapnvtx
nvfortran -Minstrument -Minfo -Mcache_align -mp=gpu -O2 -g mxm.o calc_mxm.o -o mpi-mxm -traceback -lnvhpcwrapnvtx

The nvfortran compiler I am using is

$ nvfortran --version

nvfortran 25.7-0 64-bit target on x86-64 Linux -tp sapphirerapids 
NVIDIA Compilers and Tools
Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

I am running the code as

nsys profile --trace=openmp,cuda,nvtx ./mpi-mxm 1600 T
nsys profile --trace=openmp,cuda,nvtx ./mpi-mxm 16000

The first run doesn't take long and verifies that the results are correct. The second run does much more compute and produces a sensible profile. The matrix dimensions are chosen so that every GPU operates on an integer number of cache lines, to avoid any false sharing of data.
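For example (assuming 8-byte double precision reals and 64-byte cache lines), with nsize = 16000 and the 4 GPUs listed below, num_per_thread = (16000-1+4)/4 = 4000, so device k works on columns 4000*k+1 through 4000*(k+1). Each column of c occupies 16000 × 8 B = 128000 B, i.e. 2000 full 64-byte cache lines, so the block boundaries between devices fall on column (and hence cache-line) boundaries and no two GPUs write to the same line.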

About the GPUs nvidia-smi says:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:27:00.0 Off |                    0 |
| N/A   30C    P0             50W /  310W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  |   00000000:38:00.0 Off |                    0 |
| N/A   29C    P0             50W /  310W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 PCIe               On  |   00000000:98:00.0 Off |                    0 |
| N/A   30C    P0             49W /  310W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 PCIe               On  |   00000000:C8:00.0 Off |                    0 |
| N/A   29C    P0             48W /  310W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Any suggestions are greatly appreciated.

Thank you, Huub

The issue is that you are scheduling work serially.
If you add a simple

!$omp parallel do

before the loop on the devices

do k_dev = 0, num_devices-1

you will see concurrency:
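In outline, the change sits around the existing device loop like this (just a sketch: here the nowait and taskwait are dropped, since each host thread now blocks only on its own target region; the complete calc_mxm.F90 I am using is further down in this thread):

      num_per_thread = (nsize-1+num_devices)/num_devices
!$omp parallel do private(ijlo, ijhi)
      do k_dev = 0, num_devices-1
        ijlo = k_dev*num_per_thread+1
        ijhi = min((k_dev+1)*num_per_thread,nsize)
        ! Each host thread drives one device; the target region itself is unchanged
!$omp   target device(k_dev) map(tofrom: c(:,ijlo:ijhi)) map(to: a, b(:,ijlo:ijhi), ijlo, ijhi, nsize)
!$omp   teams distribute parallel do collapse(2) private(jj,ii,kk)
        do jj = ijlo, ijhi
          do ii = 1, nsize
            do kk = 1, nsize
              c(ii,jj) = c(ii,jj) + a(ii,kk)*b(kk,jj)
            enddo
          enddo
        enddo
!$omp   end teams distribute parallel do
!$omp   end target
      end do
!$omp end parallel do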

Interesting. I did try this at one point, as it seemed like the most reasonable approach. It would essentially be the same as using MPI, but with a thread as the context instead of an MPI process. For some reason I didn't obtain the result you are seeing. Maybe I messed something up. I'll try this again.

In the meantime I have also found a different solution, but it does not overlap the communication as nicely. The code for that is

      subroutine calc_mxm(a,b,c,nsize)
!
!     Calculate C = C + A * B
!
      use omp_lib
      implicit none
      integer, parameter :: dp = selected_real_kind(15,3)
!
!     Input/Output Arguments
      integer, intent(in) :: nsize
      real(kind=dp), intent(in)    :: a(nsize,nsize)
      real(kind=dp), intent(in)    :: b(nsize,nsize)
      real(kind=dp), intent(inout) :: c(nsize,nsize)
!
!     Local variables
      integer :: k_dev     ! Loop over devices
      integer :: ii, jj, kk
      integer :: num_per_thread
      integer :: num_devices
      integer, allocatable :: ijlo(:), ijhi(:)
      num_devices = omp_get_num_devices()
      write(*,*)" num_devices = ",num_devices
      allocate(ijlo(0:num_devices-1))
      allocate(ijhi(0:num_devices-1))
!
      num_per_thread = (nsize-1+num_devices)/num_devices
      do k_dev = 0, num_devices-1
        ijlo(k_dev) = k_dev*num_per_thread+1
        ijhi(k_dev) = min((k_dev+1)*num_per_thread,nsize)
      enddo
      do k_dev = 0, num_devices-1
!$omp   target enter data device(k_dev) map(to: c(:,ijlo(k_dev):ijhi(k_dev)), a, b(:,ijlo(k_dev):ijhi(k_dev)), ijlo, ijhi, nsize) nowait
      end do
!$omp taskwait
      do k_dev = 0, num_devices-1
!$omp   target device(k_dev) map(to: c(:,ijlo(k_dev):ijhi(k_dev)), a, b(:,ijlo(k_dev):ijhi(k_dev)), ijlo, ijhi, nsize)  nowait
!$omp   teams distribute parallel do collapse(2)
        do jj = ijlo(k_dev), ijhi(k_dev)
          do ii = 1, nsize
            do kk = 1, nsize
              c(ii,jj) = c(ii,jj) + a(ii,kk)* &
                                    b(kk,jj)
            enddo
          enddo
        enddo
!$omp   end teams distribute parallel do
!$omp   end target
      enddo
!$omp taskwait
      do k_dev = 0, num_devices-1
!$omp   target exit data device(k_dev) map(from: c(:,ijlo(k_dev):ijhi(k_dev))) nowait
      end do
!$omp taskwait
      deallocate(ijlo)
      deallocate(ijhi)
!
      end subroutine calc_mxm

The performance profile shows that the communication with the GPUs ends up happening serially, which costs a lot of extra time. Thanks for the tip, let me try parallel threads again.

I have now tried using parallel do on the loop over devices, but as I saw before I am not getting the result you are seeing: in my profile the compute kernels still do not overlap.

The issue seems to be the map(tofrom: c(:,ijlo:ijhi)) clause: if I change it from tofrom to to, the compute kernels are no longer held back, but of course the results are then wrong because I don't get the updates to c back. It seems that as soon as one of the kernels changes c, the other threads get signalled that the data might be changing and that they should therefore wait. Is there an OpenMP setting or other configuration option I could use to tell the runtime to relax its consistency checks and let me deal with any potential race conditions myself?

The code that produced the above performance characteristics is

      subroutine calc_mxm(a,b,c,nsize)
!
!     Calculate C = C + A * B
!
      use omp_lib
      implicit none
      integer, parameter :: dp = selected_real_kind(15,3)
!
!     Input/Output Arguments
      integer, intent(in) :: nsize
      real(kind=dp), intent(in)    :: a(nsize,nsize)
      real(kind=dp), intent(in)    :: b(nsize,nsize)
      real(kind=dp), intent(inout) :: c(nsize,nsize)
!
!     Local variables
      integer :: k_dev     ! Loop over devices
      integer :: ii, jj, kk
      integer :: num_per_thread
      integer :: num_devices
      integer :: max_threads
      integer :: ijlo, ijhi
      num_devices = omp_get_num_devices()
      max_threads = omp_get_max_threads()
      write(*,*)" num_devices = ",num_devices
      write(*,*)" max_threads = ",max_threads
      if (max_threads.lt.num_devices) then
        write(*,*)" ERROR: too few threads available for the number of devices!"
        stop 200
      endif
!
      num_per_thread = (nsize-1+num_devices)/num_devices
!$omp parallel do private(ijlo, ijhi)
      do k_dev = 0, num_devices-1
        ijlo = k_dev*num_per_thread+1
        ijhi = min((k_dev+1)*num_per_thread,nsize)
!$omp   target device(k_dev) map(tofrom: c(:,ijlo:ijhi)) map(to: a, b(:,ijlo:ijhi), ijlo, ijhi, nsize)
!$omp   teams distribute parallel do collapse(2)
        do jj = ijlo, ijhi
          do ii = 1, nsize
            do kk = 1, nsize
              c(ii,jj) = c(ii,jj) + a(ii,kk)* &
                                    b(kk,jj)
            enddo
          enddo
        enddo
!$omp   end teams distribute parallel do
!$omp   end target
      enddo
!
      end subroutine calc_mxm

Do you have both -mp and -mp=gpu when you compile?

You should see this compiler feedback:
nvfortran -Minstrument -Minfo=mp -Mcache_align -c -mp -mp=gpu -O2 -g calc_mxm.F90 -o calc_mxm.o -traceback -lnvhpcwrapnvtx

calc_mxm:

 27, !$omp parallel

 31, !$omp target teams distribute parallel do

     31, Generating "nvkernel_calc_mxm__F1L27_1__F1L31_3" GPU kernel

 45, Taskwait

Compiling the code with both -mp and -mp=gpu did not change the performance characteristics, but I did see messages like the ones you showed, although for me there is an additional line:

calc_mxm:
     34, !$omp parallel
     35, Loop not vectorized/parallelized: contains structure move
     39, !$omp target teams distribute parallel do
         39, Generating "nvkernel_calc_mxm__F1L34_1__F1L39_3" GPU kernel
     39, Generating map(to:b(:,ijlo:ijhi),a(:,:),ijhi) 
         Generating map(tofrom:c(:,ijlo:ijhi)) 
         Generating map(to:nsize,ijlo) 
nvfortran-Warning-No gpu detected, compiling for all supported compute capabilities. Debug information for device code is not yet available when combining Blackwell (cc100+) with earlier compute capabilities; the compiler has added '-gpu=nodebug' to allow your compilation to succeed.

In particular the message

     35, Loop not vectorized/parallelized: contains structure move

seems worrying. Note that the machine I am using is an HPC cluster. Only the compute nodes have GPUs installed; the login nodes I am compiling on don't have any. I also cannot compile on the compute nodes right now: they are diskless machines and the OS image doesn't have libatomic.so.1.2.0 installed, while the linker explicitly wants /usr/lib64/libatomic.so.1.2.0, so I cannot even provide the library in the compilation directory and have it picked up from there. If I compile on the login nodes the linker is happy with just libatomic.so.1, which I have copied into the compilation directory, and that works fine.

In any case, because no GPU is detected during compilation, the compiler generates code for a much broader range of architectures. I don't know whether that changes the generated code in any significant way.

Another thing I have tried is to change the code a little by adding target enter data and target exit data constructs, turning it into this:

      subroutine calc_mxm(a,b,c,nsize)
!
!     Calculate C = C + A * B
!
      use omp_lib
      implicit none
      integer, parameter :: dp = selected_real_kind(15,3)
!
!     Input/Output Arguments
      integer, intent(in) :: nsize
      real(kind=dp), intent(in)    :: a(nsize,nsize)
      real(kind=dp), intent(in)    :: b(nsize,nsize)
      real(kind=dp), intent(inout) :: c(nsize,nsize)
!
!     Local variables
      integer :: k_dev     ! Loop over devices
      integer :: ii, jj, kk
      integer :: num_per_thread
      integer :: num_devices
      integer :: max_threads
      integer :: ijlo, ijhi
      num_devices = omp_get_num_devices()
      max_threads = omp_get_max_threads()
      write(*,*)" num_devices = ",num_devices
      write(*,*)" max_threads = ",max_threads
      if (max_threads.lt.num_devices) then
        write(*,*)" ERROR: too few threads available for the number of devices!"
        stop 200
      endif
!
      num_per_thread = (nsize-1+num_devices)/num_devices
!$omp parallel do private(ijlo, ijhi)
      do k_dev = 0, num_devices-1
        ijlo = k_dev*num_per_thread+1
        ijhi = min((k_dev+1)*num_per_thread,nsize)
!$omp   target enter data device(k_dev) map(to: c(:,ijlo:ijhi), a, b(:,ijlo:ijhi), ijlo, ijhi, nsize)
!$omp   target device(k_dev) map(to: c(:,ijlo:ijhi)) map(to: a, b(:,ijlo:ijhi), ijlo, ijhi, nsize)
!$omp   teams distribute parallel do collapse(2)
        do jj = ijlo, ijhi
          do ii = 1, nsize
            do kk = 1, nsize
              c(ii,jj) = c(ii,jj) + a(ii,kk)* &
                                    b(kk,jj)
            enddo
          enddo
        enddo
!$omp   end teams distribute parallel do
!$omp   end target
!$omp   target exit data device(k_dev) map(from: c(:,ijlo:ijhi))
      enddo
!
      end subroutine calc_mxm

And compile this with

nvfortran -Minstrument -Minfo=accel,inline,ftn,loop,lre,mp,opt,par,vect,stdpar -Mcache_align -c -mp -mp=gpu -O2 -g calc_mxm.F90 -o calc_mxm.o

But now I get an error message

nvfortran-Warning-No gpu detected, compiling for all supported compute capabilities. Debug information for device code is not yet available when combining Blackwell (cc100+) with earlier compute capabilities; the compiler has added '-gpu=nodebug' to allow your compilation to succeed.
NVFORTRAN-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected name table type (calc_mxm.F90)
NVFORTRAN/x86-64 Linux 25.7-0: compilation aborted

The "Unexpected name table type" error seems to be a compiler issue.

Further experiments to avoid this message led to the idea that the problem might be caused by moving array sections between the host and the device. Using a thread-local allocatable array for each device indeed removes this message. However, the challenge of the compute kernels serialising remains.

The code is:

      subroutine calc_mxm(a,b,c,nsize)
!
!     Calculate C = C + A * B
!
      use omp_lib
      implicit none
      integer, parameter :: dp = selected_real_kind(15,3)
!
!     Input/Output Arguments
      integer, intent(in) :: nsize
      real(kind=dp), intent(in)    :: a(nsize,nsize)
      real(kind=dp), intent(in)    :: b(nsize,nsize)
      real(kind=dp), intent(inout) :: c(nsize,nsize)
!
!     Local variables
      integer :: k_dev     ! Loop over devices
      integer :: ii, jj, kk
      integer :: num_per_thread
      integer :: num_devices
      integer :: max_threads
      integer :: ijlo, ijhi
      real(kind=dp), allocatable :: cc(:,:)
      num_devices = omp_get_num_devices()
      max_threads = omp_get_max_threads()
      write(*,*)" num_devices = ",num_devices
      write(*,*)" max_threads = ",max_threads
      if (max_threads.lt.num_devices) then
        write(*,*)" ERROR: too few threads available for the number of devices!"
        stop 200
      endif
!
      num_per_thread = (nsize-1+num_devices)/num_devices
!$omp parallel do private(ijlo, ijhi, cc)
      do k_dev = 0, num_devices-1
        ijlo = k_dev*num_per_thread+1
        ijhi = min((k_dev+1)*num_per_thread,nsize)
        ! Give each thread its own piece of memory to work with
        allocate(cc(1:nsize,ijlo:ijhi))
        cc = c(:,ijlo:ijhi)
        ! Now move entire arrays to the device to avoid the "contains structure move" issue
!$omp   target device(k_dev) map(tofrom: cc) map(to: a, b, ijlo, ijhi, nsize)
!$omp   teams distribute parallel do collapse(2)
        do jj = ijlo, ijhi
          do ii = 1, nsize
            do kk = 1, nsize
              cc(ii,jj) = cc(ii,jj) + a(ii,kk)* &
                                      b(kk,jj)
            enddo
          enddo
        enddo
!$omp   end teams distribute parallel do
!$omp   end target
        c(:,ijlo:ijhi) = cc
        deallocate(cc)
      enddo
!
      end subroutine calc_mxm

This code compiles generating the following compiler output:

$ nvfortran -Minstrument -Minfo=accel,inline,ftn,loop,lre,mp,opt,par,vect,stdpar -Mcache_align -c -mp -mp=gpu -O2 -g calc_mxm.F90 -o calc_mxm.o -traceback -lnvhpcwrapnvtx
nvfortran-Warning-No gpu detected, compiling for all supported compute capabilities. Debug information for device code is not yet available when combining Blackwell (cc100+) with earlier compute capabilities; the compiler has added '-gpu=nodebug' to allow your compilation to succeed.
calc_mxm:
     35, !$omp parallel
     41, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
     43, !$omp target teams distribute parallel do
         43, Generating "nvkernel_calc_mxm__F1L35_1__F1L43_3" GPU kernel
     43, Generating map(to:b(:,:),a(:,:),ijhi) 
         Generating map(tofrom:cc(:,:)) 
         Generating map(to:nsize,ijlo) 
     55, Recognized memory copy idiom

This demonstrates that the "contains structure move" message is gone. The nsys profile, however, still shows that 2 GPU kernels are being blocked while the other kernels run. The strange thing is that, because every thread has its own copy of cc, the map(tofrom: cc) clause no longer operates on a common array, which removes any reason for such serialisation at all.

You can specify the compute capability you are targeting with the option -gpu=ccXX (V100 is cc70, A100 is cc80, H100 is cc90). So in your case it would be -gpu=cc90.
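For example, added to the compile line used earlier for calc_mxm.F90 (everything else unchanged):

nvfortran -Minstrument -Minfo=mp -Mcache_align -c -mp -mp=gpu -gpu=cc90 -O2 -g calc_mxm.F90 -o calc_mxm.o -traceback -lnvhpcwrapnvtx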

Cool! That means that I don’t have to consider submitting compilation jobs anymore, and it seems the code also got a bit faster.

I checked with 25.7 and H100, still getting overlap:

This is the calc_mxm.F90 I am using:

      subroutine calc_mxm(a,b,c,nsize)
!
!     Calculate C = C + A * B
!
      use omp_lib
      implicit none
      integer, parameter :: dp = selected_real_kind(15,3)
!
!     Input/Output Arguments
      integer, intent(in) :: nsize
      real(kind=dp), intent(in)    :: a(nsize,nsize)
      real(kind=dp), intent(in)    :: b(nsize,nsize)
      real(kind=dp), intent(inout) :: c(nsize,nsize)
!
!     Local variables
      integer :: k_dev     ! Loop over devices
      integer :: ijlo, ijhi
      integer :: ii, jj, kk
      integer :: num_per_thread
      integer :: max_threads
      integer :: num_devices
      num_devices = omp_get_num_devices()
      write(*,*)" num_devices = ",num_devices
      max_threads = omp_get_max_threads()
!
      num_per_thread = (nsize-1+num_devices)/num_devices
      !$omp parallel do
      do k_dev = 0, num_devices-1
        ijlo = k_dev*num_per_thread+1
        ijhi = min((k_dev+1)*num_per_thread,nsize)
!$omp   target device(k_dev) map(tofrom: c(:,ijlo:ijhi)) map(to: a, b(:,ijlo:ijhi), ijlo, ijhi, nsize) nowait
!$omp   teams distribute parallel do collapse(2) default(shared) firstprivate(ijlo, ijhi, nsize) private(jj,ii,kk)
        do jj = ijlo, ijhi
          do ii = 1, nsize
            do kk = 1, nsize
              c(ii,jj) = c(ii,jj) + a(ii,kk)* &
                                    b(kk,jj)
            enddo
          enddo
        enddo
!$omp   end teams distribute parallel do
!$omp   end target
      end do
      !$omp end parallel do
!$omp taskwait
!
      end subroutine calc_mxm

Are you sure you have enough threads on the compute node? Depending on how you schedule your job, you may have only a few cores exposed on the node. You can check with "numactl --show".
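For instance, assuming the 4 GPUs from the nvidia-smi output above, you could request at least one host thread per device explicitly before running:

export OMP_NUM_THREADS=4
nsys profile --trace=openmp,cuda,nvtx ./mpi-mxm 16000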

Sorry it took me a few days, but I have now tested this solution and I am also getting good performance and correct results. This is great!

Thanks for your help!