I am exploring different options for a parallelising a piece of code. As an example I am using a little matrix-matrix multiplication routine. One of the options I am trying is to use OpenMP to offload the multiplication to multiple GPUs. I have a piece of code that works in the sense that it uses multiple GPUs and it produces the correct answer, but it doesn’t use multiple GPUs at the same time. Instead it uses one GPU after the other. I don’t seem to be able to do the offloading in such a way that multiple GPUs participate simultaneously. I am assuming I am doing something wrong and I am hoping that someone could give some pointers as to how to fix this.
The subroutine in question, living in calc_mxm.F90, is
subroutine calc_mxm(a,b,c,nsize)
!
! Calculate C = C + A * B
!
use omp_lib
implicit none
integer, parameter :: dp = selected_real_kind(15,3)
!
! Input/Output Arguments
integer, intent(in) :: nsize
real(kind=dp), intent(in) :: a(nsize,nsize)
real(kind=dp), intent(in) :: b(nsize,nsize)
real(kind=dp), intent(inout) :: c(nsize,nsize)
!
! Local variables
integer :: k_dev ! Loop over devices
integer :: ijlo, ijhi
integer :: ii, jj, kk
integer :: num_per_thread
integer :: max_threads
integer :: num_devices
num_devices = omp_get_num_devices()
write(*,*)" num_devices = ",num_devices
max_threads = omp_get_max_threads()
!
num_per_thread = (nsize-1+num_devices)/num_devices
do k_dev = 0, num_devices-1
ijlo = k_dev*num_per_thread+1
ijhi = min((k_dev+1)*num_per_thread,nsize)
!$omp target device(k_dev) map(tofrom: c(:,ijlo:ijhi)) map(to: a, b(:,ijlo:ijhi), ijlo, ijhi, nsize) nowait
!$omp teams distribute parallel do collapse(2) default(shared) firstprivate(ijlo, ijhi, nsize) private(jj,ii,kk)
do jj = ijlo, ijhi
do ii = 1, nsize
do kk = 1, nsize
c(ii,jj) = c(ii,jj) + a(ii,kk)* &
b(kk,jj)
enddo
enddo
enddo
!$omp end teams distribute parallel do
!$omp end target
end do
!$omp taskwait
!
end subroutine calc_mxm
The program that uses this routine sits in mxm.F and reads
program mxm
!
! Simple program to comput C = A * B
!
implicit none
integer, parameter :: dp = selected_real_kind(15,3)
integer :: nsize ! the dimension of the matrices
!
logical :: overify
!
real(kind=dp) :: c_min, c_max
!
real(kind=dp), allocatable :: a(:,:)
real(kind=dp), allocatable :: b(:,:)
real(kind=dp), allocatable :: c(:,:)
!
call read_size(nsize,overify)
write(*,*)" size: ",nsize," verify: ",overify
!
! Define the problem
! If (overify) set C = -A*B
!
allocate(a(nsize,nsize))
allocate(b(nsize,nsize))
allocate(c(nsize,nsize))
call init_data(nsize,a,b,c,overify)
!
! Calculate C = C + A*B
!
call calc_mxm(a,b,c,nsize)
!
! Check whether C = 0
!
if (overify) then
c_min = minval(c)
c_max = maxval(c)
write(*,*)"c min,max = ",c_min,c_max
endif
deallocate(c)
deallocate(b)
deallocate(a)
!
end program mxm
!
!-----------------------------------------------------------------------
!
subroutine read_size(nsize,overify)
implicit none
integer, intent(out) :: nsize
logical, intent(out) :: overify
character(len=12) :: arg
integer :: narg
integer :: ierror
!
narg = command_argument_count()
if (.not.(narg.eq.1.or.narg.eq.2)) then
write(*,*)" The program expects 1 command line argument"
write(*,*)" specifying the dimension of the matrices."
write(*,*)" It got ",narg," arguments."
stop 10
endif
call get_command_argument(1,arg)
read(arg,"(i12)")nsize
overify = .false.
if (narg.eq.2) then
call get_command_argument(2,arg)
read(arg,"(l8)")overify
endif
!
end subroutine read_size
!
!-----------------------------------------------------------------------
!
subroutine init_data(nsize,a_all,b_all,c_all,overify)
implicit none
integer, intent(in) :: nsize
logical, intent(in) :: overify
integer, parameter :: dp = selected_real_kind(15)
real(kind=dp), intent(out) :: a_all(nsize,nsize)
real(kind=dp), intent(out) :: b_all(nsize,nsize)
real(kind=dp), intent(out) :: c_all(nsize,nsize)
!
integer :: ierror
!
c_all = 0.0_dp
call random_number(a_all)
call random_number(b_all)
! Convert random numbers to range [-1,1] using matrix notation
a_all = 2.0_dp * a_all - 1.0_dp
b_all = 2.0_dp * b_all - 1.0_dp
if (overify) then
c_all = -matmul(a_all,b_all)
endif
end subroutine init_data
I am compiling the code with the commands
nvfortran -Minstrument -Minfo -Mcache_align -c -mp=gpu -O2 -g mxm.F -o mxm.o -traceback -lnvhpcwrapnvtx
nvfortran -Minstrument -Minfo=mp -Mcache_align -c -mp=gpu -O2 -g calc_mxm.F90 -o calc_mxm.o -traceback -lnvhpcwrapnvtx
nvfortran -Minstrument -Minfo -Mcache_align -mp=gpu -O2 -g mxm.o calc_mxm.o -o mpi-mxm -traceback -lnvhpcwrapnvtx
The nvfortran compiler I am using is
$ nvfortran --version
nvfortran 25.7-0 64-bit target on x86-64 Linux -tp sapphirerapids
NVIDIA Compilers and Tools
Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
I am running the code as
nsys profile --trace=openmp,cuda,nvtx ./mpi-mxm 1600 T
nsys profile --trace=openmp,cuda,nvtx ./mpi-mxm 16000
The first time doesn’t take long and verifies whether the results are correct. The second does much more compute and produces a sensible profile. The matrix dimensions are chosen so that every GPU operates on an integer number of cache lines to avoid any false data sharing.
About the GPUs nvidia-smi says:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe On | 00000000:27:00.0 Off | 0 |
| N/A 30C P0 50W / 310W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 PCIe On | 00000000:38:00.0 Off | 0 |
| N/A 29C P0 50W / 310W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 PCIe On | 00000000:98:00.0 Off | 0 |
| N/A 30C P0 49W / 310W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 PCIe On | 00000000:C8:00.0 Off | 0 |
| N/A 29C P0 48W / 310W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
Any suggestions are greatly appreciated.
Thank you, Huub




