When launching a program compiled with OpenMP offload on a compute node with multiple GPUs, the program always occupies a certain amount of memory on GPU 0, regardless of whether a device number is specified. For example:
program matrix_multiply
   use omp_lib
   use openacc
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs
   real :: temp2(5000)

   m = 3
   n = 1000*2**(m-1)
   allocate( a(n,n), b(n,n), c(n,n) )

   ! Initialize the input matrices on the host.
   do j = 1, n
      do i = 1, n
         a(i,j) = real(i + j)
         b(i,j) = real(i - j)
      enddo
   enddo

   ! The same loop nest carries both directive sets: the !$acc lines are active
   ! in the OpenACC build and the !$omp lines in the -mp=gpu build.
   ! Both versions explicitly request device 1.
   !$acc set device_num(1)
   !$omp target teams distribute collapse(2) private(temp2) device(1)
   !$acc data copyin(a,b) copy(c)
   !$acc parallel loop gang vector collapse(2) private(temp2)
   do j = 1, n
      do i = 1, n
         tmp = 0.0
         !$omp parallel do
         do k = 1, 5000
            temp2(k) = 0.0
         enddo
         !$acc loop seq
         !$omp parallel do reduction(+:tmp)
         do k = 1, n
            tmp = tmp + a(i,k) * b(k,j)
         enddo
         c(i,j) = tmp
         c(i,j) = temp2(i)
      enddo
   enddo
   !$acc end data

   deallocate(a, b, c)
end program matrix_multiply
When this code is compiled with -mp=gpu and run on a node, GPU 1 shows about 600 MB of memory in use, but GPU 0 also has about 300 MB occupied. If the same code is compiled with OpenACC instead, GPU 0 shows zero memory used.
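For reference, the two builds in question were presumably produced along these lines (this assumes the NVIDIA HPC SDK compiler, since -mp=gpu is its offload flag; the source file name is only illustrative):

   nvfortran -mp=gpu -o mm_omp matrix_multiply.f90    ! OpenMP target offload build
   nvfortran -acc    -o mm_acc matrix_multiply.f90    ! OpenACC build

with the per-GPU memory read from nvidia-smi while each binary was running.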
This becomes a problem when writing an MPI program that uses multiple GPUs on a node, since the extra allocation on GPU 0 from every rank can lead to out-of-memory errors. A sketch of the intended usage pattern follows.
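To make the MPI concern concrete, here is a minimal sketch of the usual one-rank-per-GPU mapping (all names below, such as the program name and local variables, are illustrative and not part of the original reproducer; only standard MPI and OpenMP runtime calls are used):

program rank_to_gpu
   use mpi
   use omp_lib
   implicit none
   integer :: ierr, rank, ndev, mydev

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! Round-robin ranks over the devices on the node (for simplicity,
   ! the global rank modulo the device count).
   ndev = omp_get_num_devices()
   if (ndev > 0) then
      mydev = mod(rank, ndev)
      call omp_set_default_device(mydev)   ! this rank's target regions go to mydev
   end if

   ! ... offloaded work for this rank goes here ...
   ! The issue described above: even with the default device set,
   ! each rank still appears to reserve a few hundred MB on GPU 0,
   ! so GPU 0 fills up as the number of ranks per node grows.

   call MPI_Finalize(ierr)
end program rank_to_gpu

With, say, four ranks on a four-GPU node, each rank drives its own device, yet the roughly 300 MB per process observed on GPU 0 is multiplied by the number of ranks on that node.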