OpenMP Offload: additional memory usage on GPU 0 for code running on other GPUs

When launching a compiled program that uses OpenMP offload on a compute node with multiple GPUs, the program always occupies some memory on GPU 0, even when a different device number is specified. For example,

program matrix_multiply
   use omp_lib
   use openacc
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs
   real :: temp2(5000)


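   ! m is never assigned in the original post; m = 1 (n = 1000) is an assumed value
   m = 1
   n = 1000*2**(m-1)
   allocate( a(n,n), b(n,n), c(n,n) )

   ! Initialize the input matrices on the host
   do j=1,n
      do i=1,n
         a(i,j) = real(i + j)
         b(i,j) = real(i - j)
      end do
   end do

   ! Only one directive family is active per build: the !$acc lines with
   ! OpenACC enabled, the !$omp lines with -mp=gpu. Both target device 1.
!$acc set device_num(1)
!$omp target teams distribute collapse(2) private(temp2) device(1)
!$acc data copyin(a,b) copy(c)
!$acc parallel loop gang vector collapse(2) private(temp2)
   do j=1,n
      do i=1,n
         tmp = 0.0
!$omp parallel do
         do k=1,5000
            ! The body of this loop is missing from the original post;
            ! this assignment is a placeholder
            temp2(k) = 0.0
         end do
!$acc loop seq
!$omp parallel do reduction(+:tmp)
         do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
         end do
         c(i,j) = tmp
         ! Overwrites the product, as in the original post (valid while n <= 5000)
         c(i,j) = temp2(i)
      end do
   end do
!$acc end data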

   deallocate(a, b, c)

end program matrix_multiply

When this code is compiled with -mp=gpu and run on a node, GPU 1 shows 600MB of memory in use, but GPU 0 also has 300MB occupied. If it is compiled with OpenACC instead, GPU 0 has no memory in use.

This is a problem when writing an MPI program that uses multiple GPUs on a node, as the additional memory usage can lead to OOM.

This is a known issue and our engineers are working on it now. With OpenACC, context creation is delayed until the first construct is entered, and they are in the process of getting OpenMP target to match. While the extra context wastes memory, it shouldn't affect performance or correctness.

I went ahead and added a new issue report, TPR #33544, so engineering can use your example as another test case, and so I can let you know when the issue has been fixed in a release.

Now with MPI, you may still see this initial context being created on device 0. This depends on the MPI implementation, but often when CUDA-aware MPI is enabled, the context gets created when MPI_Init is called. The only workaround is to wrap your program in a shell script that sets the environment variable "CUDA_VISIBLE_DEVICES" to the local rank's device.
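
For reference, here is a minimal sketch of such a wrapper. It assumes one GPU per local rank and that GPU indices match local rank numbers. The script name select_gpu.sh and the function name pick_local_rank are hypothetical. The local-rank variables checked are the ones set by Open MPI, MVAPICH2, and Slurm; other launchers use different names, so adjust as needed.

```shell
#!/bin/bash
# select_gpu.sh (hypothetical name): restrict each MPI rank to a single GPU
# before the real program starts, so no context can be created on another device.

# Print the node-local rank from whichever launcher variable is set,
# falling back to 0 if none is found (assumption: single-rank run).
pick_local_rank() {
    echo "${OMPI_COMM_WORLD_LOCAL_RANK:-${MV2_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}}"
}

# Assumption: the GPU index equals the local rank (one GPU per rank).
CUDA_VISIBLE_DEVICES="$(pick_local_rank)"
export CUDA_VISIBLE_DEVICES

# Replace this shell with the wrapped program and its arguments, if any were given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

It would be launched as something like "mpirun -np 4 ./select_gpu.sh ./a.out". Note that once only one device is visible, that device is numbered 0 inside the process, so the program should then target device 0 (or the default device) rather than device(1).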



Hi goduck777,

FYI, we've updated OpenMP so that context creation is delayed until first use. For your program, the context will now be created only on device 1.