OpenMP Offload: additional memory usage on GPU 0 for code running on other GPUs

goduck777 · April 28, 2023, 11:45pm

When launching a program compiled with OpenMP offload on a compute node with multiple GPUs, the program always occupies a certain amount of memory on GPU 0, even when a different device number is specified. For example,

program matrix_multiply
   use omp_lib
   use openacc
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs
   real :: temp2(5000)


   m=3

   n = 1000*2**(m-1)
   allocate( a(n,n), b(n,n), c(n,n) )

   do j=1,n
      do i=1,n
         a(i,j) = real(i + j)
         b(i,j) = real(i - j)
      enddo
   enddo

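! The loop nest below carries both OpenACC and OpenMP directives;
! the compile flag (-acc or -mp=gpu) selects which set is active.
!$acc set device_num(1)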
!$omp target teams distribute collapse(2) private(temp2) device(1)
!$acc data copyin(a,b) copy(c)
!$acc parallel loop gang vector collapse(2) private(temp2)
   do j=1,n
      do i=1,n
         tmp = 0.0
!$omp parallel do
         do k=1,5000
            temp2(k)=0.
         enddo
!$acc loop seq
!$omp parallel do reduction(+:tmp)
         do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
         enddo
         c(i,j) = tmp
         c(i,j) = temp2(i)
      enddo
   enddo
!$acc end data

   deallocate(a, b, c)


end program matrix_multiply

When this code is compiled with -mp=gpu and run on a node, GPU 1 shows 600MB of memory in use, but GPU 0 also has 300MB occupied. When compiled with OpenACC, GPU 0 has no memory in use.

This is an issue when writing an MPI program that uses multiple GPUs on a node, as the additional memory usage can lead to out-of-memory (OOM) errors.

MatColgrove · May 1, 2023, 3:57pm

This is a known issue and our engineers are working on it now. With OpenACC, context creation is delayed until the first construct is entered, and they are in the process of getting OpenMP target to match. While the extra context wastes memory, it shouldn’t affect performance or correctness.

I went ahead and added a new issue report, TPR #33544, so engineering can use your example as another test case and so I can let you know when the issue has been fixed in a release.

Now with MPI, you may still see this initial context being created on device 0. It depends on the MPI implementation, but often when CUDA-aware MPI is enabled, this context gets created when MPI_Init is called. The only workaround is to wrap your program in a shell script which sets the environment variable “CUDA_VISIBLE_DEVICES” to the local rank’s device.
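
A minimal sketch of such a wrapper script follows. It assumes the launcher exports one of the local-rank variables OMPI_COMM_WORLD_LOCAL_RANK (Open MPI), MV2_COMM_WORLD_LOCAL_RANK (MVAPICH2), or SLURM_LOCALID (Slurm); other launchers may use different names, so check yours. The helper name pick_local_rank and the file name bind_gpu.sh are made up for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper (e.g. saved as bind_gpu.sh): exposes one GPU per local rank.
# Assumption: the MPI launcher sets one of the local-rank variables checked below.

pick_local_rank() {
    # Use the first variable that is set and non-empty; fall back to 0.
    echo "${OMPI_COMM_WORLD_LOCAL_RANK:-${MV2_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}}"
}

# Make only this rank's GPU visible before the program (and MPI_Init) starts.
export CUDA_VISIBLE_DEVICES="$(pick_local_rank)"

# Replace the wrapper with the real program and its arguments, if any were given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

It would be launched along the lines of `mpirun -np 4 ./bind_gpu.sh ./a.out`. Note that with a single GPU visible, each process sees that GPU as device 0, so explicit device numbers in the code (such as device(1) in the example above) would need to be dropped or set to 0.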

-Mat

MatColgrove · May 25, 2023, 9:09pm

Hi goduck777,

FYI, we’ve updated OpenMP so that context creation is delayed until first use. For your program, the context will now only be created on device 1.

-Mat
