Different GPU memory usage between OpenACC and OpenMP Offload

We are testing our GPU Fortran code with both OpenACC and OpenMP Offload, and we find that the two sets of directives can lead to very different GPU memory usage. Here is an example code:

program matrix_multiply
   use omp_lib
   use openacc
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs
   real :: temp2(10000)


   m=4

   n = 1000*2**(m-1)
   allocate( a(n,n), b(n,n), c(n,n) )

   do j=1,n
      do i=1,n
         a(i,j) = real(i + j)
         b(i,j) = real(i - j)
      enddo
   enddo

!$omp target teams distribute parallel do collapse(2) private(temp2)
!$acc data copyin(a,b) copy(c)
!$acc parallel loop gang collapse(2) private(temp2)
   do j=1,n
      do i=1,n
         tmp = 0.0
         temp2(:)=0.
         !$acc loop seq
         do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
            temp2(:) = temp2(:) + a(i,k) * b(k,j)
         enddo
         c(i,j) = tmp
         c(i,j) = sum(temp2)
      enddo
   enddo
!$acc end data

   deallocate(a, b, c)


end program matrix_multiply

When the code is compiled with NVHPC Fortran and -mp=gpu, the GPU RAM usage is almost double that of the version built with -acc.
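(For reference, the two builds are along the lines of nvfortran -mp=gpu mm.f90 versus nvfortran -acc mm.f90; the source file name and any extra flags here are just placeholders.)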

We found that the issue is related to the use of a thread-private variable (temp2 here). If we remove it, the usage is almost the same. But in our production code we do need a large temporary array for each job.

I wonder if there is a way to reduce the RAM usage of the OpenMP version so that it is similar to the OpenACC version.

In the OpenACC version, you’re privatizing “temp2” on just the gang loop, while in OpenMP you’re privatizing it for both the teams and the threads. In other words, there are far more private arrays in the OpenMP version. To be roughly equivalent, you’ll want to remove “parallel do” from the outer loop so that only the teams get private temp2 arrays.

Also, why aren’t you parallelizing the inner “k” loop? It’s parallelizable if you add a reduction clause and should give better performance.

program matrix_multiply
   use omp_lib
   use openacc
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs
   real :: temp2(10000)


   m=4

   n = 10*2**(m-1)
   allocate( a(n,n), b(n,n), c(n,n) )

   do j=1,n
      do i=1,n
         a(i,j) = real(i + j)
         b(i,j) = real(i - j)
      enddo
   enddo

!$omp target teams distribute collapse(2) private(temp2, tmp)
!$acc data copyin(a,b) copy(c)
!$acc parallel loop gang collapse(2) private(temp2)
   do j=1,n
      do i=1,n
         tmp = 0.0
         temp2(:)=0.
         !$omp  parallel do reduction(+:tmp)
         !$acc loop vector reduction(+:tmp)
         do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
            temp2(:) = temp2(:) + a(i,k) * b(k,j)
         enddo
         c(i,j) = tmp
         c(i,j) = sum(temp2)
      enddo
   enddo
!$acc end data

   deallocate(a, b, c)


end program matrix_multiply

Thank you for your reply. The options I put in the code are the best ones I found after many tries.

  1. If I use !$omp target teams distribute instead of !$omp target teams distribute parallel do, I get almost the same RAM usage, and the code runs much slower. I am not sure what the code is doing here.

  2. For OpenACC, if I enable parallelization of the inner loop along vectors, the code runs at least 5x slower. I guess the reduction operation on vectors is very slow.

  3. If I use acc kernels instead of acc parallel loop, the compiler parallelizes the outer two loops but not the inner one.

The test was done on an IBM Power9 system with NVIDIA V100 GPUs. Can you please check whether you can reproduce the issue?

I also tried to add thread_limit(128) after omp target teams distribute parallel do, but it does not change anything.

I guess the reduction operation on vectors is very slow.

Scalar reductions have a bit of overhead, but nothing bad. However, array reductions, especially for such a large array, can be problematic.

Now, I was cheating earlier in that I was ignoring the array reduction of “temp2”. In general, it’s best to avoid reductions of large arrays given the overhead, but also because, if the array size is unknown, the partial reduction arrays for each thread need to be allocated in the kernel itself, which can cause heap overflow issues.

So if this is a requirement of your production application, then, yes, it’s probably better to just parallelize the two outer loops and avoid the array reduction. Though you might try using OpenMP “target teams loop” in place of “target teams distribute parallel do”. The compiler will then schedule the teams to the outer loops and the threads to the inner loop implied by the array syntax, as it does in OpenACC.

Thank you for your reply. I think you are mixing two issues here, the memory usage and the performance. Let me clarify:

  1. The memory usage has nothing to do with the reduction. I tested it by removing the temp2 reduction line
      temp2(:) = temp2(:) + a(i,k) * b(k,j)

and I got the same OpenMP memory usage as before. Basically, the memory usage differs as long as the private clause is used, and it seems that OpenMP and OpenACC have very different implementations.

Can you please try to reproduce this issue? I think it is a memory-leak bug when using the private clause with OpenMP Offload, and it should be reported.

  2. The performance issue I mentioned in my second post has nothing to do with the large array reduction. I have tested the code with all the lines related to temp2 removed and still got the same results as described in post 2. I mentioned the performance issue only to answer your question. Basically, to get reasonable performance, I can only use omp target teams distribute parallel do or OpenACC.

  3. I am aware of the alternative target teams loop. My understanding is that this is just a copy of the OpenACC clause for OpenMP, and it is only implemented in NVHPC and not part of the OpenMP standard. The goal of our OpenMP development is to adapt the code for various compilers and hardware, so we tend not to use any proprietary extensions.

Yes, I understand that there are two issues.

Memory of the private array:

Again, the difference in memory usage is due to how you are scheduling the loops.

For OpenMP, here’s the mapping:

!$omp target teams distribute parallel do collapse(2) private(temp2)

“teams” → enter a team region (i.e. a CUDA kernel) where a team maps to a CUDA Block
“distribute” → the loops to distribute across the teams
“parallel” → enter a parallel region where the threads in this region map to a CUDA thread
“do” → the loops to distribute across the threads

Adding a “private” clause here means that every CUDA thread gets its own copy of the array. Hence the memory usage is: Num_Blocks X Num_Threads_Per_Block X Size_of_Array

For OpenACC:

!$acc parallel loop gang collapse(2) private(temp2)

“parallel” → Enter a compute region (i.e. a CUDA kernel)
“loop” → Distribute loop iterations across the defined schedule (i.e. either gang, worker, vector, seq, or a combination)
“gang” → Use gangs for the loop distribution (gang maps to a CUDA Block)

Since you don’t specify “vector”, the compiler is free to apply it as needed. In your case, it’s auto-parallelizing the inner array syntax across the vectors, but not the outer loops.

Hence, when adding a “private” clause on the gang loop, every CUDA Block gets its own copy of the array. The memory usage is: Num_Blocks X Size_of_Array, which is much smaller than in the OpenMP case.
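To make the scaling concrete, here is a minimal back-of-the-envelope sketch that plugs assumed block and thread counts into the two formulas above. The real counts are chosen by the compiler and runtime (a profiler such as Nsight Compute will report them), so the numbers below are for illustration only.

program private_footprint
   implicit none
   ! Assumed values for illustration only; the actual launch configuration
   ! is chosen by the compiler/runtime.
   integer, parameter :: array_elems = 10000       ! size of temp2
   integer, parameter :: bytes_per_elem = 4        ! default real
   integer, parameter :: num_blocks = 108          ! assumed number of CUDA blocks
   integer, parameter :: threads_per_block = 128   ! assumed threads per block
   real :: gang_private_mb, thread_private_mb

   ! OpenACC gang-private: Num_Blocks x Size_of_Array
   gang_private_mb = real(num_blocks) * array_elems * bytes_per_elem / 1024.**2
   ! OpenMP thread-private: Num_Blocks x Num_Threads_Per_Block x Size_of_Array
   thread_private_mb = real(num_blocks) * threads_per_block * array_elems * bytes_per_elem / 1024.**2

   print '(a,f10.1,a)', 'gang-private estimate:   ', gang_private_mb, ' MB'
   print '(a,f10.1,a)', 'thread-private estimate: ', thread_private_mb, ' MB'
end program private_footprint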

In order to reduce the memory usage by OpenMP, and match OpenACC, you would need to remove the “parallel do” from the outer loop so the array is only privatized across the teams (CUDA Blocks).

Note if you change

!$omp target teams distribute parallel do collapse(2) private(temp2)

to

!$omp target teams loop collapse(2) private(temp2)

The compiler will apply a similar schedule to what is done in OpenACC, i.e. teams applied to the outer loops and parallel applied to the inner array syntax. “loop” gives the compiler more freedom in scheduling, as opposed to “distribute parallel do”, where you, the programmer, decide.
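For reference, here is a minimal sketch of that substitution applied to your original loop nest (directive placement only; I’ve also added “tmp” to the private list, as in my earlier suggested version, to be safe):

!$omp target teams loop collapse(2) private(temp2, tmp)
   do j=1,n
      do i=1,n
         tmp = 0.0
         temp2(:) = 0.
         do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
            temp2(:) = temp2(:) + a(i,k) * b(k,j)
         enddo
         c(i,j) = tmp
         c(i,j) = sum(temp2)
      enddo
   enddo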

I am aware of the alternative target teams loop . My understanding is that this is just a copy of the OpenACC clause for OpenMP, and it is only implemented in NVHPC but not part of the OpenMP standard.

The “loop” directive is part of the OpenMP 5.0 standard: loop Construct

While NVHPC is ahead in its implementation of “loop”, other compilers have added or are adding support as well. My contacts on other compiler teams see the benefits of “loop”, especially for offloading.

Performance:

The performance issue I mentioned in my second post has nothing to do with the large array reduction. I have tested the code by removing all the lines related to temp2 , and still got the same results as described in post 2.

Ok, I was only testing with “temp2” included, and yes, I see that using an outer “gang vector” loop is faster here than gang on the outer loops with vector on the inner loop. I profiled the code using Nsight Compute and see that the reduction operation does have some impact, but the larger impact is due to memory and the array striding. Note that using an inner vector loop typically helps; it’s just that in this particular case it does not.

With “temp2” included, the compiler vectorizes the array syntax, so only gang is applied to the outer loops. In this case, also applying vector to the inner “k” loop helps by 2x. The question to me then is: what’s the performance impact of using “gang vector” on the outer collapsed loops, with the caveat that the memory usage increases due to the extra private copies of temp2? In this case, it appears to be about 20% slower.

Thank you again. I understand your explanation of how the OpenACC/OpenMP terms like “gang” and “distribute” relate to the hardware structure. I appreciate the information that the “loop” directive is included in the new standard, and I hope it will be supported by more compilers soon. For this post, I would like to focus on the memory usage, and I will open a new thread for the performance issue.

I still have some questions about the memory usage. I have done some more tests and found that in certain cases the memory usage due to private arrays is consistent with your description. Basically, I got memory usage “OpenMP teams distribute” < “OpenACC gang” < “OpenMP teams distribute parallel do”. However, there are cases where I got different and weird results. Here is an example

program matrix_multiply
   use omp_lib
   use openacc
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs
   real :: temp2(5000)


   m=3

   n = 1000*2**(m-1)
   allocate( a(n,n), b(n,n), c(n,n) )

   do j=1,n
      do i=1,n
         a(i,j) = real(i + j)
         b(i,j) = real(i - j)
      enddo
   enddo

!$omp target teams distribute parallel do collapse(2) private(temp2)
!$acc data copyin(a,b) copy(c)
!$acc parallel loop gang collapse(2) private(temp2)
   do j=1,n
      do i=1,n
         tmp = 0.0
         temp2(:)=0.
         !$acc loop seq
         do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
         enddo
         c(i,j) = tmp
         c(i,j) = temp2(i)
      enddo
   enddo
!$acc end data

   deallocate(a, b, c)


end program matrix_multiply

Here is the memory usage table

Directive                                    GPU Memory Usage
omp target teams distribute                  3473 MB
omp target teams distribute parallel do      3457 MB
acc parallel loop gang                       1746 MB

The results were obtained on an IBM Power9 CPU and a single NVIDIA V100 GPU. The compiler is NVHPC 22.5.

We found that the memory usage of omp target teams distribute is no lower than that of omp target teams distribute parallel do, and is significantly larger than that of acc parallel loop gang.

I think the question here is whether there is a way to reach the same memory usage as OpenACC when using OpenMP. I know omp target loop is an option, but I want to know whether it is achievable using just omp target teams.

Try the following:

!$omp target teams distribute collapse(2) private(temp2)
!$acc data copyin(a,b) copy(c)
!$acc parallel loop gang vector collapse(2) private(temp2)
   do j=1,n
      do i=1,n
         tmp = 0.0
!$omp parallel do
         do l=1,5000   ! l must be declared as an integer alongside the other loop indices
            temp2(l)=0.
         enddo
         !$acc loop seq
!$omp parallel do reduction(+:tmp)
         do k=1,n
            tmp = tmp + a(i,k) * b(k,j)
         enddo
         c(i,j) = tmp
         c(i,j) = temp2(i)
      enddo
   enddo
!$acc end data

This will get the memory down to ~500MB for OpenMP. OpenACC uses more memory since it’s using more blocks, 64K vs 108 with OpenMP.

With just distribute, you’re scheduling one iteration per block, so it’s using 16,000,000 blocks. Even if you use “num_teams” to reduce this, it will still need to privatize the array for each loop iteration, given that an implicit “parallel” is applied to the outer loops.
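For example, even something like the following (the team count here is just an assumption for illustration) still ends up privatizing temp2 per collapsed iteration:

!$omp target teams distribute collapse(2) num_teams(108) private(temp2)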

This is one of the main reasons why I prefer using “loop”. The compiler can apply more analysis, as opposed to “distribute”, where it’s up to the user. In particular with array syntax, since it’s an implied loop, it can’t be parallelized without compiler analysis.

Thank you very much for the answer! I finally understand the behavior of omp target teams distribute. To make the private array map to CUDA blocks rather than threads, I need to include at least one omp parallel do inside the distribute section. Otherwise it can lead to redundant copies of the private arrays.

I agree that using omp target loop is preferable.
