A simple OpenACC question from a novice

limtaejun · August 31, 2017, 2:53am

From the code below, I’d like to compute the array ‘VF’. The other arrays, ‘temp_c’ and ‘temp_VF’, are only temporaries and are not needed afterwards, so I know they could be declared with a create clause. The problem is that ‘an’, which is both the trip count of the outermost loop and one of the extents of ‘temp_c’ and ‘temp_VF’, is so large that I end up getting an out-of-memory error on the GPU.

!$acc parallel loop collapse(4) gang worker vector
do ia = 1, an
do ie = 1, en
do ip = 1, pn
do im = 1, mn
do iet = 1, etn
	do iaa = 1, an
	 	temp_c(ia,ie,ip,im,iet,iaa) = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)

		if ( temp_c(ia,ie,ip,im,iet,iaa) < 0.0d0 ) then
			temp_VF(ia,ie,ip,im,iet,iaa) = -1.0d10
		else
			temp_VF(ia,ie,ip,im,iet,iaa) = temp_c(ia,ie,ip,im,iet,iaa)**0.5d0
		end if
	end do
end do
end do
end do
end do
end do
!$acc end parallel loop 
!$acc parallel loop collapse(3) gang worker vector
do ia = 1, an
do ie = 1, en
do ip = 1, pn
do im = 1, mn
do iet = 1, etn
	VF(ia,ie,ip,im,iet) = maxval(temp_VF(ia,ie,ip,im,iet,:))
end do
end do
end do
end do
end do
!$acc end parallel loop

Thus, I modified the code as follows:

!$acc kernerls loop
do ia = 1, an
do ie = 1, en
do ip = 1, pn
do im = 1, mn
!$acc loop private(temp_c, temp_VF)
do iet = 1, etn
	do iaa = 1, an
		temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
		if ( temp_c < 0.0d0 ) then
			temp_VF(iaa) = -1.0d10
		else
			temp_VF(iaa) = temp_c**0.5d0
		end if
	end do
end do
VF_HP(ia,ie,ip,im,iet) = maxval(temp_VF)
end do
end do
end do
end do

I’m not sure whether this is an efficient way to avoid the out-of-memory error. Any suggestions are very much appreciated.

MatColgrove · August 31, 2017, 4:09pm

Hi limtaejun,

I’m thinking that you don’t need to use “temp_VF” at all and could instead just do a max reduction on a scalar in the inner loop. Something like:

!$acc kernels loop 
do ia = 1, an 
do ie = 1, en 
do ip = 1, pn 
do im = 1, mn 
do iet = 1, etn 
   maxv = 0.0
   do iaa = 1, an 
       temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa) 
       if (temp_c .gt. 0.0d0) then
             temp_c=temp_c**0.5d0
       endif
       maxv = max(maxv,temp_c) 
   end do 
   VF_HP(ia,ie,ip,im,iet) = maxv
end do 
end do 
end do 
end do 
end do

Of course, I don’t know your algorithm so use this only if it works in your context.

As for your out-of-memory question, the second case is a valid approach, apart from the typos and the assignment to VF_HP being at the wrong loop level (it should be inside the iet loop). Also, scalars are private by default, so there is no need to include “temp_c” in the private clause.

For each variable in a private clause, an array of copies is created, one per vector, worker, or gang depending on the schedule of the loop. So here, I’d recommend putting “temp_VF” in a private clause on the outer gang loop and making “iaa” a vector loop. This way each gang creates its own private copy of “temp_VF”, which is then shared among the vectors in that gang.

You can then control the amount of memory used either by using the “gang()” clause to set the number of gangs or by collapsing fewer of the outer loops. In both cases you’re limiting parallelism, which may hurt performance.

Something like the following:

!$acc kernels loop gang(<num_gangs>) collapse(<nlevels>) private(temp_VF) 
do ia = 1, an 
do ie = 1, en 
do ip = 1, pn 
do im = 1, mn  
do iet = 1, etn 
   !$acc loop vector
   do iaa = 1, an 
      temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa) 
      if ( temp_c < 0.0d0 ) then 
         temp_VF(iaa) = -1.0d10 
      else 
         temp_VF(iaa) = temp_c**0.5d0 
      end if 
   end do 
   VF_HP(ia,ie,ip,im,iet) = maxval(temp_VF) 
end do
end do 
end do 
end do 
end do
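
As a rough illustration of the memory trade-off described above, the following sketch compares the footprint of one full six-dimensional temporary with that of one private copy of “temp_VF” per gang. The grid sizes and the gang count are hypothetical values chosen only for the arithmetic; they are not taken from the original code.

```fortran
program footprint
  use iso_fortran_env, only: int64
  implicit none
  ! Hypothetical sizes, for illustration only
  integer(int64), parameter :: an = 200, en = 10, pn = 10, mn = 10, etn = 10
  integer(int64), parameter :: num_gangs = 1000
  integer(int64), parameter :: bytes_per_elem = 8   ! double precision
  integer(int64) :: full_bytes, private_bytes

  ! One 6-D temporary such as temp_VF(an,en,pn,mn,etn,an)
  full_bytes = an * en * pn * mn * etn * an * bytes_per_elem
  ! One temp_VF(an) copy per gang
  private_bytes = num_gangs * an * bytes_per_elem

  ! Self-check of the arithmetic for the sizes chosen above
  if (full_bytes /= 3200000000_int64) error stop "unexpected full size"
  if (private_bytes /= 1600000_int64) error stop "unexpected private size"

  print *, "full 6-D temporary (bytes):     ", full_bytes
  print *, "per-gang private copies (bytes):", private_bytes
end program footprint
```

With these assumed sizes the per-gang copies are several orders of magnitude smaller than the full temporary, which is why limiting the number of gangs bounds the memory use.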

Hope this helps,
Mat

limtaejun · September 1, 2017, 8:14am

Thanks a lot for your detailed answer, Mat. It really gave me a chance to think about how to improve the code using OpenACC. I’m sorry to keep bothering you, but let me ask two more questions:

(i) As you pointed out, temp_VF seems unnecessary in my code. If so, is there still any benefit, in terms of memory or efficiency, to going with the 2nd code you suggested?

(ii) Regarding the 1st code you suggested, if my GPU memory still can’t hold VF_HP, should I move some of the loops outside the OpenACC directive (see the code below)? Or is there another way around this?

do ia = 1, an 
do ie = 1, en 

!$acc kernels loop 
do ip = 1, pn 
do im = 1, mn 
do iet = 1, etn 
   maxv = 0.0 
   do iaa = 1, an 
       temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa) 
       if (temp_c .gt. 0.0d0) then 
             temp_c=temp_c**0.5d0 
       endif 
       maxv = max(maxv,temp_c) 
   end do 
   VF_HP(ia,ie,ip,im,iet) = maxv 
end do 
end do 
end do 

end do 
end do

MatColgrove · September 5, 2017, 3:08pm

(i) As you pointed out, temp_VF seems unnecessary in my code. If so, is there still any benefit, in terms of memory or efficiency, to going with the 2nd code you suggested?

For your code, I don’t see any benefit.

However, other codes may benefit from manually privatizing an array for a vector or worker loop. By manually privatizing, I mean that instead of putting an array in a “private” clause, you add extra dimensions to the array, one per outer loop. When “private” is used, the compiler creates a copy of the array for every vector, which can lead to poor data access since the vectors can’t access the memory in a contiguous block. By manually privatizing the arrays, you have better control over the data layout and can have the vectors access the data across the stride-1 dimension (the first index in Fortran, the last index in C/C++). The cost is that manual privatization may consume more global memory.
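
A minimal sketch of the manual privatization idea follows. The arrays “x”, “y”, “tmp”, and “res” and their sizes are hypothetical and are not part of the original code. Instead of listing “tmp” in a private clause, it gets an extra dimension for the outer loop, and the vector index is kept in the first (stride-1) position.

```fortran
program manual_privatization
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_outer = 64, n_inner = 128   ! hypothetical sizes
  ! tmp has one extra dimension (j) instead of appearing in a private clause
  real(dp) :: tmp(n_inner, n_outer), res(n_outer), x(n_inner), y(n_outer)
  integer :: i, j

  do i = 1, n_inner
     x(i) = real(i, dp)
  end do
  do j = 1, n_outer
     y(j) = real(j, dp)
  end do

  !$acc parallel loop gang copyin(x, y) create(tmp) copyout(res)
  do j = 1, n_outer
     !$acc loop vector
     do i = 1, n_inner
        ! each (i,j) pair writes its own slot; i is the stride-1 index
        tmp(i, j) = sqrt(x(i) + y(j))
     end do
     res(j) = maxval(tmp(:, j))
  end do

  ! Self-check: the maximum over i is reached at i = n_inner
  do j = 1, n_outer
     if (abs(res(j) - sqrt(real(n_inner + j, dp))) > 1.0d-12) error stop "mismatch"
  end do
  print *, "ok"
end program manual_privatization
```

Without an OpenACC compiler the directives are treated as comments and the program runs serially, which makes it easy to check the logic first. Note that “tmp” now needs n_inner * n_outer elements of device memory, which is the extra global memory cost mentioned above.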

(ii) Regarding the 1st code you suggested, if my GPU memory still can’t hold VF_HP, should I move some of the loops outside the OpenACC directive (see the code below)? Or is there another way around this?

You have to block the code so that only a portion of the array is on the device at a time or, better yet, use multiple GPUs. So your example is a step in the right direction. The only things you need to add are data directives that copy only the used portion of the VF_HP array.
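
A single-GPU sketch of this blocking idea, under stated assumptions: the grid sizes and values are small placeholders, and “blk” is a hypothetical contiguous staging buffer introduced here so that only one (ip, im, iet) block lives on the device at a time. It uses “parallel loop” with a sequential inner loop rather than “kernels loop”, which is a choice made for this sketch only.

```fortran
program blocked_vf
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical small sizes so the sketch runs quickly
  integer, parameter :: an = 4, en = 3, pn = 5, mn = 2, etn = 3
  real(dp) :: aG(an), eG(en), pG(pn), mG(mn), etG(etn)
  real(dp) :: VF_HP(an, en, pn, mn, etn)
  real(dp) :: blk(pn, mn, etn)   ! hypothetical staging buffer for one block
  real(dp) :: aGtemp, eGtemp, temp_c, maxv, expected
  integer :: ia, ie, ip, im, iet, iaa, i

  ! Placeholder grids: every grid point equals its index
  aG  = [(real(i, dp), i = 1, an)]
  eG  = [(real(i, dp), i = 1, en)]
  pG  = [(real(i, dp), i = 1, pn)]
  mG  = [(real(i, dp), i = 1, mn)]
  etG = [(real(i, dp), i = 1, etn)]

  !$acc data copyin(aG, pG, mG, etG) create(blk)
  do ia = 1, an
  do ie = 1, en
     aGtemp = aG(ia)
     eGtemp = eG(ie)
     !$acc parallel loop collapse(3) private(maxv, temp_c)
     do iet = 1, etn
     do im = 1, mn
     do ip = 1, pn
        maxv = 0.0d0
        !$acc loop seq
        do iaa = 1, an
           temp_c = aGtemp + eGtemp + pG(ip) + mG(im) + etG(iet) - aG(iaa)
           if (temp_c > 0.0d0) temp_c = temp_c**0.5d0
           maxv = max(maxv, temp_c)
        end do
        blk(ip, im, iet) = maxv
     end do
     end do
     end do
     ! Bring back only this block and store it in the host array
     !$acc update self(blk)
     VF_HP(ia, ie, :, :, :) = blk
  end do
  end do
  !$acc end data

  ! Self-check: with these grids the maximum is reached at iaa = 1
  do iet = 1, etn; do im = 1, mn; do ip = 1, pn; do ie = 1, en; do ia = 1, an
     expected = sqrt(real(ia + ie + ip + im + iet - 1, dp))
     if (abs(VF_HP(ia, ie, ip, im, iet) - expected) > 1.0d-12) error stop "mismatch"
  end do; end do; end do; end do; end do
  print *, "ok"
end program blocked_vf
```

The device only ever holds the small grids plus one pn x mn x etn block, so the device footprint no longer grows with “an” or “en”. The price is one kernel launch and one transfer per (ia, ie) pair.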

For multiple GPUs, you can either use MPI or put OpenMP directives around the outer loop. Something like the following:

! assumes "use openacc" and "use omp_lib" in the enclosing program unit
numDev = acc_get_num_devices(acc_get_device_type())

!$omp parallel num_threads(numDev) private(thdid, aGtemp, eGtemp, maxv, temp_c)

thdid = omp_get_thread_num()

!set each OMP thread's device
call acc_set_device_num(thdid, acc_get_device_type())

!$acc data copyin(pG, mG, etG, aG)

!$omp do collapse(2)
do ia = 1, an 
do ie = 1, en 

aGtemp = aG(ia)
eGtemp = eG(ie)

!$acc kernels loop copyout(VF_HP(ia,ie,:,:,:))
do ip = 1, pn 
do im = 1, mn 
do iet = 1, etn 
   maxv = 0.0 
   do iaa = 1, an 
       temp_c = aGtemp + eGtemp + pG(ip) + mG(im) + etG(iet) - aG(iaa) 
       if (temp_c .gt. 0.0d0) then 
             temp_c=temp_c**0.5d0 
       endif 
       maxv = max(maxv,temp_c) 
   end do 
   VF_HP(ia,ie,ip,im,iet) = maxv 
end do 
end do 
end do 

end do 
end do 
!$acc end data
!$omp end parallel
