A simple OpenACC question from a novice

From the code below, I’d like to get the array ‘VF’. The other arrays, ‘temp_c’ and ‘temp_VF’, are not needed afterwards, so I know they could be declared in a create clause. The problem is that ‘an’, which is both the trip count of the outermost loop and a factor in the size of ‘temp_c’ and ‘temp_VF’, is so large that I get an out-of-memory error on the GPU.

!$acc parallel loop collapse(4) gang worker vector
do ia = 1, an
do ie = 1, en
do ip = 1, pn
do im = 1, mn
do iet = 1, etn
	do iaa = 1, an
	 	temp_c(ia,ie,ip,im,iet,iaa) = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)

		if ( temp_c(ia,ie,ip,im,iet,iaa) < 0.0d0 ) then
			temp_VF(ia,ie,ip,im,iet,iaa) = -1.0d10
		else
			temp_VF(ia,ie,ip,im,iet,iaa) = temp_c(ia,ie,ip,im,iet,iaa)**0.5d0
		end if
	end do
end do
end do
end do
end do
end do
!$acc end parallel loop 
!$acc parallel loop collapse(3) gang worker vector
do ia = 1, an
do ie = 1, en
do ip = 1, pn
do im = 1, mn
do iet = 1, etn
	VF(ia,ie,ip,im,iet) = maxval(temp_VF(ia,ie,ip,im,iet,:))
end do
end do
end do
end do
end do
!$acc end parallel loop

Thus, I modified the code as follows:

!$acc kernerls loop
do ia = 1, an
do ie = 1, en
do ip = 1, pn
do im = 1, mn
!$acc loop private(temp_c, temp_VF)
do iet = 1, etn
	do iaa = 1, an
		temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
		if ( temp_c < 0.0d0 ) then
			temp_VF(iaa) = -1.0d10
		else
			temp_VF(iaa) = temp_c**0.5d0
		end if
	end do
end do
VF_HP(ia,ie,ip,im,iet) = maxval(temp_VF)
end do
end do
end do
end do

I’m not sure whether this is an efficient way to avoid the out-of-memory problem. Any suggestions are very much appreciated.

Hi limtaejun,

I’m thinking that you don’t need to use “temp_VF” at all and could instead just do a max reduction on a scalar in the inner loop. Something like:

!$acc kernels loop 
do ia = 1, an 
do ie = 1, en 
do ip = 1, pn 
do im = 1, mn 
do iet = 1, etn 
   maxv = 0.0
   do iaa = 1, an 
       temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa) 
       if (temp_c .gt. 0.0d0) then
             temp_c=temp_c**0.5d0
       endif
       maxv = max(maxv,temp_c) 
   end do 
   VF_HP(ia,ie,ip,im,iet) = maxv
end do 
end do 
end do 
end do 
end do

Of course, I don’t know your algorithm so use this only if it works in your context.
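
If the -1.0d10 sentinel from the original code matters, the scalar reduction can carry it as well. The following is only a minimal sketch under that assumption: the module name `vf_sketch` and the subroutine `compute_vf` are hypothetical, the clause choices are one possible schedule rather than a tuned one, and the directives are treated as comments when the file is compiled without OpenACC.

```fortran
module vf_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical helper: fills VF with the max over iaa of sqrt(temp_c),
  ! using -1.0d10 wherever temp_c is negative, without any temp_VF array.
  subroutine compute_vf(aG, eG, pG, mG, etG, VF)
    real(dp), intent(in)  :: aG(:), eG(:), pG(:), mG(:), etG(:)
    real(dp), intent(out) :: VF(:,:,:,:,:)
    real(dp) :: temp_c, v, maxv
    integer  :: ia, ie, ip, im, iet, iaa

    !$acc parallel loop gang collapse(5) private(maxv) &
    !$acc   copyin(aG, eG, pG, mG, etG) copyout(VF)
    do ia = 1, size(aG)
    do ie = 1, size(eG)
    do ip = 1, size(pG)
    do im = 1, size(mG)
    do iet = 1, size(etG)
      maxv = -1.0d10   ! same sentinel the original stored in temp_VF
      !$acc loop vector private(temp_c, v) reduction(max:maxv)
      do iaa = 1, size(aG)
        temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
        if (temp_c < 0.0d0) then
          v = -1.0d10
        else
          v = sqrt(temp_c)
        end if
        maxv = max(maxv, v)
      end do
      VF(ia, ie, ip, im, iet) = maxv
    end do
    end do
    end do
    end do
    end do
  end subroutine compute_vf
end module vf_sketch
```

Only the five-dimensional result and the one-dimensional grids live on the device here, so the six-dimensional temporaries disappear entirely.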

As for your out-of-memory question: the second case is a valid approach, apart from the typos and the assignment to VF_HP being at the wrong loop level (it should be inside the iet loop). Also, scalars are private by default, so there is no need to include “temp_c” in the private clause.

For each variable in a private clause, the compiler creates an array of copies, one per vector, worker, or gang depending upon the schedule of the loop. So here, I’d recommend putting “temp_VF” in a private clause on the outer gang loop, and making “iaa” a vector loop. This way each gang creates a private copy of “temp_VF”, which is then shared among the vectors in that gang.

You can then control the amount of memory used either by using the “gang()” clause to set the number of gangs, or by collapsing fewer of the outer loops. In both cases you’re limiting parallelization, which may impact performance.

Something like the following:

!$acc kernels loop gang(<num_gangs>) collapse(<nlevels>) private(temp_VF) 
do ia = 1, an 
do ie = 1, en 
do ip = 1, pn 
do im = 1, mn  
do iet = 1, etn 
   !$acc loop vector
   do iaa = 1, an 
      temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa) 
      if ( temp_c < 0.0d0 ) then 
         temp_VF(iaa) = -1.0d10 
      else 
         temp_VF(iaa) = temp_c**0.5d0 
      end if 
   end do 
   VF_HP(ia,ie,ip,im,iet) = maxval(temp_VF) 
end do
end do 
end do 
end do 
end do
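
To get a rough feel for the memory trade-off described above, the two footprints can be compared with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes 8-byte reals, ignores any padding or runtime overhead, and the module and function names (`mem_sketch`, `full_bytes`, `private_bytes`) are hypothetical.

```fortran
module mem_sketch
  use iso_fortran_env, only: int64
  implicit none
contains
  ! Bytes for ONE full six-dimensional double-precision array,
  ! e.g. temp_VF(an,en,pn,mn,etn,an) in the original code.
  pure function full_bytes(an, en, pn, mn, etn) result(nbytes)
    integer, intent(in) :: an, en, pn, mn, etn
    integer(int64) :: nbytes
    nbytes = 8_int64 * int(an, int64) * en * pn * mn * etn * an
  end function full_bytes

  ! Bytes for gang-private copies of temp_VF(an): one copy per gang.
  pure function private_bytes(num_gangs, an) result(nbytes)
    integer, intent(in) :: num_gangs, an
    integer(int64) :: nbytes
    nbytes = 8_int64 * int(num_gangs, int64) * an
  end function private_bytes
end module mem_sketch
```

For example, with an = 100 and the other four extents equal to 10, `full_bytes` gives 800,000,000 bytes per temporary, while `private_bytes(1024, 100)` gives 819,200 bytes.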

Hope this helps,
Mat

Thanks a lot for your detailed answer, Mat. It really gave me a chance to think about how to improve the code using OpenACC. I’m sorry to keep bothering you, but let me ask two more questions:

(i) as you pointed out, temp_VF seems unnecessary in my code. If so, is there still any benefit to go with the 2nd code you suggested in terms of memory and efficiency?

(ii) regarding the 1st code you suggested, if my GPU memory still can’t hold VF_HP, should I move some of the loops outside the OpenACC directive (see the code below)? Is there any other way around this?


do ia = 1, an 
do ie = 1, en 

!$acc kernels loop 
do ip = 1, pn 
do im = 1, mn 
do iet = 1, etn 
   maxv = 0.0 
   do iaa = 1, an 
       temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa) 
       if (temp_c .gt. 0.0d0) then 
             temp_c=temp_c**0.5d0 
       endif 
       maxv = max(maxv,temp_c) 
   end do 
   VF_HP(ia,ie,ip,im,iet) = maxv 
end do 
end do 
end do 

end do 
end do

(i) as you pointed out, temp_VF seems unnecessary in my code. If so, is there still any benefit to go with the 2nd code you suggested in terms of memory and efficiency?

For your code, I don’t see any benefit.

However, there may be other codes which benefit from manually privatizing an array for a vector or worker loop. By manually privatizing, I mean that instead of putting an array in a “private” clause, you add extra dimensions to the array, one per outer loop. When “private” is used, the compiler creates a copy of the array for every vector, which can lead to poor data access because the vectors can’t access the memory in a contiguous block. By manually privatizing the arrays, you have better control over data layout and can have the vectors access the data across the stride-1 dimension (columns in Fortran, rows in C/C++). The cost is that manual privatization may consume more global memory.
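
As an illustration of manual privatization, here is a small sketch unrelated to the VF code itself. It assumes each vector iteration `i` needs `k` scratch values; rather than a private array, it uses one shared `scratch(n, k)` array with `i` as the leftmost (stride-1) index, so neighbouring vector lanes touch neighbouring memory. The names `privatize_sketch`, `scaled_max`, and `scratch` are hypothetical.

```fortran
module privatize_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical example: out(i) = max over j = 1..k of x(i)*j,
  ! staged through a manually privatized scratch array.
  subroutine scaled_max(x, k, out)
    real(dp), intent(in)  :: x(:)
    integer,  intent(in)  :: k
    real(dp), intent(out) :: out(:)
    real(dp) :: scratch(size(x), k)   ! extra leading dimension replaces private()
    integer  :: i, j, n

    n = size(x)
    !$acc parallel loop gang vector copyin(x) copyout(out) create(scratch)
    do i = 1, n
      !$acc loop seq
      do j = 1, k
        scratch(i, j) = x(i) * real(j, dp)   ! lane i only touches row i
      end do
      out(i) = scratch(i, 1)
      !$acc loop seq
      do j = 2, k
        out(i) = max(out(i), scratch(i, j))
      end do
    end do
  end subroutine scaled_max
end module privatize_sketch
```

The scratch array costs n*k elements of global memory, which is the extra memory mentioned above.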

(ii) regarding the 1st code you suggested, if my GPU memory still can’t hold VF_HP, should I move some of the loops outside the OpenACC directive (see the code below)? Is there any other way around this?

You have to block the code so that only a portion of the array is on the device at a time or, better yet, use multiple GPUs. So your example is headed in the right direction. The only things you need to add are data directives, so that only the portion of the VF_HP array in use is copied.
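
One way to sketch the single-GPU blocking is shown here. It is an assumption-laden illustration, not the only layout: it stages each block through a contiguous device buffer `slab(pn, mn, etn)` and copies it back into VF_HP on the host, so VF_HP itself never lives on the device. The names `block_sketch`, `compute_blocked`, `slab`, and `base` are hypothetical, and the inner computation follows the scalar-reduction version from earlier in the thread.

```fortran
module block_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine compute_blocked(aG, eG, pG, mG, etG, VF_HP)
    real(dp), intent(in)  :: aG(:), eG(:), pG(:), mG(:), etG(:)
    real(dp), intent(out) :: VF_HP(:,:,:,:,:)
    real(dp) :: slab(size(pG), size(mG), size(etG))  ! one (ia,ie) block
    real(dp) :: base, temp_c, maxv
    integer  :: ia, ie, ip, im, iet, iaa

    ! Grids go to the device once; slab is device scratch space.
    !$acc data copyin(aG, pG, mG, etG) create(slab)
    do ia = 1, size(aG)
    do ie = 1, size(eG)
      base = aG(ia) + eG(ie)
      !$acc parallel loop collapse(3) private(maxv, temp_c) present(aG, pG, mG, etG, slab)
      do ip = 1, size(pG)
      do im = 1, size(mG)
      do iet = 1, size(etG)
        maxv = 0.0d0
        !$acc loop seq
        do iaa = 1, size(aG)
          temp_c = base + pG(ip) + mG(im) + etG(iet) - aG(iaa)
          if (temp_c > 0.0d0) temp_c = sqrt(temp_c)
          maxv = max(maxv, temp_c)
        end do
        slab(ip, im, iet) = maxv
      end do
      end do
      end do
      ! Bring just this block back and store it in the host array.
      !$acc update host(slab)
      VF_HP(ia, ie, :, :, :) = slab
    end do
    end do
    !$acc end data
  end subroutine compute_blocked
end module block_sketch
```

Device memory use is then bounded by the size of one slab plus the grids, at the cost of one kernel launch and one transfer per (ia, ie) pair.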

For multiple GPUs, you can either use MPI or put OpenMP directives around the outer loop. Something like the following:

! acc_get_num_devices takes the device type as its argument
numDev = acc_get_num_devices(acc_get_device_type())

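! per-thread scalars must be private to avoid races between OpenMP threads
!$omp parallel num_threads(numDev) private(thdid, aGtemp, eGtemp, maxv, temp_c)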

thdid = omp_get_thread_num()

!set each OMP thread's device
call acc_set_device_num(thdid, acc_get_device_type())

!$acc data copyin(pG, mG, etG, aG)

!$omp do collapse(2)
do ia = 1, an 
do ie = 1, en 

aGtemp = aG(ia)
eGtemp = eG(ie)

!$acc kernels loop copyout(VF_HP(ia,ie,:,:,:))
do ip = 1, pn 
do im = 1, mn 
do iet = 1, etn 
   maxv = 0.0 
   do iaa = 1, an 
       temp_c = aGtemp + eGtemp + pG(ip) + mG(im) + etG(iet) - aG(iaa) 
       if (temp_c .gt. 0.0d0) then 
             temp_c=temp_c**0.5d0 
       endif 
       maxv = max(maxv,temp_c) 
   end do 
   VF_HP(ia,ie,ip,im,iet) = maxv 
end do 
end do 
end do 

end do 
end do 
!$acc end data
!$omp end parallel