# A simple OpenACC question from a novice

From the code below, I’d like to obtain the array ‘VF’. The other arrays, ‘temp_c’ and ‘temp_VF’, are not needed afterwards, so I know they could be placed in a create clause. The problem is that ‘an’, which is both the trip count of the outermost loop and a factor in the size of ‘temp_c’ and ‘temp_VF’, is so large that I end up with an out-of-memory error on the GPU.

```fortran
!$acc parallel loop collapse(4) gang worker vector
do ia = 1, an
  do ie = 1, en
    do ip = 1, pn
      do im = 1, mn
        do iet = 1, etn
          do iaa = 1, an
            temp_c(ia,ie,ip,im,iet,iaa) = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)

            if ( temp_c(ia,ie,ip,im,iet,iaa) < 0.0d0 ) then
              temp_VF(ia,ie,ip,im,iet,iaa) = -1.0d10
            else
              temp_VF(ia,ie,ip,im,iet,iaa) = temp_c(ia,ie,ip,im,iet,iaa)**0.5d0
            end if
          end do
        end do
      end do
    end do
  end do
end do
!$acc end parallel loop
!$acc parallel loop collapse(3) gang worker vector
do ia = 1, an
  do ie = 1, en
    do ip = 1, pn
      do im = 1, mn
        do iet = 1, etn
          VF(ia,ie,ip,im,iet) = maxval(temp_VF(ia,ie,ip,im,iet,:))
        end do
      end do
    end do
  end do
end do
!$acc end parallel loop
```

Thus, I modified the code as follows:

```fortran
!$acc kernerls loop
do ia = 1, an
  do ie = 1, en
    do ip = 1, pn
      do im = 1, mn
        !$acc loop private(temp_c, temp_VF)
        do iet = 1, etn
          do iaa = 1, an
            temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
            if ( temp_c < 0.0d0 ) then
              temp_VF(iaa) = -1.0d10
            else
              temp_VF(iaa) = temp_c**0.5d0
            end if
          end do
        end do
        VF_HP(ia,ie,ip,im,iet) = maxval(temp_VF)
      end do
    end do
  end do
end do
```

I’m not sure whether this is an efficient way to avoid the out-of-memory problem. Any suggestions would be very much appreciated.
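
To put the memory pressure in numbers, here is a minimal sketch that estimates the footprint of the two six-dimensional temporaries against that of the five-dimensional result. The grid sizes are hypothetical (the post does not give the real ones); double precision (8 bytes per element) is assumed from the `d0` literals in the code.

```fortran
program estimate_temp_memory
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Hypothetical grid sizes, chosen only for illustration
  integer(int64), parameter :: an = 200, en = 5, pn = 10, mn = 10, etn = 5
  integer(int64), parameter :: bytes_per_real = 8   ! double precision
  integer(int64) :: n5, n6
  real(real64)   :: gb_temps, gb_vf

  n5 = an * en * pn * mn * etn   ! elements in VF
  n6 = n5 * an                   ! elements in temp_c (same for temp_VF)

  gb_temps = 2.0_real64 * real(n6 * bytes_per_real, real64) / 1.0e9_real64
  gb_vf    = real(n5 * bytes_per_real, real64) / 1.0e9_real64

  print '(a, f10.4, a)', 'temp_c + temp_VF: ', gb_temps, ' GB'
  print '(a, f10.4, a)', 'VF alone:         ', gb_vf, ' GB'
end program estimate_temp_memory
```

With these made-up sizes the two temporaries need about 1.6 GB while VF needs about 4 MB; the temporaries are larger than VF by a factor of 2·an, which is why growing ‘an’ exhausts device memory so quickly.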

Hi limtaejun,

I’m thinking that you don’t need to use “temp_VF” at all and could instead just do a max reduction on a scalar in the inner loop. Something like:

```fortran
!$acc kernels loop
do ia = 1, an
  do ie = 1, en
    do ip = 1, pn
      do im = 1, mn
        do iet = 1, etn
          ! Running maximum over iaa replaces the temp_VF array.
          ! Because it starts at 0, the result is 0 (not -1.0d10)
          ! when every temp_c is negative.
          maxv = 0.0d0
          do iaa = 1, an
            temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
            if (temp_c .gt. 0.0d0) then
              temp_c = temp_c**0.5d0
            endif
            maxv = max(maxv, temp_c)
          end do
          VF_HP(ia,ie,ip,im,iet) = maxv
        end do
      end do
    end do
  end do
end do
```

Of course, I don’t know your algorithm so use this only if it works in your context.
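
One way to check whether a scalar reduction fits the original algorithm is to compare it against the `temp_VF`/`maxval` formulation on a small case. The sketch below does that on the host. It is an illustration under assumptions, not the code from the thread: the grid values are hypothetical, `rest` stands in for `aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet)`, `sqrt` replaces `**0.5d0`, and the running maximum starts at `-1.0d10` (rather than 0) so that the all-negative case matches the original.

```fortran
program check_scalar_reduction
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: an = 6            ! hypothetical grid size
  real(dp) :: aG(an), temp_VF(an)
  real(dp) :: rest, temp_c, maxv
  integer  :: iaa, k

  aG = [(real(iaa, dp), iaa = 1, an)]     ! hypothetical grid: 1, 2, ..., 6

  do k = 1, 2
    if (k == 1) then
      rest = 3.5_dp    ! temp_c >= 0 for some iaa
    else
      rest = 0.5_dp    ! temp_c < 0 for every iaa
    end if

    ! Original formulation: fill temp_VF, then take maxval
    do iaa = 1, an
      temp_c = rest - aG(iaa)
      if (temp_c < 0.0_dp) then
        temp_VF(iaa) = -1.0e10_dp
      else
        temp_VF(iaa) = sqrt(temp_c)
      end if
    end do

    ! Scalar formulation: running maximum, no temporary array
    maxv = -1.0e10_dp
    do iaa = 1, an
      temp_c = rest - aG(iaa)
      if (temp_c >= 0.0_dp) maxv = max(maxv, sqrt(temp_c))
    end do

    if (maxv /= maxval(temp_VF)) error stop 'scalar reduction differs from maxval'
  end do

  print *, 'scalar reduction matches maxval(temp_VF)'
end program check_scalar_reduction
```

Whether the `-1.0d10` penalty actually matters depends on how VF is used later, which the thread does not say.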

As for your out-of-memory question: the second version is a valid approach, apart from the typos and the assignment to VF_HP being at the wrong loop level (it should be inside the iet loop). Also, scalars are private by default, so there is no need to list “temp_c”.

For each variable in the private clause, an array of those variables will be created, one per vector, worker, or gang depending upon the schedule of the loop. So here, I’d recommend putting “temp_VF” in a private clause on the outer gang loop and making “iaa” a vector loop. This way each gang creates a private copy of “temp_VF”, which is then shared among the vectors in that gang.

You can then control the amount of memory used either by using the “gang()” clause to set the number of gangs, or by collapsing fewer of the outer loops. In both cases you’re limiting parallelization, which may hurt performance.

Something like the following:

```fortran
!$acc kernels loop gang(<num_gangs>) collapse(<nlevels>) private(temp_VF)
do ia = 1, an
  do ie = 1, en
    do ip = 1, pn
      do im = 1, mn
        do iet = 1, etn
          !$acc loop vector
          do iaa = 1, an
            temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
            if ( temp_c < 0.0d0 ) then
              temp_VF(iaa) = -1.0d10
            else
              temp_VF(iaa) = temp_c**0.5d0
            end if
          end do
          VF_HP(ia,ie,ip,im,iet) = maxval(temp_VF)
        end do
      end do
    end do
  end do
end do
```
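
For a concrete instance of the template above with the placeholders filled in, here is a small self-checking sketch. Everything specific in it is an assumption made for illustration: the tiny grid sizes and values, the choice of `collapse(5)` and `num_gangs(16)`, the use of a `parallel loop` with `num_gangs` instead of `kernels loop gang(...)`, and `sqrt` in place of `**0.5d0`. It compiles as plain Fortran too, in which case the directives are treated as comments.

```fortran
program gang_private_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical, deliberately tiny sizes so the check below runs anywhere
  integer, parameter :: an = 8, en = 2, pn = 3, mn = 3, etn = 2
  real(dp) :: aG(an), eG(en), pG(pn), mG(mn), etG(etn)
  real(dp) :: temp_VF(an), VF_HP(an,en,pn,mn,etn)
  real(dp) :: temp_c, maxv
  integer  :: ia, ie, ip, im, iet, iaa, i

  aG  = [(0.5_dp * i, i = 1, an)]
  eG  = [(0.1_dp * i, i = 1, en)]
  pG  = [(0.2_dp * i, i = 1, pn)]
  mG  = [(0.3_dp * i, i = 1, mn)]
  etG = [(-1.0_dp * i, i = 1, etn)]

  ! temp_VF is private per gang, so at most 16 copies of it exist
  !$acc parallel loop gang collapse(5) num_gangs(16) private(temp_VF) &
  !$acc   copyin(aG, eG, pG, mG, etG) copyout(VF_HP)
  do ia = 1, an
    do ie = 1, en
      do ip = 1, pn
        do im = 1, mn
          do iet = 1, etn
            !$acc loop vector private(temp_c)
            do iaa = 1, an
              temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
              if (temp_c < 0.0_dp) then
                temp_VF(iaa) = -1.0e10_dp
              else
                temp_VF(iaa) = sqrt(temp_c)
              end if
            end do
            VF_HP(ia,ie,ip,im,iet) = maxval(temp_VF)
          end do
        end do
      end do
    end do
  end do

  ! Host-side reference using a scalar running maximum
  do ia = 1, an
    do ie = 1, en
      do ip = 1, pn
        do im = 1, mn
          do iet = 1, etn
            maxv = -1.0e10_dp
            do iaa = 1, an
              temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
              if (temp_c >= 0.0_dp) maxv = max(maxv, sqrt(temp_c))
            end do
            if (abs(VF_HP(ia,ie,ip,im,iet) - maxv) > 1.0e-12_dp) error stop 'mismatch'
          end do
        end do
      end do
    end do
  end do
  print *, 'gang-private result matches the reference'
end program gang_private_sketch
```

The extra device memory for the scratch array is then roughly `num_gangs * an * 8` bytes, independent of the other grid sizes.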

Hope this helps,
Mat

Thanks a lot for your detailed answer, Mat. It really gave me a chance to think about how to improve the code using OpenACC. Sorry to keep bothering you, but let me ask two more questions:

(i) As you pointed out, temp_VF seems unnecessary in my code. If so, is there still any benefit, in terms of memory or efficiency, to going with the second code you suggested?

(ii) Regarding the first code you suggested: if my GPU memory still can’t hold VF_HP, should I move some of the loops outside the OpenACC directive (see the code below)? Or is there another way around this?

```fortran
do ia = 1, an
  do ie = 1, en

    !$acc kernels loop
    do ip = 1, pn
      do im = 1, mn
        do iet = 1, etn
          maxv = 0.0d0
          do iaa = 1, an
            temp_c = aG(ia) + eG(ie) + pG(ip) + mG(im) + etG(iet) - aG(iaa)
            if (temp_c .gt. 0.0d0) then
              temp_c = temp_c**0.5d0
            endif
            maxv = max(maxv, temp_c)
          end do
          VF_HP(ia,ie,ip,im,iet) = maxv
        end do
      end do
    end do

  end do
end do
```
> (i) As you pointed out, temp_VF seems unnecessary in my code. If so, is there still any benefit, in terms of memory or efficiency, to going with the second code you suggested?

For your code, I don’t see any benefit.

However, other codes may benefit from manually privatizing an array for a vector or worker loop. By manually privatizing, I mean that instead of putting the array in a “private” clause, you add extra dimensions to it, one per outer loop. When “private” is used on such a loop, the compiler creates a copy of the array for every vector, which can lead to poor data access because the vectors can’t access memory in a contiguous block. By manually privatizing the array, you have better control over the data layout and can have the vectors access the data along the stride-1 dimension (the first index in Fortran, the last index in C/C++). The cost is that manual privatization may consume more global memory.
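
As a rough illustration of manual privatization, the sketch below gives the scratch array one extra dimension per outer loop and keeps `iaa` as the first (stride-1) index, so neighbouring vector lanes touch neighbouring memory. It is a simplified, hypothetical variant of the loop nest in this thread: the `ia` and `ie` loops are dropped, the sizes and grid values are made up, and `sqrt` stands in for `**0.5d0`.

```fortran
program manual_privatization_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: an = 8, pn = 3, mn = 3, etn = 2   ! hypothetical sizes
  real(dp) :: aG(an), pG(pn), mG(mn), etG(etn)
  ! Manually privatized scratch: one column of length an per (iet, im, ip).
  ! iaa is the first index, so it is the stride-1 dimension.
  real(dp) :: temp_VF(an, etn, mn, pn)
  real(dp) :: VF(pn, mn, etn)
  real(dp) :: temp_c
  integer  :: ip, im, iet, iaa, i

  aG  = [(0.5_dp * i, i = 1, an)]
  pG  = [(0.2_dp * i, i = 1, pn)]
  mG  = [(0.3_dp * i, i = 1, mn)]
  etG = [(1.0_dp * i, i = 1, etn)]

  ! No private clause: every (ip, im, iet) iteration owns its own column
  !$acc parallel loop gang collapse(3) copyin(aG, pG, mG, etG) &
  !$acc   create(temp_VF) copyout(VF)
  do ip = 1, pn
    do im = 1, mn
      do iet = 1, etn
        !$acc loop vector private(temp_c)
        do iaa = 1, an
          temp_c = pG(ip) + mG(im) + etG(iet) - aG(iaa)
          if (temp_c < 0.0_dp) then
            temp_VF(iaa, iet, im, ip) = -1.0e10_dp
          else
            temp_VF(iaa, iet, im, ip) = sqrt(temp_c)
          end if
        end do
        VF(ip, im, iet) = maxval(temp_VF(:, iet, im, ip))
      end do
    end do
  end do

  print *, 'largest VF entry: ', maxval(VF)
end program manual_privatization_sketch
```

Here the scratch array holds `an * etn * mn * pn` elements in device memory, which is the extra global-memory cost mentioned above.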

> (ii) Regarding the first code you suggested: if my GPU memory still can’t hold VF_HP, should I move some of the loops outside the OpenACC directive (see the code below)? Or is there another way around this?

You have to block the code so that only a portion of the array is on the device at a time or, better yet, use multiple GPUs. So your example is a step in the right direction. The only things you need to add are data directives, so that only the portion of the VF_HP array being used is copied.
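
A single-GPU version of this blocking idea might look like the sketch below. It is one possible arrangement, not the only one: instead of copying a section of VF_HP directly, it uses a small contiguous buffer (`slab`, a hypothetical name) that stays on the device and is copied back once per `(ia, ie)` block. The grid sizes and values are made up, and the scalar reduction uses `sqrt` with a `-1.0d10` starting value.

```fortran
program blocked_copyout_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: an = 8, en = 2, pn = 3, mn = 3, etn = 2   ! hypothetical sizes
  real(dp) :: aG(an), eG(en), pG(pn), mG(mn), etG(etn)
  real(dp) :: VF_HP(an,en,pn,mn,etn)   ! full result, kept on the host only
  real(dp) :: slab(pn,mn,etn)          ! the only output buffer on the device
  real(dp) :: base, temp_c, maxv
  integer  :: ia, ie, ip, im, iet, iaa, i

  aG  = [(0.5_dp * i, i = 1, an)]
  eG  = [(0.1_dp * i, i = 1, en)]
  pG  = [(0.2_dp * i, i = 1, pn)]
  mG  = [(0.3_dp * i, i = 1, mn)]
  etG = [(-1.0_dp * i, i = 1, etn)]

  ! Grids are copied in once; slab is allocated on the device once
  !$acc data copyin(aG, pG, mG, etG) create(slab)
  do ia = 1, an
    do ie = 1, en
      base = aG(ia) + eG(ie)   ! evaluated on the host for this block

      !$acc parallel loop collapse(3) present(slab) private(temp_c, maxv)
      do iet = 1, etn
        do im = 1, mn
          do ip = 1, pn
            maxv = -1.0e10_dp
            !$acc loop seq
            do iaa = 1, an
              temp_c = base + pG(ip) + mG(im) + etG(iet) - aG(iaa)
              if (temp_c >= 0.0_dp) maxv = max(maxv, sqrt(temp_c))
            end do
            slab(ip, im, iet) = maxv
          end do
        end do
      end do

      ! Bring this block back and store it in the host-side result
      !$acc update self(slab)
      VF_HP(ia, ie, :, :, :) = slab
    end do
  end do
  !$acc end data

  print *, 'largest VF_HP entry: ', maxval(VF_HP)
end program blocked_copyout_sketch
```

Device memory for the output is then `pn * mn * etn` elements regardless of `an` and `en`, at the price of one kernel launch and one transfer per block.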

For multiple GPUs, you can either use MPI or put OpenMP directives around the outer loop. Something like the following:

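```fortran
! Assumes "use openacc" and "use omp_lib" in the enclosing scope
numDev = acc_get_num_devices(acc_get_device_type())

! One OpenMP thread per GPU; each thread binds to its own device
!$omp parallel num_threads(numDev) &
!$omp   private(thdid, aGtemp, eGtemp, ip, im, iet, iaa, maxv, temp_c)
thdid = omp_get_thread_num()
call acc_set_device_num(thdid, acc_get_device_type())

! Each thread copies the grids to its own device
!$acc data copyin(pG, mG, etG, aG)

!$omp do collapse(2)
do ia = 1, an
  do ie = 1, en

    aGtemp = aG(ia)
    eGtemp = eG(ie)

    ! Only the (ia,ie) portion of VF_HP is copied back
    !$acc kernels loop copyout(VF_HP(ia,ie,:,:,:))
    do ip = 1, pn
      do im = 1, mn
        do iet = 1, etn
          maxv = 0.0d0
          do iaa = 1, an
            temp_c = aGtemp + eGtemp + pG(ip) + mG(im) + etG(iet) - aG(iaa)
            if (temp_c .gt. 0.0d0) then
              temp_c = temp_c**0.5d0
            endif
            maxv = max(maxv, temp_c)
          end do
          VF_HP(ia,ie,ip,im,iet) = maxv
        end do
      end do
    end do

  end do
end do
!$omp end do

!$acc end data
!$omp end parallel
```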