Re-use of a subroutine on several dynamic arrays

I would like to know the best practice for optimizing a very modular code that calls the same subroutines many times to perform geometrical transformations. A proxy for the code might look as follows:

module dyn_arr_gpu
!!! displacement, velocity & acc
real*4, dimension(:), allocatable :: u1,u2,u3,u1b,u2b,u3b
real*4, dimension(:), allocatable :: ur1,ur2,ur3,ur1b,ur2b,ur3b
!$acc mirror(u1,u2,u3,u1b,u2b,u3b,ur1,ur2,ur3,ur1b,ur2b,ur3b)

contains

subroutine sync_idx_rot()
!$acc update device(idx_rot,idx_norot)
end subroutine sync_idx_rot

subroutine sync_kernel_host()
!$acc update host(kernel1,kernel2,kernel3,kernel_fs)
end subroutine sync_kernel_host

subroutine sync_kernel_device()
!$acc update device(kernel1,kernel2,kernel3,kernel_fs)
end subroutine sync_kernel_device
end module

Now the subroutine that performs some operations on the dynamic arrays just defined.
v1, v2, v1r and v2r are dummy arguments that make the routine applicable to different arrays:

subroutine base_change_vect_rec(v1,v2,n,v1r,v2r)
   implicit none
   integer, intent(in) :: n
   real*4,dimension(n) :: v1,v2,v1r,v2r,v1_cp,v2_cp
   integer :: i
   !$acc region
   !$acc do private(v1_cp,v2_cp)
   do i=1,n
      v1_cp(i)=baser11(i)*v1(i)-baser12(i)*v2(i)
      v2_cp(i)=baser12(i)*v1(i)+baser11(i)*v2(i)
      v1r(i)=v1_cp(i);v2r(i)=v2_cp(i);
   enddo
   !$acc end region
end subroutine base_change_vect_rec

Finally the main code calling the subroutine:

do i=start_shot,nshots

   call sync_kernel_device()
   do j=1,i
      call base_change_vect_rec(ur1,ur2,npts3d,ur1,ur2)
      call base_change_vect_rec(ur1b,ur2b,npts3d,ur1b,ur2b)
   enddo
   call sync_kernel_host()
enddo

As it stands, the code spends a very long time moving data, and the compiler always generates copyin and copyout operations for the arrays in this declaration:

real*4,dimension(n) :: v1,v2,v1r,v2r,v1_cp,v2_cp

This is to be expected, since the compiler does not know that the arrays passed to the subroutine are already on the device (they are declared as mirrored) and are not needed on the host after the loop.

The GPU cluster I am using is under maintenance today, so I can't access the compiler output and timing results. If necessary, I can post them later.

How can I optimise the code and avoid all this data movement to and from the device?
I am compiling with version 11.7!

Thanks in advance

Hi acolomb,

Unfortunately, your code has a major problem: it's not valid Fortran, and you'll need to fix that before my suggestions can help you.

NOTE 12.29
If there is a partial or complete overlap between the actual arguments associated with two different dummy arguments of the same procedure and the dummy arguments have neither the POINTER nor TARGET attribute, the overlapped portions shall not be defined, redefined, or become undefined during the execution of the procedure. For example, in

CALL SUB (A (1:5), A (3:9))

A (3:5) shall not be defined, redefined, or become undefined through the first dummy argument because it is part of the argument associated with the second dummy argument and shall not be defined, redefined, or become undefined through the second dummy argument because it is part of the argument associated with the first dummy argument. A (1:2) remains definable through the first dummy argument and A (6:9) remains definable through the second dummy argument.

Your code violates this rule: each actual argument is associated with two dummy arguments (ur1 => v1,v1r and ur2 => v2,v2r), and the routine then modifies the dummy arguments v1r and v2r.
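
To make the aliasing point concrete, here is a minimal sketch in plain Fortran, with no accelerator directives. The names alias_demo, rotate_in_place, c and s are made up for illustration; c and s simply stand in for baser11 and baser12, which I'm assuming are arrays of length n. The routine takes each array once and updates it in place through scalar temporaries, so no actual argument is associated with two dummy arguments.

```fortran
program alias_demo
   implicit none
   integer, parameter :: n = 4
   real :: a(n), b(n), c(n), s(n)
   integer :: i

   ! Hypothetical stand-ins for baser11 (c) and baser12 (s).
   c = 0.0
   s = 1.0
   a = [(real(i), i = 1, n)]
   b = 0.0

   ! The original calling pattern passed each array twice, e.g.
   !    call base_change_vect_rec(a, b, n, a, b)
   ! which associates a and b with two dummy arguments each and then
   ! modifies them. That is what the standard prohibits.

   ! Conforming alternative: pass each array once and update it in place.
   call rotate_in_place(a, b, n, c, s)

   print *, a
   print *, b

contains

   subroutine rotate_in_place(v1, v2, n, c, s)
      integer, intent(in) :: n
      real, intent(inout) :: v1(n), v2(n)
      real, intent(in) :: c(n), s(n)
      real :: t1, t2
      integer :: i

      do i = 1, n
         ! Compute both results from the old values before overwriting.
         t1 = c(i)*v1(i) - s(i)*v2(i)
         t2 = s(i)*v1(i) + c(i)*v2(i)
         v1(i) = t1
         v2(i) = t2
      end do
   end subroutine rotate_in_place

end program alias_demo
```

The scalar temporaries matter: writing v1(i) before computing the second expression would feed the new v1(i) into the second result.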

So back to the original question. Here you would put a data region around your outer “i” loop and use the “reflected” directive to tell the compiler that the data has already been copied.

Note that the “private” clause means that each thread gets its own private copy of the variable. So using it on “v1_cp” and “v2_cp” means every thread will have its own n-element copy of each array. I think what you meant to use is the “local” clause, which says to allocate the variable on the device and not copy any data. Actually, in this case you’re better off just making v1_cp and v2_cp scalars. Scalars are privatized by default.

With this in mind, I’d rewrite the code to look something like:

!$acc data region copy(ur1,ur2,ur1b,ur2b)
do i=start_shot,nshots

   call sync_kernel_device()
   do j=1,i
      call base_change_vect_rec(ur1,ur2,npts3d)
      call base_change_vect_rec(ur1b,ur2b,npts3d)
   enddo
   call sync_kernel_host()
enddo
!$acc end data region



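subroutine base_change_vect_rec(v1,v2,n)
   implicit none
   integer, intent(in) :: n
   real*4,dimension(n) :: v1,v2
   ! scalar temporaries are privatized by default
   real*4 :: v1_cp,v2_cp
   integer :: i
   ! v1 and v2 were already copied to the device by the caller's data region
   !$acc reflected(v1,v2)
   !$acc region
   do i=1,n
      ! compute both results from the old values before overwriting them
      v1_cp=baser11(i)*v1(i)-baser12(i)*v2(i)
      v2_cp=baser12(i)*v1(i)+baser11(i)*v2(i)
      v1(i)=v1_cp
      v2(i)=v2_cp
   enddo
   !$acc end region
end subroutine base_change_vect_rec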

Or, if you do need v1_cp and v2_cp to be arrays, make them “local”:

subroutine base_change_vect_rec(v1,v2,n)
   implicit none
   integer, intent(in) :: n
   real*4,dimension(n) :: v1,v2
   real*4,dimension(n) :: v1_cp,v2_cp
   integer :: i
   ! v1 and v2 were already copied to the device by the caller's data region
   !$acc reflected(v1,v2)
   ! local: allocate the temporaries on the device without copying any data
   !$acc region local(v1_cp,v2_cp)
   do i=1,n
      v1_cp(i)=baser11(i)*v1(i)-baser12(i)*v2(i)
      v2_cp(i)=baser12(i)*v1(i)+baser11(i)*v2(i)
      v1(i)=v1_cp(i)
      v2(i)=v2_cp(i)
   enddo
   !$acc end region
end subroutine base_change_vect_rec

If you can, look at using the mirror directive on the baser variables. These appear to be module allocatable arrays, which are good candidates.
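
As a rough sketch of what that might look like, following the same pattern as your dyn_arr_gpu module: the module name base_gpu and the routine sync_base_device are made up, and I'm assuming baser11 and baser12 are real*4 allocatable arrays, since their declarations weren't posted.

```fortran
module base_gpu
   implicit none
   ! Assumed declarations; the original post does not show them.
   real*4, dimension(:), allocatable :: baser11, baser12
   ! Keep a device copy alongside each host array.
   !$acc mirror(baser11,baser12)

contains

   ! Hypothetical helper: call once after the host arrays have been filled,
   ! so the device copies hold the same values.
   subroutine sync_base_device()
      !$acc update device(baser11,baser12)
   end subroutine sync_base_device

end module base_gpu
```

If the baser arrays never change inside the shot loop, a single update before the loop should be enough, and the compute regions would then not need to copy them in on every call.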

Hope this helps,
Mat