I would like to know the best practice for optimizing very modular code that calls the same subroutines many times to perform geometrical transformations. A proxy for the code might look as follows:
module dyn_arr_gpu
  !!! displacement, velocity & acc
  real*4, dimension(:), allocatable :: u1,u2,u3,u1b,u2b,u3b
  real*4, dimension(:), allocatable :: ur1,ur2,ur3,ur1b,ur2b,ur3b
  !$acc mirror(u1,u2,u3,u1b,u2b,u3b,ur1,ur2,ur3,ur1b,ur2b,ur3b)
contains
  subroutine sync_idx_rot()
    !$acc update device(idx_rot,idx_norot)
  end subroutine sync_idx_rot
  subroutine sync_kernel_host()
    !$acc update host(kernel1,kernel2,kernel3,kernel_fs)
  end subroutine sync_kernel_host
  subroutine sync_kernel_device()
    !$acc update device(kernel1,kernel2,kernel3,kernel_fs)
  end subroutine sync_kernel_device
end module
Now the subroutine that performs some operations on the dynamic arrays just defined.
v1, v2, v1r and v2r are dummy arguments that make the routine applicable to different arrays:
subroutine base_change_vect_rec(v1,v2,n,v1r,v2r)
  implicit none
  integer, intent(in) :: n
  real*4,dimension(n) :: v1,v2,v1r,v2r,v1_cp,v2_cp
  integer :: i
  !$acc region
  !$acc do private(v1_cp,v2_cp)
  do i=1,n
    v1_cp(i)=baser11(i)*v1(i)-baser12(i)*v2(i)
    v2_cp(i)=baser12(i)*v1(i)+baser11(i)*v2(i)
    v1r(i)=v1_cp(i)
    v2r(i)=v2_cp(i)
  enddo
  !$acc end region
end subroutine base_change_vect_rec
Finally, the main code calling the subroutine:
do i=start_shot,nshots
  call sync_kernel_device()
  do j=1,i
    call base_change_vect_rec(ur1,ur2,npts3d,ur1,ur2)
    call base_change_vect_rec(ur1b,ur2b,npts3d,ur1b,ur2b)
  enddo
  call sync_kernel_host()
enddo
As it stands, the code spends a very long time moving data, and the compiler always generates a copyin and a copyout for the arrays declared in
real*4,dimension(n) :: v1,v2,v1r,v2r,v1_cp,v2_cp
This is natural, since the compiler does not know that the arrays passed to the subroutine are already on the device (they are declared as mirrored) and are not needed on the host after the loop.
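To make that point concrete, here is a minimal, untested sketch of one way the compiler might be told that the dummy arguments already live on the device. It assumes the reflected directive of the PGI Accelerator model is available in this compiler version and behaves as documented; that directive needs an explicit interface, so the routine is moved into a module. The module name base_change_mod is hypothetical, placing baser11 and baser12 in it as mirrored arrays is an assumption (the proxy above does not show where they are declared), and scalar temporaries replace v1_cp and v2_cp so that no extra arrays need to be created or copied.

```fortran
module base_change_mod            ! hypothetical module name
  implicit none
  ! Assumption: the rotation coefficients are module arrays and mirrored too.
  real*4, dimension(:), allocatable :: baser11, baser12
  !$acc mirror(baser11, baser12)
contains
  subroutine base_change_vect_rec(v1, v2, n, v1r, v2r)
    integer, intent(in) :: n
    real*4, dimension(n) :: v1, v2, v1r, v2r
    real*4 :: t1, t2              ! scalar temporaries instead of v1_cp, v2_cp
    integer :: i
    ! Declares that the caller already holds device copies of these dummies
    ! (for example because the actual arguments are mirrored).
    !$acc reflected(v1, v2, v1r, v2r)
    !$acc region
    do i = 1, n
      t1 = baser11(i)*v1(i) - baser12(i)*v2(i)
      t2 = baser12(i)*v1(i) + baser11(i)*v2(i)
      v1r(i) = t1
      v2r(i) = t2
    enddo
    !$acc end region
  end subroutine base_change_vect_rec
end module base_change_mod
```

The calling code would then need a use base_change_mod statement so that the explicit interface is visible. Whether this works when the same array is passed as both input and output, as in the calls above, is not something the sketch verifies.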
The GPU cluster I am using is under maintenance today, so I can't access the compiler output and timing results. If necessary, I can post them later.
How can I optimise the code and avoid all this data movement to and from the device?
I am compiling with version 11.7!
Thanks in advance