Hi,
I am currently porting a large code to the GPU with acc directives. It consists of many parts that typically have the following structure:
Version A:
!$acc region
!init
do k=1,nlev
  do i=1,N
    a(i,k)=0.0D0
  end do
end do
! first layer
do i=1,N
  a(i,1)=0.1D0
end do
! vertical computation
do k=2,nlev
  do i=1,N
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
  end do
end do
!$acc end region
Of course, to get better performance on the accelerator one needs to pull the i loop out here, so that it becomes the single outermost loop:
Version B:
!$acc region
do i=1,N
  !init
  do k=1,nlev
    a(i,k)=0.0D0
  end do
  ! first layer
  a(i,1)=0.1D0
  ! vertical computation
  do k=2,nlev
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
  end do
end do
!$acc end region
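The reordering is legal here because the only dependency runs along k (a(i,k) needs a(i,k-1)), while the columns in i are independent of each other. As a sanity check, here is a minimal self-contained sketch that runs both orderings on the CPU and compares them. The module name, routine names and array sizes are made up for this post and are not from our real code, and the acc directives are left out so that it builds with any Fortran compiler.

```fortran
! Sketch only: loop_order_demo, version_a and version_b are hypothetical
! names chosen for this post; the sizes in the driver are arbitrary.
module loop_order_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Version A ordering: k outermost, i innermost.
  subroutine version_a(a, n, nlev)
    integer, intent(in) :: n, nlev
    real(dp), intent(out) :: a(n, nlev)
    integer :: i, k
    ! init
    do k = 1, nlev
      do i = 1, n
        a(i,k) = 0.0d0
      end do
    end do
    ! first layer
    do i = 1, n
      a(i,1) = 0.1d0
    end do
    ! vertical computation
    do k = 2, nlev
      do i = 1, n
        a(i,k) = 0.95d0*a(i,k-1) + exp(-2*a(i,k)**2)*a(i,k)
      end do
    end do
  end subroutine version_a

  ! Version B ordering: one outer i loop, all k loops inside it.
  subroutine version_b(a, n, nlev)
    integer, intent(in) :: n, nlev
    real(dp), intent(out) :: a(n, nlev)
    integer :: i, k
    do i = 1, n
      ! init
      do k = 1, nlev
        a(i,k) = 0.0d0
      end do
      ! first layer
      a(i,1) = 0.1d0
      ! vertical computation
      do k = 2, nlev
        a(i,k) = 0.95d0*a(i,k-1) + exp(-2*a(i,k)**2)*a(i,k)
      end do
    end do
  end subroutine version_b

end module loop_order_demo

program compare_orderings
  use loop_order_demo
  implicit none
  integer, parameter :: n = 64, nlev = 20   ! arbitrary test sizes
  real(dp) :: a_ref(n, nlev), a_swp(n, nlev)

  call version_a(a_ref, n, nlev)
  call version_b(a_swp, n, nlev)

  ! Largest element-wise difference between the two orderings.
  print *, 'max abs difference:', maxval(abs(a_ref - a_swp))
end program compare_orderings
```

Each element goes through the same sequence of operations in both versions, so I would expect the difference to be zero or at rounding level, but I have only sketched this and not checked it across compilers.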
For most of the code, this loop reordering alone is enough to get good performance (for the most intensive parts we do have to introduce some additional optimizations, like replacing local matrices with scalars …).
For the vast majority of the code, where this loop reordering alone is enough, we would like to keep the same source code for the CPU and GPU versions.
As said in one of the PGI Insider articles, we could in principle keep the code in version B form, and the compiler should be able to transform it into version A when compiling for a CPU. However, we have seen that not all compilers do this reliably, and we would like to keep the source code in its CPU-optimized form (i.e. version A), as most users target that architecture.
For the moment we are considering writing our own preprocessing scripts to do this loop reordering. I was just wondering whether you are considering adding such functionality to the acc directives? (Something like $acc swap (i,k) …)
Thanks,
Xavier