I am currently porting a large code with acc directives. It consists of many parts that typically have the following structure:
Version A:

    !$acc region
    ! init
    do k=1,nlev
      do i=1,N
        a(i,k)=0.0D0
      end do
    end do
    ! first layer
    do i=1,N
      a(i,1)=0.1D0
    end do
    ! vertical computation
    do k=2,nlev
      do i=1,N
        a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
      end do
    end do
    !$acc end region
Of course, to get better performance on the GPU, the i loop needs to be moved to the outermost position:
Version B:

    !$acc region
    do i=1,N
      ! init
      do k=1,nlev
        a(i,k)=0.0D0
      end do
      ! first layer
      a(i,1)=0.1D0
      ! vertical computation
      do k=2,nlev
        a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
      end do
    end do
    !$acc end region
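To illustrate why the reordering is safe here, the following is a minimal sketch (not part of the original code) that runs both loop orders on the CPU and compares the results. The reordering is legal because the only dependency is along k within a single column i, so each i can be processed independently. The array sizes, the second array `b`, and the program name are assumptions made for this example, and the acc directives are left out so that it builds with any Fortran compiler.

```fortran
program loop_order_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Small illustrative sizes (assumed, not taken from the real code)
  integer, parameter :: n = 8, nlev = 5
  real(dp) :: a(n,nlev), b(n,nlev)
  integer :: i, k

  ! Version A ordering: k outer, i inner (CPU-friendly)
  do k = 1, nlev
    do i = 1, n
      a(i,k) = 0.0d0
    end do
  end do
  do i = 1, n
    a(i,1) = 0.1d0
  end do
  do k = 2, nlev
    do i = 1, n
      a(i,k) = 0.95d0*a(i,k-1) + exp(-2*a(i,k)**2)*a(i,k)
    end do
  end do

  ! Version B ordering: i outer, k inner (one column per iteration)
  do i = 1, n
    do k = 1, nlev
      b(i,k) = 0.0d0
    end do
    b(i,1) = 0.1d0
    do k = 2, nlev
      b(i,k) = 0.95d0*b(i,k-1) + exp(-2*b(i,k)**2)*b(i,k)
    end do
  end do

  ! Each element sees the same operations in both orderings,
  ! so the two arrays are expected to agree.
  if (maxval(abs(a - b)) <= epsilon(1.0d0)) then
    print *, 'versions A and B agree'
  else
    print *, 'versions A and B differ'
  end if
end program loop_order_check
```

A check of this kind could also serve as a regression test for a script that generates version B from version A.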
For most of the code, this loop reordering alone is enough to get good performance (for the most intensive parts we do have to introduce some additional optimizations, such as replacing local matrices with scalars …).
For the vast majority of the code where only this loop reordering is enough we would like to keep the same source code for the CPU and GPU version.
As stated in one of the PGI Insider articles, we could in principle keep the code in version B, and the compiler should be able to transform it into version A when compiling for a CPU. However, we have seen that this is not always the case with all compilers, and we would like to keep the source code in CPU-optimized form (i.e. version A), as most users are targeting that architecture.
For the moment we are considering writing our own preprocessing scripts to do this loop reordering. I was wondering whether you are considering adding such functionality to the acc directives? (Something like $acc swap (i,k) …)