Thanks for the help, Mat, and for suggesting PCAST. For porting Fortran kernels I have typically been using KGen from NCAR (GitHub - NCAR/KGen: Fortran Kernel Generator), but it sounds great to have a tool that runs the CPU and GPU code simultaneously for comparison.
Here is the compiler output from a more complete example:
pc_hyp_mol_flux:
    164, Generating update device(nvar)
    165, Generating enter data copyin(flux3(:,:,:,:),hi(:),lo(:),v(:,:,:),q(:,:,:,:),flux2(:,:,:,:),flux1(:,:,:,:))
         Generating enter data create(d(:,:,:,:))
         Generating enter data copyin(ax(:,:,:))
    167, Accelerator kernel generated
         Generating Tesla code
        168, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
        169,   ! blockidx%x threadidx%x collapsed
        170,   ! blockidx%x threadidx%x collapsed
        171, !$acc loop seq
        175, !$acc loop seq
    167, Generating implicit present(ax(ilo1+1:ihi1,ilo2:ihi2,ilo3:ihi3),q(ilo1+1:ihi1,ilo2:ihi2,ilo3:ihi3,:1),flux1(ilo1+1:ihi1,ilo2:ihi2,ilo3:ihi3,:nvar))
    171, Loop is parallelizable
    175, Loop is parallelizable
    183, Accelerator kernel generated
         Generating Tesla code
        184, !$acc loop gang, vector(128) collapse(4) ! blockidx%x threadidx%x
        185,   ! blockidx%x threadidx%x collapsed
        186,   ! blockidx%x threadidx%x collapsed
        187,   ! blockidx%x threadidx%x collapsed
    183, Generating implicit present(v(lo-2:hi+2,lo-2:hi+2,lo-2:hi+2),flux3(lo-2:hi+2,lo-2:hi+2,lo-2:hi+3,:nvar),d(lo-2:hi+2,lo-2:hi+2,lo-2:hi+2,:nvar),flux1(lo-2:hi+3,lo-2:hi+2,lo-2:hi+2,:nvar),flux2(lo-2:hi+2,lo-2:hi+3,lo-2:hi+2,:nvar))
    197, Generating exit data copyout(flux1(:,:,:,:))
         Generating exit data delete(hi(:),lo(:),v(:,:,:),q(:,:,:,:),ax(:,:,:))
         Generating exit data copyout(d(:,:,:,:))
And here is the more complete example, with a few line numbers for reference. I’m really at a loss as to what I’m missing with this one:
nextra = 3
ilo1=lo(1)-nextra
ilo2=lo(2)-nextra
ilo3=lo(3)-nextra
ihi1=hi(1)+nextra
ihi2=hi(2)+nextra
ihi3=hi(3)+nextra
do L=1,3
  qt_lo(L) = lo(L) - nextra
  qt_hi(L) = hi(L) + nextra
enddo
!$acc update device(nvar)
!$acc enter data create(d) copyin(flux1,flux2,flux3) copyin(v,ax,q,lo,hi)
!$acc parallel loop gang vector collapse(3) private(flux_tmp) default(present)
do k = ilo3, ihi3 !line 168
  do j = ilo2, ihi2
    do i = ilo1+1, ihi1
      do ivar = 1, NVAR
        flux_tmp(ivar) = 0.d0
      enddo
      flux_tmp(1) = q(i,j,k,1)
      do ivar = 1, NVAR
        flux1(i,j,k,ivar) = flux1(i,j,k,ivar) + flux_tmp(ivar) * ax(i,j,k)
      enddo
    enddo
  enddo
enddo !line 180
!$acc end parallel
!$acc parallel loop gang vector collapse(4) default(present)
do ivar=1,NVAR
  do k = lo(3)-nextra+1, hi(3)+nextra-1
    do j = lo(2)-nextra+1, hi(2)+nextra-1
      do i = lo(1)-nextra+1, hi(1)+nextra-1
        d(i,j,k,ivar) = - (flux1(i+1,j,k,ivar) - flux1(i,j,k,ivar) &
                         + flux2(i,j+1,k,ivar) - flux2(i,j,k,ivar) &
                         + flux3(i,j,k+1,ivar) - flux3(i,j,k,ivar)) / v(i,j,k)
      enddo
    enddo
  enddo
enddo
!$acc end parallel
!$acc exit data copyout(flux1,d) delete(v,ax,q,lo,hi)