Hello, everyone.
I am converting my CPU code with acc directives.
Although the loops of my subroutine seem to have been parallelized, the results are always wrong. I have also tried other subroutines, and 5 of them still produce wrong data. I don't know why, even though I have read some good articles.
I'd like to copy one of my subroutines here, followed by the compiler information messages:
subroutine getU12
! --------------------------------------------------------------------
! get ( uxp1,uyp1,uzp1 )
! get ( uxp2,uyp2,uzp2 )
! get ( Uxc,Uyc,Uzc )
! get ( Uxf,Uxu,Uys,Uyu,Uzs,Uzf )
! --------------------------------------------------------------------
include 'lmhd-csm31.h'
!$acc data copyin(ux1,uy1,uz1,dy,dz), copyout(uxp1,uyp1,uzp1,uxp2,uyp2,uzp2,uxc,uyc,uzc,Uxf,Uxu,Uys,Uyu,Uzs,Uzf)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+1
do j = yele+js, je+yele+js
do i = is+1, ie+2
uxp1(i,j,k) = ( dy(j)*ux1(i,j-1,k) + dy(j-1)*ux1(i,j,k) ) &
/ ( dy(j-1) + dy(j) )
end do
end do
end do
!$acc end kernels
!: get Uyp1, Uxc, Uyc, Uzc
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+1
do j = yele+js, je+yele
do i = is+1, ie+2
uyp1(i,j,k) = ( uy1(i,j,k) + uy1(i-1,j,k) ) * c1f2
uxc(i,j,k) = ( ux1(i,j,k) + ux1(i+1,j,k) ) * c1f2
uyc(i,j,k) = ( uy1(i,j,k) + uy1(i,j+1,k) ) * c1f2
uzc(i,j,k) = ( uz1(i,j,k) + uz1(i,j,k+1) ) * c1f2
end do
end do
end do
!$acc end kernels
!!$OMP PARALLEL DO PRIVATE(i)
!$acc kernels DO PRIVATE(i,k)
do k = ks+1, ke+1
do i = is+1, ie+2
uyp1(i,je+yele+js,k) = ( uy1(i,je+yele+js,k) &
- uy1(i-1,je+yele+js,k) ) * c1f2
end do
end do
!$acc end kernels
!: get Uzp1
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+2
do j = yele+js, je+yele+js
do i = is+1, ie+1
uzp1(i,j,k) = ( dy(j)*uz1(i,j-1,k) + dy(j-1)*uz1(i,j,k) ) &
/ ( dy(j-1) + dy(j) )
end do
end do
end do
!$acc end kernels
!: get Uxp2, Uzp2
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+2
do j = yele+js, je+yele
do i = is+1, ie+2
uxp2(i,j,k) = ( dz(k)*ux1(i,j,k-1) + dz(k-1)*ux1(i,j,k) ) &
/ ( dz(k-1) + dz(k) )
uzp2(i,j,k) = ( uz1(i,j,k) + uz1(i-1,j,k) ) * c1f2
end do
end do
end do
!$acc end kernels
!: get Uyp2
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+2
do j = yele+js, je+yele+js
do i = is+1, ie+1
uyp2(i,j,k) = ( dz(k)*uy1(i,j,k-1) + dz(k-1)*uy1(i,j,k) ) &
/ ( dz(k-1) + dz(k) )
end do
end do
end do
!$acc end kernels
!: get Uxf, Uzf
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+1
do j = yele+js+1, je+yele
do i = is+1, ie+1
uxf(i,j,k) = ( dy(j-1)*uxc(i,j,k) + dy(j)*uxc(i,j-1,k) ) &
/ ( dy(j) + dy(j-1))
uzf(i,j,k) = ( dy(j-1)*uzc(i,j,k) + dy(j)*uzc(i,j-1,k) ) &
/ ( dy(j) + dy(j-1) )
end do
end do
end do
!$acc end kernels
!!$OMP PARALLEL DO PRIVATE(i)
!$acc kernels DO PRIVATE(i,k)
do k = ks+1, ke+1
do i = is+1, ie+1
uxf(i,yele+js,k) = c0
uxf(i,je+yele+js,k) = c0
uzf(i,yele+js,k) = c0
uzf(i,je+yele+js,k) = c0
end do
end do
!$acc end kernels
!: get Uxu, Uyu
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+2, ke+1
do j = yele+js, je+yele
do i = is+1, ie+1
uxu(i,j,k) = ( dz(k)*uxc(i,j,k-1) + dz(k-1)*uxc(i,j,k) ) &
/ ( dz(k-1) + dz(k) )
uyu(i,j,k) = ( dz(k)*uyc(i,j,k-1) + dz(k-1)*uyc(i,j,k) ) &
/ ( dz(k-1) + dz(k) )
end do
end do
end do
!$acc end kernels
!!$OMP PARALLEL DO PRIVATE(i)
!$acc kernels DO PRIVATE(i,j)
do j = yele+js, je+yele
do i = is+1, ie+1
uxu(i,j,ks+1) = c0
uxu(i,j,ke+2) = c0
uyu(i,j,ks+1) = c0
uyu(i,j,ke+2) = c0
end do
end do
!$acc end kernels
!: get Uys, Uzs
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+1
do j = yele+js, je+yele
do i = is+1, ie+2
uys(i,j,k) = ( uyc(i,j,k) + uyc(i-1,j,k) ) * c1f2
uzs(i,j,k) = ( uzc(i,j,k) + uzc(i-1,j,k) ) * c1f2
end do
end do
end do
!$acc end kernels
!$acc end data
return
end
getu12
1864, Generating copyout(uzf(:,:,:))
Generating copyout(uzs(:,:,:))
Generating copyout(uyu(:,:,:))
Generating copyout(uys(:,:,:))
Generating copyout(uxu(:,:,:))
Generating copyout(uxf(:,:,:))
Generating copyout(uzc(:,:,:))
Generating copyout(uyc(:,:,:))
Generating copyout(uxc(:,:,:))
Generating copyout(uzp2(:,:,:))
Generating copyout(uyp2(:,:,:))
Generating copyout(uxp2(:,:,:))
Generating copyout(uzp1(:,:,:))
Generating copyout(uyp1(:,:,:))
Generating copyout(uxp1(:,:,:))
Generating copyin(dz(:))
Generating copyin(dy(:))
Generating copyin(uz1(:,:,:))
Generating copyin(uy1(:,:,:))
Generating copyin(ux1(:,:,:))
1869, Generating present_or_copyin(dy(:))
Generating present_or_copyin(ux1(:,:,:))
Generating present_or_copyout(uxp1(:,:,:))
Generating compute capability 2.0 binary
1870, Loop is parallelizable
1871, Loop is parallelizable
1872, Loop is parallelizable
Accelerator kernel generated
1870, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of 'ux1'
1871, !$acc loop vector(4) ! threadidx%y
1872, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 26 registers; 0 shared, 108 constant, 0 local memory bytes
1915, Generating present_or_copyout(uzc(:,:,:))
Generating present_or_copyin(uz1(:,:,:))
Generating present_or_copyout(uyc(:,:,:))
Generating present_or_copyin(uy1(:,:,:))
Generating present_or_copyout(uxc(:,:,:))
Generating present_or_copyin(ux1(:,:,:))
Generating present_or_copyout(uyp1(:,:,:))
Generating compute capability 2.0 binary
1916, Loop is parallelizable
1917, Loop is parallelizable
1918, Loop is parallelizable
Accelerator kernel generated
1916, !$acc loop vector(4) ! threadidx%y
1917, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of 'uz1'
Cached references to size [(x+1)x(y)] block of 'ux1'
Cached references to size [(x+1)x2x(y)] block of 'uy1'
1918, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 32 registers; 0 shared, 136 constant, 0 local memory bytes
1928, Generating present_or_copyin(uy1(:,:,:))
Generating present_or_copyout(uyp1(:,:,:))
Generating compute capability 2.0 binary
1929, Loop is parallelizable
1930, Loop is parallelizable
Accelerator kernel generated
1929, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
Cached references to size [(x+1)x(y)] block of 'uy1'
1930, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 20 registers; 0 shared, 88 constant, 0 local memory bytes
1939, Generating present_or_copyin(dy(:))
Generating present_or_copyin(uz1(:,:,:))
Generating present_or_copyout(uzp1(:,:,:))
Generating compute capability 2.0 binary
1940, Loop is parallelizable
1941, Loop is parallelizable
1942, Loop is parallelizable
Accelerator kernel generated
1940, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of 'uz1'
1941, !$acc loop vector(4) ! threadidx%y
1942, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 26 registers; 0 shared, 108 constant, 0 local memory bytes
1986, Generating present_or_copyin(ux1(:,:,:))
Generating present_or_copyin(dz(:))
Generating present_or_copyout(uxp2(:,:,:))
Generating present_or_copyin(uz1(:,:,:))
Generating present_or_copyout(uzp2(:,:,:))
Generating compute capability 2.0 binary
1987, Loop is parallelizable
1988, Loop is parallelizable
1989, Loop is parallelizable
Accelerator kernel generated
1987, !$acc loop vector(4) ! threadidx%y
1988, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of 'ux1'
Cached references to size [(x+1)x(y)] block of 'uz1'
1989, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 28 registers; 0 shared, 136 constant, 0 local memory bytes
2028, Generating present_or_copyin(uy1(:,:,:))
Generating present_or_copyin(dz(:))
Generating present_or_copyout(uyp2(:,:,:))
Generating compute capability 2.0 binary
2029, Loop is parallelizable
2030, Loop is parallelizable
2031, Loop is parallelizable
Accelerator kernel generated
2029, !$acc loop vector(4) ! threadidx%y
2030, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of 'uy1'
2031, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 26 registers; 0 shared, 108 constant, 0 local memory bytes
2066, Generating present_or_copyin(dy(:))
Generating present_or_copyout(uxc(:,:,:))
Generating present_or_copyout(uxf(:,:,:))
Generating present_or_copyout(uzc(:,:,:))
Generating present_or_copyout(uzf(:,:,:))
Generating compute capability 2.0 binary
2067, Loop is parallelizable
2068, Loop is parallelizable
2069, Loop is parallelizable
Accelerator kernel generated
2067, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of 'uxc'
Cached references to size [(x)x(y+1)] block of 'uzc'
2068, !$acc loop vector(4) ! threadidx%y
2069, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 40 registers; 0 shared, 124 constant, 0 local memory bytes
2108, Generating present_or_copyout(uxf(:,:,:))
Generating present_or_copyout(uzf(:,:,:))
Generating compute capability 2.0 binary
2109, Loop is parallelizable
2110, Loop is parallelizable
Accelerator kernel generated
2109, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
2110, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 21 registers; 0 shared, 84 constant, 0 local memory bytes
2122, Generating present_or_copyin(dz(:))
Generating present_or_copyout(uxc(:,:,:))
Generating present_or_copyout(uxu(:,:,:))
Generating present_or_copyout(uyc(:,:,:))
Generating present_or_copyout(uyu(:,:,:))
Generating compute capability 2.0 binary
2123, Loop is parallelizable
2124, Loop is parallelizable
2125, Loop is parallelizable
Accelerator kernel generated
2123, !$acc loop vector(4) ! threadidx%y
2124, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of 'uxc'
Cached references to size [(x)x(y+1)] block of 'uyc'
2125, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 40 registers; 0 shared, 124 constant, 0 local memory bytes
2164, Generating present_or_copyout(uxu(:,:,:))
Generating present_or_copyout(uyu(:,:,:))
Generating compute capability 2.0 binary
2165, Loop is parallelizable
2166, Loop is parallelizable
Accelerator kernel generated
2165, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
2166, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 21 registers; 0 shared, 80 constant, 0 local memory bytes
2177, Generating present_or_copyout(uyc(:,:,:))
Generating present_or_copyout(uys(:,:,:))
Generating present_or_copyout(uzc(:,:,:))
Generating present_or_copyout(uzs(:,:,:))
Generating compute capability 2.0 binary
2178, Loop is parallelizable
2179, Loop is parallelizable
2180, Loop is parallelizable
Accelerator kernel generated
2178, Cached references to size [(x+1)x(y)] block of 'uyc'
Cached references to size [(x+1)x(y)] block of 'uzc'
2179, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
2180, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 34 registers; 0 shared, 112 constant, 0 local memory bytes
The strange thing is that some of the values are much larger than those computed by the CPU code. Does anyone know why? Thanks.