Loops seem to have been parallelized, but the results are wrong

Hello, everyone.
I am porting my CPU code using acc directives.
While the loops of my subroutine seem to have been parallelized, the results are always wrong. I have also tried other subroutines, and 5 of them still produce wrong data. I don't know why, even though I have read some good articles. I'd like to copy one of my subroutines here, followed by the compiler information messages:

subroutine getU12
! --------------------------------------------------------------------
! get ( uxp1,uyp1,uzp1 )
! get ( uxp2,uyp2,uzp2 )
! get ( Uxc,Uyc,Uzc )
! get ( Uxf,Uxu,Uys,Uyu,Uzs,Uzf )
! --------------------------------------------------------------------
include 'lmhd-csm31.h'

!$acc data copyin(ux1,uy1,uz1,dy,dz), copyout(uxp1,uyp1,uzp1,uxp2,uyp2,uzp2,uxc,uyc,uzc,Uxf,Uxu,Uys,Uyu,Uzs,Uzf)

!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+1
do j = yele+js, je+yele+js
do i = is+1, ie+2
uxp1(i,j,k) = ( dy(j)*ux1(i,j-1,k) + dy(j-1)*ux1(i,j,k) ) &
/ ( dy(j-1) + dy(j) )
end do
end do
end do
!$acc end kernels

!: get Uyp1, Uxc, Uyc, Uzc
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+1
do j = yele+js, je+yele
do i = is+1, ie+2
uyp1(i,j,k) = ( uy1(i,j,k) + uy1(i-1,j,k) ) * c1f2
uxc(i,j,k) = ( ux1(i,j,k) + ux1(i+1,j,k) ) * c1f2
uyc(i,j,k) = ( uy1(i,j,k) + uy1(i,j+1,k) ) * c1f2
uzc(i,j,k) = ( uz1(i,j,k) + uz1(i,j,k+1) ) * c1f2
end do
end do
end do
!$acc end kernels
!!$OMP PARALLEL DO PRIVATE(i)
!$acc kernels DO PRIVATE(i,k)
do k = ks+1, ke+1
do i = is+1, ie+2
uyp1(i,je+yele+js,k) = ( uy1(i,je+yele+js,k) &
+ uy1(i-1,je+yele+js,k) ) * c1f2
end do
end do
!$acc end kernels

!: get Uzp1
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+2
do j = yele+js, je+yele+js
do i = is+1, ie+1
uzp1(i,j,k) = ( dy(j)*uz1(i,j-1,k) + dy(j-1)*uz1(i,j,k) ) &
/ ( dy(j-1) + dy(j) )
end do
end do
end do
!$acc end kernels

!: get Uxp2, Uzp2
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+2
do j = yele+js, je+yele
do i = is+1, ie+2
uxp2(i,j,k) = ( dz(k)*ux1(i,j,k-1) + dz(k-1)*ux1(i,j,k) ) &
/ ( dz(k-1) + dz(k) )
uzp2(i,j,k) = ( uz1(i,j,k) + uz1(i-1,j,k) ) * c1f2
end do
end do
end do
!$acc end kernels

!: get Uyp2
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+2
do j = yele+js, je+yele+js
do i = is+1, ie+1
uyp2(i,j,k) = ( dz(k)*uy1(i,j,k-1) + dz(k-1)*uy1(i,j,k) ) &
/ ( dz(k-1) + dz(k) )
end do
end do
end do
!$acc end kernels

!: get Uxf, Uzf
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+1
do j = yele+js+1, je+yele
do i = is+1, ie+1
uxf(i,j,k) = ( dy(j-1)*uxc(i,j,k) + dy(j)*uxc(i,j-1,k) ) &
/ ( dy(j) + dy(j-1))
uzf(i,j,k) = ( dy(j-1)*uzc(i,j,k) + dy(j)*uzc(i,j-1,k) ) &
/ ( dy(j) + dy(j-1) )
end do
end do
end do
!$acc end kernels

!!$OMP PARALLEL DO PRIVATE(i)
!$acc kernels DO PRIVATE(i,k)
do k = ks+1, ke+1
do i = is+1, ie+1
uxf(i,yele+js,k) = c0
uxf(i,je+yele+js,k) = c0
uzf(i,yele+js,k) = c0
uzf(i,je+yele+js,k) = c0
end do
end do
!$acc end kernels

!: get Uxu, Uyu
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+2, ke+1
do j = yele+js, je+yele
do i = is+1, ie+1
uxu(i,j,k) = ( dz(k)*uxc(i,j,k-1) + dz(k-1)*uxc(i,j,k) ) &
/ ( dz(k-1) + dz(k) )
uyu(i,j,k) = ( dz(k)*uyc(i,j,k-1) + dz(k-1)*uyc(i,j,k) ) &
/ ( dz(k-1) + dz(k) )
end do
end do
end do
!$acc end kernels

!!$OMP PARALLEL DO PRIVATE(i)
!$acc kernels DO PRIVATE(i,j)
do j = yele+js, je+yele
do i = is+1, ie+1
uxu(i,j,ks+1) = c0
uxu(i,j,ke+2) = c0
uyu(i,j,ks+1) = c0
uyu(i,j,ke+2) = c0
end do
end do
!$acc end kernels

!: get Uys, Uzs
!!$OMP PARALLEL DO PRIVATE(i,j)
!$acc kernels DO PRIVATE(i,j,k)
do k = ks+1, ke+1
do j = yele+js, je+yele
do i = is+1, ie+2
uys(i,j,k) = ( uyc(i,j,k) + uyc(i-1,j,k) ) * c1f2
uzs(i,j,k) = ( uzc(i,j,k) + uzc(i-1,j,k) ) * c1f2
end do
end do
end do
!$acc end kernels
!$acc end data
return
end


getu12
1864, Generating copyout(uzf(:,:,:))
Generating copyout(uzs(:,:,:))
Generating copyout(uyu(:,:,:))
Generating copyout(uys(:,:,:))
Generating copyout(uxu(:,:,:))
Generating copyout(uxf(:,:,:))
Generating copyout(uzc(:,:,:))
Generating copyout(uyc(:,:,:))
Generating copyout(uxc(:,:,:))
Generating copyout(uzp2(:,:,:))
Generating copyout(uyp2(:,:,:))
Generating copyout(uxp2(:,:,:))
Generating copyout(uzp1(:,:,:))
Generating copyout(uyp1(:,:,:))
Generating copyout(uxp1(:,:,:))
Generating copyin(dz(:))
Generating copyin(dy(:))
Generating copyin(uz1(:,:,:))
Generating copyin(uy1(:,:,:))
Generating copyin(ux1(:,:,:))
1869, Generating present_or_copyin(dy(:))
Generating present_or_copyin(ux1(:,:,:))
Generating present_or_copyout(uxp1(:,:,:))
Generating compute capability 2.0 binary
1870, Loop is parallelizable
1871, Loop is parallelizable
1872, Loop is parallelizable
Accelerator kernel generated
1870, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of ‘ux1’
1871, !$acc loop vector(4) ! threadidx%y
1872, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 26 registers; 0 shared, 108 constant, 0 local memory bytes
1915, Generating present_or_copyout(uzc(:,:,:))
Generating present_or_copyin(uz1(:,:,:))
Generating present_or_copyout(uyc(:,:,:))
Generating present_or_copyin(uy1(:,:,:))
Generating present_or_copyout(uxc(:,:,:))
Generating present_or_copyin(ux1(:,:,:))
Generating present_or_copyout(uyp1(:,:,:))
Generating compute capability 2.0 binary
1916, Loop is parallelizable
1917, Loop is parallelizable
1918, Loop is parallelizable
Accelerator kernel generated
1916, !$acc loop vector(4) ! threadidx%y
1917, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of ‘uz1’
Cached references to size [(x+1)x(y)] block of ‘ux1’
Cached references to size [(x+1)x2x(y)] block of ‘uy1’
1918, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 32 registers; 0 shared, 136 constant, 0 local memory bytes
1928, Generating present_or_copyin(uy1(:,:,:))
Generating present_or_copyout(uyp1(:,:,:))
Generating compute capability 2.0 binary
1929, Loop is parallelizable
1930, Loop is parallelizable
Accelerator kernel generated
1929, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
Cached references to size [(x+1)x(y)] block of ‘uy1’
1930, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 20 registers; 0 shared, 88 constant, 0 local memory bytes
1939, Generating present_or_copyin(dy(:))
Generating present_or_copyin(uz1(:,:,:))
Generating present_or_copyout(uzp1(:,:,:))
Generating compute capability 2.0 binary
1940, Loop is parallelizable
1941, Loop is parallelizable
1942, Loop is parallelizable
Accelerator kernel generated
1940, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of ‘uz1’
1941, !$acc loop vector(4) ! threadidx%y
1942, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 26 registers; 0 shared, 108 constant, 0 local memory bytes
1986, Generating present_or_copyin(ux1(:,:,:))
Generating present_or_copyin(dz(:))
Generating present_or_copyout(uxp2(:,:,:))
Generating present_or_copyin(uz1(:,:,:))
Generating present_or_copyout(uzp2(:,:,:))
Generating compute capability 2.0 binary
1987, Loop is parallelizable
1988, Loop is parallelizable
1989, Loop is parallelizable
Accelerator kernel generated
1987, !$acc loop vector(4) ! threadidx%y
1988, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of ‘ux1’
Cached references to size [(x+1)x(y)] block of ‘uz1’
1989, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 28 registers; 0 shared, 136 constant, 0 local memory bytes
2028, Generating present_or_copyin(uy1(:,:,:))
Generating present_or_copyin(dz(:))
Generating present_or_copyout(uyp2(:,:,:))
Generating compute capability 2.0 binary
2029, Loop is parallelizable
2030, Loop is parallelizable
2031, Loop is parallelizable
Accelerator kernel generated
2029, !$acc loop vector(4) ! threadidx%y
2030, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of ‘uy1’
2031, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 26 registers; 0 shared, 108 constant, 0 local memory bytes
2066, Generating present_or_copyin(dy(:))
Generating present_or_copyout(uxc(:,:,:))
Generating present_or_copyout(uxf(:,:,:))
Generating present_or_copyout(uzc(:,:,:))
Generating present_or_copyout(uzf(:,:,:))
Generating compute capability 2.0 binary
2067, Loop is parallelizable
2068, Loop is parallelizable
2069, Loop is parallelizable
Accelerator kernel generated
2067, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of ‘uxc’
Cached references to size [(x)x(y+1)] block of ‘uzc’
2068, !$acc loop vector(4) ! threadidx%y
2069, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 40 registers; 0 shared, 124 constant, 0 local memory bytes
2108, Generating present_or_copyout(uxf(:,:,:))
Generating present_or_copyout(uzf(:,:,:))
Generating compute capability 2.0 binary
2109, Loop is parallelizable
2110, Loop is parallelizable
Accelerator kernel generated
2109, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
2110, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 21 registers; 0 shared, 84 constant, 0 local memory bytes
2122, Generating present_or_copyin(dz(:))
Generating present_or_copyout(uxc(:,:,:))
Generating present_or_copyout(uxu(:,:,:))
Generating present_or_copyout(uyc(:,:,:))
Generating present_or_copyout(uyu(:,:,:))
Generating compute capability 2.0 binary
2123, Loop is parallelizable
2124, Loop is parallelizable
2125, Loop is parallelizable
Accelerator kernel generated
2123, !$acc loop vector(4) ! threadidx%y
2124, !$acc loop gang ! blockidx%y
Cached references to size [(x)x(y+1)] block of ‘uxc’
Cached references to size [(x)x(y+1)] block of ‘uyc’
2125, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 40 registers; 0 shared, 124 constant, 0 local memory bytes
2164, Generating present_or_copyout(uxu(:,:,:))
Generating present_or_copyout(uyu(:,:,:))
Generating compute capability 2.0 binary
2165, Loop is parallelizable
2166, Loop is parallelizable
Accelerator kernel generated
2165, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
2166, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 21 registers; 0 shared, 80 constant, 0 local memory bytes
2177, Generating present_or_copyout(uyc(:,:,:))
Generating present_or_copyout(uys(:,:,:))
Generating present_or_copyout(uzc(:,:,:))
Generating present_or_copyout(uzs(:,:,:))
Generating compute capability 2.0 binary
2178, Loop is parallelizable
2179, Loop is parallelizable
2180, Loop is parallelizable
Accelerator kernel generated
2178, Cached references to size [(x+1)x(y)] block of ‘uyc’
Cached references to size [(x+1)x(y)] block of ‘uzc’
2179, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
2180, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
CC 2.0 : 34 registers; 0 shared, 112 constant, 0 local memory bytes
The strange result is that some values are much bigger than what the CPU code computed before. Does anyone know why? Thanks anyway.

[0]
I would like to add that the errors above are not so obvious, since some of them (sometimes only part of one output array is strange) are only detected by the subroutines that follow.

[1]
I have tried both !$acc kernels and !$acc region. Neither gives the right answer. Is there any difference between them?

[2]
As you can see, I have used the !$acc data copy ... lines. However, if I don't use them, I get even more detailed information messages about copyin and copyout. Does that mean I don't need to write them myself?

[3]
Thank you to everyone who has written good examples that are simple enough to understand. Actual code is never that short, though. I would really appreciate a longer example with much more explanation, like the article about MC from Mat, or even bigger.

Hi Kevin,

I would like to add that the errors above are not so obvious, since some of them (sometimes only part of one output array is strange) are only detected by the subroutines that follow.

What happens if you change the data region’s “copyout” clauses to “copy”? A copyout overwrites the entire host array with the data from the GPU. However, it doesn’t look like your loops assign values to all elements of these arrays, so undefined values are being copied back. Using “copy” initializes the device arrays from the host first and ensures that no undefined values are copied back.
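
A minimal sketch of this effect, assuming a 1-D array whose boundary elements are never written inside the kernel. The module and routine names (interior_update, fill_interior) are hypothetical and are not taken from the original code:

```fortran
module interior_update
  implicit none
contains
  ! Hypothetical helper, not part of the original code: it writes only the
  ! interior of a; the boundary elements a(1) and a(n) are never assigned.
  subroutine fill_interior(a, b, n)
    integer, intent(in)    :: n
    real,    intent(in)    :: b(n)
    real,    intent(inout) :: a(n)
    integer :: i

    ! copy(a): the host values of a are sent to the device first, so the
    ! untouched boundary elements come back unchanged. With copyout(a)
    ! the device copy would start uninitialized, and the boundary values
    ! copied back at "end data" would be undefined.
    !$acc data copyin(b) copy(a)
    !$acc kernels
    do i = 2, n - 1
      a(i) = 0.5 * (b(i-1) + b(i+1))
    end do
    !$acc end kernels
    !$acc end data
  end subroutine fill_interior
end module interior_update
```

If the boundary values are set on the host before the call, they should survive the call with copy, whereas with copyout they could come back as arbitrary (possibly very large) numbers.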

I have tried both !$acc kernels and !$acc region. Neither gives the right answer. Is there any difference between them?

They are basically the same thing. “kernels” is the OpenACC spelling while “region” is the PGI Accelerator Model spelling.

As you can see, I have used the !$acc data copy ... lines. However, if I don't use them, I get even more detailed information messages about copyin and copyout. Does that mean I don't need to write them myself?

You don’t need to use the data statements, since the compiler can in most cases determine the data movement itself. However, I suspect that the compiler will only copy over the interior of the arrays, since it defaults to copying the minimal amount of data. This would mean that multiple small chunks get copied, which slows down performance. Hence, I’d keep the data region and tell the compiler to copy over the whole array.
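
As a rough illustration of keeping the data region and naming the whole arrays, here is a small sketch with two kernels inside one region. The routine smooth_twice and its arguments are hypothetical and only serve to show the clause syntax:

```fortran
module whole_array_region
  implicit none
contains
  ! Hypothetical example routine: two kernels share one data region, and
  ! the clause names the full extent of each array explicitly.
  subroutine smooth_twice(u, tmp, n, m)
    integer, intent(in)    :: n, m
    real,    intent(inout) :: u(n, m)
    real,    intent(inout) :: tmp(n, m)
    integer :: i, j

    ! The whole arrays are transferred once and stay on the device for
    ! both kernels, instead of being moved per kernel in smaller pieces.
    !$acc data copy(u(1:n,1:m), tmp(1:n,1:m))
    !$acc kernels
    do j = 1, m
      do i = 2, n - 1
        tmp(i, j) = 0.5 * (u(i-1, j) + u(i+1, j))
      end do
    end do
    !$acc end kernels

    !$acc kernels
    do j = 1, m
      do i = 2, n - 1
        u(i, j) = tmp(i, j)
      end do
    end do
    !$acc end kernels
    !$acc end data
  end subroutine smooth_twice
end module whole_array_region
```

Because both arrays use copy rather than copyout, the elements that the loops skip (i = 1 and i = n) keep their host values.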

Hope this helps,
Mat

Hi, Mat

Thank you very much for your kind help. After changing the copyout to copy, I got the right value.