data corruption issue

I have something really baffling going on, both in 15.7 and 15.10. Not sure whether it’s a PGI issue or something in the surrounding code, but let’s say at least the fact that the $acc update call seemingly isn’t doing what it’s being told is really irking me and I’m running out of ideas here. The second print statement prints the same, no matter whether the update directives are there or not. Commenting code inside the kernel doesn’t seem to help either. Do you have an idea on how to proceed from here? I’m afraid I didn’t have time yet for a full reproducer.

code

!$acc update host(jd_pf)
print *, "jd_pf@10,1,1 before second common_vars kernel", jd_pf( 10, 1, 1 )
!$acc update device(jd_pf)

!$acc kernels present(dens_ptb_v_in) present(dens_ref_f) present(pres_ptb) present(dens_ptb) &
!$acc& present(rdensjd) present(mptmp) present(ptemp) present(exner) present(rmpt_ref_f) present(temp) present(rmpt_ptb_v_in) &
!$acc& present(pres_ref_f) present(jd_pf) present(rqa_v_in) present(qa)

!$acc loop independent vector(16)
 do j=0,ny+1
!$acc loop independent vector(16)
  do i=0,nx+1
!$acc loop seq
   do k = nz_mn, nz_mx
    if (i .eq. 10 .and. j .eq. 1 .and. k .eq. 1) then
      print *, "jd_pf@10,1,1 inside second common_vars kernel", jd_pf( 10, 1, 1 )
    end if
    rmpt_ptb = rmpt_ptb_v_in( i, j, k )* jd_pf( i, j, k )
    rmpt = rmpt_ref_f( i, j, k )+ rmpt_ptb
    exner( i, j, k )= ( rmpt * gasr * vp0 ) ** rdvcprd
    pres_ptb( i, j, k )= gasr * exner( i, j, k )* rmpt - pres_ref_f( i, j, k )
    rdens = 1._rp / ( dens_ref_f( i, j, k )+ dens_ptb( i, j, k ))
    coef = 1._rp / ( 1.0d0 + r608 * qa( i, j, k, id_qv )- qa( i, j, k, id_qc )- qa( &
    & i, j, k, id_qr )- qa( i, j, k, id_qi )- qa( i, j, k, id_qs )- qa( &
    & i, j, k, id_qg )- qa( i, j, k, id_qh ))
    temp( i, j, k )= ( pres_ptb( i, j, k )+ pres_ref_f( i, j, k )) * vrd * rdens * coef
   end do
  end do
 end do
!$acc end kernels

output

jd_pf@10,1,1 before second common_vars kernel   9.9999999999999995E-007
 jd_pf@10,1,1 inside second common_vars kernel   8.8888899999999995E-007

compiler call

pgf90 -g -O0 -Mchkptr -Kieee  -Minfo=accel,inline,ipa -Mneginfo -Minform=inform -acc -Mcuda=6.0,cc3x -ta=tesla:cc3x,keepgpu,keepbin,time -Minline=levels:5,reshape -DGPU -byteswapio -Mmpi=mpich -I /home/michel/asuca/hybrid/Nusdas13/src -I //home/michel/lib/netcdf3/include -DGPU

Hi Michel,

I’m not really sure what’s going on, but I’ll take some guesses.

What data type is “jd_pf”? If it’s a pointer, I’m wondering if the compiler doesn’t know its size and so isn’t copying the whole array over. What do the compiler feedback messages (-Minfo=accel) say about this line?

  • Mat

Hey Mat

That would make sense, yes. jd_pf is a module import, so it lands in the subroutine in question using

use metrics, only: jd_pf

In metrics, it is declared at module level using

real(rp), public, allocatable :: jd_pf(:,:,:)

and allocated later using

allocate( jd_pf( nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx ))

.

The OpenACC data region basically just specifies

!$acc enter data copyin(jd_pf)

, which is executed after the allocation. In general, the runtime should know at that point, what size jd_pf is, right? I’ve never tried to oversteer that, but I guess I will try this next.
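A minimal sketch of what spelling out the bounds could look like, in case it helps anyone trying the same thing. The program name, the bound values and the rp kind are placeholders for illustration and are not taken from the real code; only the jd_pf declaration and the directive names follow the snippets above.

```fortran
program explicit_bounds_sketch
  implicit none
  ! Assumption: rp is double precision; the real code takes it from nrtype.
  integer, parameter :: rp = kind(1.0d0)
  ! Hypothetical bounds; the real code takes these from module prm.
  integer, parameter :: nx_mn = 0, nx_mx = 9
  integer, parameter :: ny_mn = 0, ny_mx = 5
  integer, parameter :: nz_mn = 1, nz_mx = 4
  real(rp), allocatable :: jd_pf(:,:,:)

  allocate( jd_pf( nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx ))
  jd_pf = 1.0e-6_rp

  ! Same data directives as before, but with the array section written out
  ! so the transfer size does not depend on what the compiler infers.
  !$acc enter data copyin(jd_pf(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx))
  !$acc update host(jd_pf(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx))
  print *, "shape of jd_pf:", shape(jd_pf)
  print *, "jd_pf at upper x bound:", jd_pf( nx_mx, ny_mn, nz_mn )
  !$acc update device(jd_pf(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx))
  !$acc exit data delete(jd_pf)
end program explicit_bounds_sketch
```

If the explicit section changes nothing compared to the plain copyin(jd_pf), the array size is probably not what is going wrong.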

It’s getting weirder. Now, in the same codebase using 15.10, calling the following subroutine the second time…

 subroutine diag_u( vel_x, vel_y, vel_z, mom_x_v, mom_y_v, mom_z_v, dens_ptb )
  use openacc
  use cudafor
  use nrtype, only : rp
  use prm, only : nz_mn, nz_mx, nx_mn, nx_mx, ny_mn, ny_mx
  use ref, only : dens_ref_f, dens_ref_f_x, dens_ref_f_y
  use metrics, only : jd_uf, jd_vf, jd_ph
  implicit none
  real(rp), intent(out):: vel_x(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx)
  real(rp), intent(out):: vel_y(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx)
  real(rp), intent(out):: vel_z(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx)
  real(rp), intent(in):: mom_x_v(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx)
  real(rp), intent(in):: mom_y_v(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx)
  real(rp), intent(in):: mom_z_v(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx)
  real(rp), intent(in):: dens_ptb(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx)
  real(rp) :: dens_x(nx_mn:nx_mx-1, ny_mn:ny_mx, nz_mn:nz_mx)
  real(rp) :: dens_y(nx_mn:nx_mx, ny_mn:ny_mx-1, nz_mn:nz_mx)
  real(rp) :: dens_z(nx_mn:nx_mx, ny_mn:ny_mx, nz_mn:nz_mx)
  integer(4) :: k
  real(8) :: hf_output_temp
  integer(4) :: i, j
  integer(4) :: hf_symbols_are_device_present
  hf_symbols_are_device_present = acc_is_present(vel_z)
!$acc enter data create(dens_z), create(dens_y), create(dens_x) if(hf_symbols_are_device_present)
!$acc exit data delete(dens_z), delete(dens_y), delete(dens_x) if(hf_symbols_are_device_present)
end subroutine diag_u

… causes

FATAL ERROR: variable in data clause was already present on device 1: name=dens_x

.

This call comes right after the previously reported call that prints corrupted data, only this time with explicitly stated array sizes for the jd_pf update directives (which doesn’t change anything in what’s printed, btw). Something seems quite broken.

In general, the runtime should know at that point, what size jd_pf is, right?

Yes, since it’s an allocatable it should have an F90 descriptor with the rank and size information. If it were a pointer or passed as an F77 argument, then this information could be lost, but that doesn’t appear to be the case here.

!$acc enter data copyin(jd_pf)

What does the -Minfo=accel compiler feedback message say is being copied? What are the messages when you do the update?

use metrics, only: jd_pf

What happens if you take off the “only” and use the whole module?

This call comes right after the previously reported call that prints corrupted data, only this time with explicitly stated array sizes for the jd_pf update directives (which doesn’t change anything in what’s printed, btw). Something seems quite broken.

I’m not sure what’s happening here. Maybe there’s something wrong with the present table? Maybe there’s data corruption in your program?

Can you send me the code and a dataset so I can investigate?

  • Mat

Hi Michel,

I just investigated an issue from another user (http://www.pgroup.com/userforum/viewtopic.php?t=5035&start=5) where values printed from the GPU were getting converted incorrectly for certain hex values. The data itself was fine, but what was printed was incorrect.

I’m wondering if something similar is occurring here? Can you try performing a sum reduction of jd_pf on the host and device? If the sums match, then there’s a good chance that this is the same conversion problem with “print” and not your data.
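Something along these lines could serve as the check. It is only a sketch: the program name, the array extents, the fill value and the tolerance are made up for illustration, and rp is assumed to be double precision. In the real code the two sums would go right where the print statements are now.

```fortran
program compare_host_device_sum
  implicit none
  ! Assumption: rp is double precision; the real code takes it from nrtype.
  integer, parameter :: rp = kind(1.0d0)
  ! Hypothetical extents, chosen only to keep the example small.
  integer, parameter :: nx = 16, ny = 8, nz = 4
  real(rp), allocatable :: jd_pf(:,:,:)
  real(rp) :: host_sum, device_sum
  integer :: i, j, k

  allocate( jd_pf( nx, ny, nz ))
  jd_pf = 1.0e-6_rp
  !$acc enter data copyin(jd_pf)

  ! Reference sum over the host copy.
  host_sum = sum( jd_pf )

  ! Sum over the device copy; only the scalar result comes back to the host,
  ! so no formatted print from inside a kernel is involved.
  device_sum = 0.0_rp
  !$acc parallel loop collapse(3) reduction(+:device_sum) present(jd_pf)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        device_sum = device_sum + jd_pf( i, j, k )
      end do
    end do
  end do

  !$acc exit data delete(jd_pf)

  print *, "host sum:  ", host_sum
  print *, "device sum:", device_sum
  ! Relative tolerance, because the reduction order differs on the device.
  if ( abs( host_sum - device_sum ) > 1.0e-12_rp * abs( host_sum )) then
    print *, "sums differ: the device copy does not match the host copy"
  else
    print *, "sums match: the data is probably fine and the print is suspect"
  end if
end program compare_host_device_sum
```

Both sums are printed on the host, so a mismatch here would point at the data rather than at the formatting of values printed from the GPU.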

  • Mat