Wrong results when using the private clause with PGI 12.6

Hi,

When running OpenACC code compiled with PGI 12.6 I am getting wrong results. After some investigation it seems to be related to the private clause. I have been able to reproduce the problem in the following test code:

program main
  implicit none
  integer*4, parameter :: ireals=8 
  integer*4 :: N,nlev,ip,k,kspec
  real*8, allocatable :: sohr_ref(:,:),thhr_ref(:,:)
  real*8, allocatable :: sohr(:,:),thhr(:,:),zfls(:,:),zflt(:,:),zsmu0(:),dp0(:,:)
  real*8 :: cp_d,zfac,g,zepemu

  N=1E3
  nlev=4

allocate(sohr_ref(N,nlev),thhr_ref(N,nlev)) 
allocate(sohr(N,nlev),thhr(N,nlev),zfls(N,nlev+1),zflt(N,nlev+1),zsmu0(N),dp0(N,nlev))

!----------------------------------
!init
cp_d     = 1005.0_ireals
g        =     9.80665_ireals
zepemu = 1.0E-9_ireals  !
DO  k = 1, nlev 
   DO  ip = 1,N
      zfls(ip,k)=8.0_ireals*cos(6.28_ireals*REAL(ip+k,ireals)/REAL(N+nlev,ireals))
      zflt(ip,k)=6.0_ireals*cos(6.28_ireals*REAL(ip+k,ireals)/REAL(N+nlev,ireals))
      dp0(ip,k) = 1015.0_ireals*cos(6.28_ireals*REAL(ip+k,ireals)/REAL(N+nlev,ireals))
   END DO
END DO

DO  ip = 1,N
   zfls(ip,nlev+1)=8.0_ireals*cos(6.28_ireals*REAL(ip+nlev+1,ireals)/REAL(N+nlev+1,ireals))
   zflt(ip,nlev+1)=6.0_ireals*cos(6.28_ireals*REAL(ip+nlev+1,ireals)/REAL(N+nlev+1,ireals))
   zsmu0(ip)=cos(6.28_ireals*REAL(ip,ireals)/REAL(N,ireals))
END DO
!----------------------------------
!1: compute on cpu
DO  k = 1, nlev
   DO  ip = 1,N
      zfac = g/(cp_d*dp0 (ip,k))
      sohr_ref(ip,k) = 0.0
      IF (zsmu0(ip) > zepemu) THEN
         sohr_ref(ip,k) = zfac * (zfls(ip,k)-zfls(ip,k+1))
      ENDIF
      thhr_ref(ip,k)   = zfac * (zflt(ip,k)-zflt(ip,k+1))
   ENDDO
END DO


!----------------------------------
!2: compute on gpu without private
!$acc data create(sohr,thhr,dp0,zfls,zflt,zsmu0)
!$acc update device(dp0,zfls,zflt,zsmu0)
!$acc parallel 
DO  k = 1, nlev
   !$acc loop gang vector
   DO  ip = 1,N
      zfac = g/(cp_d*dp0 (ip,k))
      sohr(ip,k) = 0.0
      IF (zsmu0(ip) > zepemu) THEN
         sohr(ip,k) = zfac * (zfls(ip,k)-zfls(ip,k+1))
      ENDIF
      thhr(ip,k)   = zfac * (zflt(ip,k)-zflt(ip,k+1))
   ENDDO
END DO
!$acc end parallel 
!$acc update host(sohr,thhr)
!$acc end data

print*, 'Max Diff without private CPU/GPU: sohr', maxval(abs(sohr_ref-sohr))
print*, 'Max Diff without private CPU/GPU: thhr', maxval(abs(thhr_ref-thhr))

!----------------------------------
!3: compute on gpu using private

!$acc data create(sohr,thhr,dp0,zfls,zflt,zsmu0)
!$acc update device(dp0,zfls,zflt,zsmu0)
!$acc parallel private(ip,k,zfac)
DO  k = 1, nlev
   !$acc loop gang vector
   DO  ip = 1,N
      zfac = g/(cp_d*dp0 (ip,k))
      sohr(ip,k) = 0.0
      IF (zsmu0(ip) > zepemu) THEN
         sohr(ip,k) = zfac * (zfls(ip,k)-zfls(ip,k+1))
      ENDIF
      thhr(ip,k)   = zfac * (zflt(ip,k)-zflt(ip,k+1))
   ENDDO
END DO
!$acc end parallel 
!$acc update host(sohr,thhr)
!$acc end data

print*, 'Max Diff with private CPU/GPU: sohr', maxval(abs(sohr_ref-sohr))
print*, 'Max Diff with private CPU/GPU: thhr', maxval(abs(thhr_ref-thhr))


end program main

Compiling and running shows different results when the private clause is used:

> pgf90 -acc -o test_private_12_6 test_private_12_6.f90
> ./test_private_12_6
 Max Diff without private CPU/GPU: sohr    0.000000000000000     
 Max Diff without private CPU/GPU: thhr    0.000000000000000     
 Max Diff with private CPU/GPU: sohr   6.1529741284087325E-004
 Max Diff with private CPU/GPU: thhr   4.6147305963065491E-004

Xavier

Hi Xavier,

A private clause on a parallel construct declares that one copy of each listed variable is created for each gang. This means that all the vector lanes within a gang share that same copy. In your case, every vector lane is using the same “zfac” variable and hence clobbering the others’ values.

Since scalars are privatized by default, I recommend not explicitly privatizing them unless absolutely necessary (such as when you’re getting a live-out error). Granted, this is very confusing, especially since OpenMP users are used to privatizing scalars. So I’m wondering if the compiler should just “do the right thing” in this context and make the scalar a local variable in the kernel (essentially ignoring the private clause). I’ll send the question off to Michael.
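
In other words, the loop in your third variant (private on the parallel construct) behaves roughly like this sketch, with the effect of the clause annotated in comments:

!$acc parallel private(zfac)   ! one zfac is allocated per gang
DO  k = 1, nlev
   !$acc loop gang vector
   DO  ip = 1,N
      zfac = g/(cp_d*dp0(ip,k))                      ! every vector lane of the gang writes this single zfac ...
      sohr(ip,k) = zfac * (zfls(ip,k)-zfls(ip,k+1))  ! ... and may read back a value written by another lane
   ENDDO
END DO
!$acc end parallel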

  • Mat

Hi Mat,

Thanks for the clarification. I didn’t see, when reading the OpenACC specification, that private has a different meaning on the parallel construct than on the loop construct.

Also, I think the documentation is somewhat confusing; p. 10 of the OpenACC 1.0 spec says:

A scalar variable referenced in the parallel construct that does not appear in a data clause for the construct or any enclosing data construct will be treated as if it appeared in a private clause (if not live-in or live-out) or a copy clause for the parallel construct.

This is why I thought that the private clause on the parallel construct would work.

In case I need to privatize a scalar (private to each worker), should I then add the private clause to the “!$acc loop” directive?

Xavier

Hi Mat,
I’m actually running into such an error or warning (I’m not sure what it really is), so I’d like to ask what the best practice is; I have a loop like this:
!$acc parallel
!$acc loop
DO j = 1, je
   !$acc loop
   DO i = 1, ie
      zaiu (i,j) = c0
      zaiv (i,j) = c0
      zrotv (i,j) = c0
   ENDDO
ENDDO
!$acc end parallel

Here c0 is a constant; this is basically an initialization. The first line of the snippet is line 4612 in my code.
Compiling like that, I get the following messages from the compiler (12.6):

4612, Accelerator kernel generated
4612, CC 2.0 : 22 registers; 0 shared, 104 constant, 0 local memory bytes
4614, !$acc loop gang ! blockidx%x
4612, Generating copyout(zaiu(1:ie,1:je))
Generating copyout(zaiv(1:ie,1:je))
Generating copyout(zrotv(1:ie,1:je))
Generating compute capability 2.0 binary
4614, Accelerator restriction: scalar variable live-out from loop: i
4617, Accelerator restriction: induction variable live-out from loop: j
Accelerator restriction: induction variable live-out from loop: i
4618, Accelerator restriction: induction variable live-out from loop: j
Accelerator restriction: induction variable live-out from loop: i
4619, Accelerator restriction: induction variable live-out from loop: j
Accelerator restriction: induction variable live-out from loop: i
4620, Accelerator restriction: induction variable live-out from loop: i
4621, Accelerator restriction: induction variable live-out from loop: j

The only combination where I don’t get such messages is:

!$acc parallel
!$acc loop private (i,j)
DO j = 1, je
   !$acc loop
   DO i = 1, ie
      zaiu (i,j) = c0
      zaiv (i,j) = c0
      zrotv (i,j) = c0
   ENDDO
ENDDO
!$acc end parallel

I get (skipping copy-out):
4612, Accelerator kernel generated
4612, CC 2.0 : 18 registers; 0 shared, 120 constant, 0 local memory bytes
4614, !$acc loop gang ! blockidx%x
4616, !$acc loop vector(256) ! threadidx%x

But I’m not sure it is correct that “i” has to be declared private on the outer loop. Does that mean that “i” is privatized per gang? I would want “i” to be private to each worker, no?
So I thought the most logical solution would be:
!$acc parallel
!$acc loop private (j)
DO j = 1, je
   !$acc loop private (i)
   DO i = 1, ie
      zaiu (i,j) = c0
      zaiv (i,j) = c0
      zrotv (i,j) = c0
   ENDDO
ENDDO
!$acc end parallel
but in this case I get:
4612, Accelerator kernel generated
4612, CC 2.0 : 14 registers; 0 shared, 120 constant, 0 local memory bytes
4614, !$acc loop gang ! blockidx%x
4616, !$acc loop vector(256) ! threadidx%x
4612, Generating copyout(zaiu(1:ie,1:je))
Generating copyout(zaiv(1:ie,1:je))
Generating copyout(zrotv(1:ie,1:je))
Generating compute capability 2.0 binary
4614, Accelerator restriction: scalar variable live-out from loop: i
4616, Loop is parallelizable

Why do I get a live-out for “i” on the directive of the outer loop over “j”? And what is the best practice?
Thank you
Tiziano

Hi Xavier,

In case I need to privatize a scalar (private to each worker), should I then add the private clause to the “!$acc loop” directive?

Yes, but I don’t like this either and agree that the OpenACC spec needs to be cleaned up. Michael is aware and will bring it up with the OpenACC committee.
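
For your test case that would look something like the following sketch (the clause is moved from the parallel construct to the loop directive; as noted above, zfac would be treated this way by default even without the clause):

!$acc parallel
DO  k = 1, nlev
   !$acc loop gang vector private(zfac)   ! zfac now gets a private copy per loop iteration
   DO  ip = 1,N
      zfac = g/(cp_d*dp0(ip,k))
      sohr(ip,k) = 0.0
      IF (zsmu0(ip) > zepemu) THEN
         sohr(ip,k) = zfac * (zfls(ip,k)-zfls(ip,k+1))
      ENDIF
      thhr(ip,k)   = zfac * (zflt(ip,k)-zflt(ip,k+1))
   ENDDO
END DO
!$acc end parallel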

  • Mat

Hi Tiziano,

But I’m not sure it is correct that “i” has to be declared private on the outer loop. Does that mean that “i” is privatized per gang? I would want “i” to be private to each worker, no?

Loop index variables are treated a bit differently, since a single copy shared across the gang wouldn’t work: the index of a parallelized loop is always private to each iteration. So in this particular case, it will work.

Why do I get a live-out for “i” on the directive of the outer loop over “j”? And what is the best practice?

It’s either a compiler error, or the values of i and j are used after the end of the accelerator region. What does the code after the region look like?
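
For example, something like this (the line after the region is hypothetical, not from your code) would make i and j live-out:

!$acc parallel
!$acc loop
DO j = 1, je
   !$acc loop
   DO i = 1, ie
      zaiu (i,j) = c0
   ENDDO
ENDDO
!$acc end parallel
print*, i, j   ! any use of i or j after the region forces their final values to be kept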

  • Mat

I have other loops that use i and j again, but I have a parallel region for every loop, because the loop counts often differ.
So I don’t understand why the compiler gives me these messages, especially because I have a very similar situation in other routines where I don’t get them. My feeling is that for some unknown reason the compiler starts to see wrong connections, and this is not easy to reproduce in an example. The routine where it happens is the last one in my file; could it help if I copy-pasted it into a new file?

My feeling is that for some unknown reason the compiler starts to see wrong connections, and this is not easy to reproduce in an example. The routine where it happens is the last one in my file; could it help if I copy-pasted it into a new file?

It’s possible that the compiler is getting confused, or there may be some other reason. If you can post a reproducing example, that would be great. If it’s too large to post, please send the example to PGI Customer Service (trs@pgroup.com) and ask them to send it to me.

Thanks,
Mat

Hi,
I found the problem: there was a routine contained in the routine where these strange messages appeared, and the contained routine used the same i and j indices. My bet is that the programmer assumed it would always be inlined, but I guess this confused the compiler.
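
Schematically it looked like this (a heavily reduced sketch with made-up names):

SUBROUTINE outer_routine(zaiu, ie, je)
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: ie, je
   REAL*8, INTENT(INOUT) :: zaiu(ie,je)
   REAL*8, PARAMETER :: c0 = 0.0d0
   INTEGER :: i, j
   !$acc parallel
   !$acc loop
   DO j = 1, je
      !$acc loop
      DO i = 1, ie
         zaiu(i,j) = c0
      ENDDO
   ENDDO
   !$acc end parallel
CONTAINS
   SUBROUTINE inner_routine()
      ! reuses the host-associated i and j, so the compiler has to
      ! assume their values may still be needed after the loops above
      DO j = 1, je
         DO i = 1, ie
            zaiu(i,j) = zaiu(i,j) + 1.0d0
         ENDDO
      ENDDO
   END SUBROUTINE inner_routine
END SUBROUTINE outer_routine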
Best regards
Tiziano

Hi Mat,

I have the same problem discussed in this thread, and I would like to know what you consider the best solution.

I am using PGI 12.8 (12.6 behaves the same, while 12.5 works for this part) and the PGI Accelerator model. I need to privatize some scalars because I get a live-out message (the compiler gets this wrong, but I think that is understandable because the program is complex). Since the loops are nested, if I privatize the scalars on the external loop the results are wrong, probably because the privatization happens at the block level. If I privatize them on the internal loop the results seem to be correct (see the sketch after this list), but I wonder whether:
a) I am always safe privatizing on the innermost parallelized loop of the nest;
b) this strategy is efficient in terms of performance, or whether it is better to check each loop and privatize as little as possible (quite laborious to do).
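
To be concrete, by privatizing on the internal loop I mean something like this (a reduced sketch with made-up array and scalar names):

!$acc parallel
!$acc loop gang
DO j = 1, je
   !$acc loop vector private(ztmp)   ! ztmp gets one copy per vector lane
   DO i = 1, ie
      ztmp = za(i,j) * zb(i,j)
      zc(i,j) = ztmp + zd(i,j)
   ENDDO
ENDDO
!$acc end parallel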

My personal view is that “private” should be:
a) local to the kernel when no loop specification (gang, worker, …, or the equivalent in PGI Accelerator) is given;
b) scoped according to the loop specification only when it is explicitly given by the programmer.

Thanks for the help.
Bye, Francesco