Help figuring out why acc parallel does not work

Hi,

I’ve been trying to use OpenACC with a simple code, which you can find (tar’ed) at https://goo.gl/XRrXR6.

As it is, the parallelization is done with !$acc kernels, and the results of the non-OpenACC and OpenACC versions are identical:

[angelv@deimos]$ source comp.sh
[angelv@deimos]$ diff zcs.cpu zcs.acc
[angelv@deimos]$ 
[...]

Because I want more control over how the loops in the subroutine zcs are parallelized, I tried to remove the !$acc kernels directive and replace it with !$acc parallel and !$acc loop. The changes are minimal:

[angelv@deimos]$ diff rii.f90 rii_parallel.f90 
36c36,37
<     !$acc kernels
---
>     !$acc parallel
>     !$acc loop private(k,kp,k2,km,kp2,z0,q,q2)
47c48
< 
---
>              !$acc loop private(mu2,ml2,p2,z1,mup2,pp2,z2,mlp2,ps2,pt2,z3,z4,z5,z6,z7)
79c80
<     !$acc end kernels
---
>     !$acc end parallel
[angelv@deimos]$

But if I use rii_parallel.f90 instead of rii.f90, the results in the zc matrix are very different from those of both previous versions. I guess I don’t fully understand how to nest several loops, and/or what private really does. Any suggestions to help me properly understand what is going on?
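For context, the general pattern I’m trying to follow is the one below. This is a minimal sketch with made-up loop bounds and a made-up body, not the real zcs code: the idea is that any scratch scalar written inside an iteration goes in a private clause so each gang/vector lane gets its own copy, while loop induction variables are private automatically.

```fortran
! Minimal sketch of the nesting I'm attempting (not the real zcs code).
! Without -acc the !$acc lines are plain comments, so this also builds serially.
SUBROUTINE sketch(zc, n, m)
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: n, m
  REAL, INTENT(INOUT)  :: zc(n, m)
  INTEGER :: k, j
  REAL    :: z0                    ! scratch scalar written per iteration: private
  !$acc parallel
  !$acc loop private(z0)           ! outer loop
  DO k = 1, n
     z0 = REAL(k)
     !$acc loop                    ! inner loop; j is private automatically
     DO j = 1, m
        zc(k, j) = z0 + REAL(j)
     END DO
  END DO
  !$acc end parallel
END SUBROUTINE sketch
```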

Thanks,
AdV

Hi,

OK, after spending too much time on this, I found something very weird. If, in the rii_parallel.f90 file from the previous post, I substitute inside the !$acc parallel region all appearances of ju2 with the number 3, and all appearances of jl2 with the number 1 (which are the values passed to the subroutine zcs), then the code works again as expected and both the OpenACC and serial versions give the same result.

ju2 and jl2 are just integers (defined in globals.f90), moved to the device before calling zcs with:

!$acc data copy(zc) copyin(ju2,jl2)
!$acc update device(fact)
CALL zcs(zc,kmin,kmax,ju2,jl2)
!$acc end data
PRINT*, zc

So, it looks like those two variables are somehow not defined properly on the device. To me this looks like a bug in the compiler. Is there any way to work around it?

Thanks,
AdV

Hi AdV,

The core problem is that you’re passing “ju2” and “jl2” by reference (the default in Fortran) to “W3JS”. Since these are loop-bound variables, the compiler must assume that their values could change. In turn, this prevents the inner loops from being parallelized, since the loop trip count is not known.

The simple fix is to pass these variables by value:

  FUNCTION W3JS(J1,J2,J3,M1,M2,M3)
    !$acc routine seq
    IMPLICIT NONE

    INTEGER, VALUE, INTENT(IN) :: J1, J2, J3, M1, M2, M3
!    INTEGER, INTENT(IN) :: J1, J2, J3, M1, M2, M3
    INTEGER :: IA, IB, IC, ID, IE, IF, IG, IH

-Mat

Hi Mat,

adding VALUE to the input variables in function W3JS does indeed solve the problem, but I don’t understand why…

  • You say “this prevents the inner loops from being parallelized since the loop trip count is not known.” OK, but I would expect the results to still be correct if the inner loops cannot be parallelized. I don’t just get slower code, I get incorrect code, which is obviously more worrying.

  • The output when compiling (pgf90 -ta=tesla:cc60,time,lineinfo -acc -Minfo=all,ccff -Mneginfo=all) is exactly the same in both cases (pgf90 18.3-0 64-bit target on x86-64 Linux -tp haswell). Is there any way for the compiler to give information about this type of thing?

  • Somehow I had assumed that declaring the arguments of W3JS as INTENT(IN) should be enough. Is there any document where I could understand why I also have to add VALUE (to pass the arguments by value) in order to get correct results? To my understanding it should make no difference, since ju2 and jl2 never change: they are used to define the bounds of some of the loops, but the variables themselves never change value, so they are read-only. Where does the race condition come from?

Thanks,
AdV

Somehow I had assumed that having the arguments in W3JS declared as INTENT(IN) should be enough.

Technically it should, but we don’t keep the intent information in the module, so the compiler doesn’t know that the data is read-only. I’ve added an RFE (TPR#25525) to see if we can use this information in this case. For now, please use the “VALUE” attribute.
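To illustrate what VALUE changes at the language level, here is a plain-Fortran sketch (unrelated to the original code): with VALUE the dummy argument is the callee’s own copy, so nothing the callee does, and no aliasing with the actual argument, can alter it. That is exactly the guarantee the compiler needs to treat a loop trip count as fixed.

```fortran
PROGRAM value_demo
  IMPLICIT NONE
  INTEGER :: n
  n = 3
  PRINT *, double_val(n), n   ! n is unchanged in the caller
CONTAINS
  INTEGER FUNCTION double_val(j)
    INTEGER, VALUE :: j   ! j is a local copy of the actual argument
    j = j + 1             ! legal with VALUE; would be invalid with INTENT(IN)
    double_val = 2*j
  END FUNCTION double_val
END PROGRAM value_demo
```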

The output when compiling (pgf90 -ta=tesla:cc60,time,lineinfo -acc -Minfo=all,ccff -Mneginfo=all) is exactly the same in both cases (pgf90 18.3-0 64-bit target on x86-64 Linux -tp haswell). Is there any way for the compiler to give information about these type of things?

Sorry, but I’m not clear on what you’re asking. Can you please restate the question?

-Mat

OK, though I still don’t understand why it is an issue if they are read-only variables…

Just wondering if there is some way for the compiler to warn about the danger of not putting VALUE there. Given that I don’t yet understand why it works, and that the compiler output is identical with or without the VALUE attribute, I can see myself falling into the same trap at a later time, so if the compiler is able to provide any help, it would be great.

Many thanks,
AdV

OK, though I still don’t understand why it is an issue if they are read-only variables…

Because this information isn’t kept in the module, the compiler doesn’t know that the variables are read-only when calling the routine. The RFE is to see if we can keep the intent information in the module for later use.

I can see myself falling into the same trap at a later time, so if the compiler is able to provide any help, it would be great.

Unfortunately, the compiler can’t really tell you how to fix your code to make it parallelizable, only that it’s unable to parallelize it.