Unknown reason for sequential execution

Hi everyone,

I do not understand why the jl loop is executed sequentially with PGI 17.10 while it gets parallelized with the Cray compiler. We now use the second version below, which works with both compilers, but it would still be interesting to know the reason.

    !$acc parallel                                                              
    !$acc loop seq                                                              
    DO jk = itop,klevm1                                                         
      ztest = 0._wp                                                             
      !$acc loop gang vector reduction(+:ztest)                                 
      DO jl = 1,kproma                                                          
        ptke(jl,jk) = bb(jl,jk,itke) + tpfac3*pztkevn(jl,jk)                    
        ztest = ztest+MERGE(1._wp,0._wp,ptke(jl,jk)<0._wp)                      
      END DO                                                                    
      IF(ztest.NE.0._wp) THEN
        EXIT
      ENDIF
    END DO                                                                      
    !$acc end parallel



    ztest = 0._wp                                                               
    !$acc parallel                                                              
    !$acc loop seq                                                              
    DO jk = itop,klevm1                                                         
      !$acc loop gang vector reduction(+:ztest)                                 
      DO jl = 1,kproma                                                          
        ptke(jl,jk) = bb(jl,jk,itke) + tpfac3*pztkevn(jl,jk)                    
        ztest = ztest+MERGE(1._wp,0._wp,ptke(jl,jk)<0._wp)                      
      END DO                                                                    
    END DO                                                                      
    !$acc end parallel                                                          
                                                                                
    IF(ztest.NE.0._wp) THEN                                                     
      CALL finish('vdiff_tendencies','TKE IS NEGATIVE')                         
    ENDIF

Thank you for your answer.

Hi te85kibe,

I believe the problem with the first example is that you have a reduction on the inner gang loop. Since there’s no global synchronization across gangs, gang-level reductions don’t get updated until after the parallel region ends. Hence, you can’t use “ztest” after the “jl” loop and still have the inner loop use a “gang”.
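For example, one conforming way to keep the early exit (a rough, untested sketch) is to launch a separate parallel region for each “jk” iteration, so the reduced “ztest” is back on the host before the test:

    DO jk = itop,klevm1
      ztest = 0._wp
      ! the reduction completes when this region ends, so ztest
      ! is valid on the host immediately after the inner loop
      !$acc parallel loop gang vector reduction(+:ztest)
      DO jl = 1,kproma
        ptke(jl,jk) = bb(jl,jk,itke) + tpfac3*pztkevn(jl,jk)
        ztest = ztest+MERGE(1._wp,0._wp,ptke(jl,jk)<0._wp)
      END DO
      IF(ztest.NE.0._wp) EXIT
    END DO

The trade-off is one kernel launch per “jk” iteration.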

As for Cray, it’s unclear what they’re doing to parallelize the inner loop and still get correct answers since the code doesn’t conform. Maybe they’re ignoring “gang” and just running across the vectors?

Another thing to keep in mind is that all code between the start of the parallel region and the gang loop is run in gang-redundant mode. Hence, every gang will be running every iteration of the “jk” loop, which in this case will lead to extra computation. Is there a reason why in the second example you’re not parallelizing the “jk” loop?
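Something like this untested sketch is what I have in mind; collapsing both loops gives the scheduler all of the iterations and needs only a single reduction:

    ztest = 0._wp
    !$acc parallel loop collapse(2) gang vector reduction(+:ztest)
    DO jk = itop,klevm1
      DO jl = 1,kproma
        ptke(jl,jk) = bb(jl,jk,itke) + tpfac3*pztkevn(jl,jk)
        ztest = ztest+MERGE(1._wp,0._wp,ptke(jl,jk)<0._wp)
      END DO
    END DO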

-Mat

Thank you for your fast reply. This is the output I got from the Cray compiler:

    DO jk = itop,klevm1
ftn-6412 crayftn: ACCEL VDIFF_TENDENCIES, File = ../../../src/mo_vdiff_solver.f90, Line = 887 
  A loop starting at line 887 will be redundantly executed.

      DO jl = 1,kproma
ftn-6430 crayftn: ACCEL VDIFF_TENDENCIES, File = ../../../src/mo_vdiff_solver.f90, Line = 890 
  A loop starting at line 890 was partitioned across the threadblocks and the 128 threads within a threadblock.

The outer loop has only 47 iterations whereas the jl loop has more than 80,000 iterations. In my experience, a gang parallelization on the outer loop is much slower than parallelizing only the inner loop.

Does it make sense to put the “!$acc parallel” inside the outer loop to avoid redundant execution, or is the overhead of creating a new parallel region in each iteration too big?

> Does it make sense to put the “!$acc parallel” inside the outer loop to avoid redundant execution, or is the overhead of creating a new parallel region in each iteration too big?

I would assume that it’s better to put the parallel region around the inner loop and not offload the outer loop. As long as you have an outer data region so the arrays aren’t copied over each time, the overhead won’t be bad.
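Roughly like this (a sketch; the data clauses are my assumption about which arrays are inputs and which are outputs):

    !$acc data copyin(bb,pztkevn) copy(ptke)
    ztest = 0._wp
    DO jk = itop,klevm1    ! runs on the host
      !$acc parallel loop gang vector reduction(+:ztest)
      DO jl = 1,kproma
        ptke(jl,jk) = bb(jl,jk,itke) + tpfac3*pztkevn(jl,jk)
        ztest = ztest+MERGE(1._wp,0._wp,ptke(jl,jk)<0._wp)
      END DO
    END DO
    !$acc end data

Each parallel region adds its partial sum into “ztest”, so the host-side check after the loop still works.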

-Mat